論文Abstract100本ノック#15 - 十の並列した脳

前回↓

ryosuke-okubo.hatenablog.com

71 SoundNet（2016）

原文：

SoundNet: Learning Sound Representations from Unlabeled Video

Abstract：

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild.

語彙：

in the wild

訳：

実環境（in the wild）で収集されたラベルなしの大量の音データを活用して，自然音の豊かな表現を学習する。

We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos.

語彙：

synchronization

訳：

我々は200万本のラベルなしビデオを用い，視覚と音の自然な同期を活用して音響表現を学習する。

Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound.

訳：

ラベルのないビデオには，大規模かつ低コストで取得でき，それでいて自然音に関する有用な信号を含んでいるという利点がある。

We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge.

訳：

我々は，ラベルなしビデオを橋渡しとして用い，確立された視覚認識モデルの識別的な視覚知識を音のモダリティへ転移する，student-teacher方式の学習手順を提案する。
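As a rough illustration of the student-teacher idea, the sketch below computes a KL-divergence loss between a teacher's class distribution (a vision model looking at the video frames) and a student's distribution (a sound network listening to the same clip). KL divergence is one common choice for this kind of loss; the function names and the three-class toy logits are assumptions made for the example, not the authors' implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """Loss that is small when the student reproduces the teacher's distribution."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return kl_divergence(teacher_probs, student_probs)

# Hypothetical logits over three visual categories for one video clip.
teacher_logits = [2.0, 0.5, -1.0]   # from the (frozen) vision model
good_student = [1.9, 0.6, -1.1]     # sound network that roughly agrees
bad_student = [-1.0, 0.5, 2.0]      # sound network that disagrees

loss_good = distillation_loss(teacher_logits, good_student)
loss_bad = distillation_loss(teacher_logits, bad_student)
```

No human labels appear anywhere: the only supervision is the teacher's output on the visual stream, which is why unlabeled video is enough.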

Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification.

訳：

我々の音響表現により，音響シーン/オブジェクト分類の標準ベンチマークでの最新の結果よりも大幅にパフォーマンスが向上する。

Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.

語彙：

emerge

訳：

可視化の結果から，正解ラベル（ground truth labels）なしで学習されているにもかかわらず，サウンドネットワークにいくつかの高レベルのセマンティクスが自動的に現れることが示唆される。

72 LPCNet（2018）

原文：

LPCNet: Improving Neural Speech Synthesis Through Linear Prediction

Abstract：

Neural speech synthesis models have recently demonstrated the ability to synthesize high quality speech for text-to-speech and compression applications.

語彙：

text-to-speech

訳：

ニューラル音声合成モデルは，最近テキスト読み上げおよび圧縮アプリケーション向けに高品質の音声を合成する能力を実証している。

These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications.

語彙：

open the way

訳：

これらの新しいモデルは，リアルタイム動作を実現するために強力なGPUを必要とすることが多いため，その計算量を削減できれば多くの新しいアプリケーションへの道が開かれるだろう。

We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis.

訳：

我々は，線形予測とRNNを組み合わせて音声合成の効率を大幅に向上させるWaveRNNの変種であるLPCNetを提案する。
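To make "linear prediction" concrete: each sample is predicted as a weighted sum of the previous samples, so only the leftover residual (the excitation) has to be modeled by something more expensive such as a recurrent network. The sketch below uses made-up second-order coefficients and a toy excitation; it is not LPCNet's code, and how the coefficients are estimated is left out.

```python
def lpc_predict(history, coeffs):
    """Predict the next sample from the most recent ones (history[-1] is newest)."""
    order = len(coeffs)
    recent = history[-order:][::-1]  # newest first, so it pairs with coeffs[0]
    return sum(a * s for a, s in zip(coeffs, recent))

def lpc_residual(signal, coeffs):
    """Analysis: what remains after subtracting the linear prediction."""
    order = len(coeffs)
    residual = []
    for t in range(order, len(signal)):
        residual.append(signal[t] - lpc_predict(signal[:t], coeffs))
    return residual

def lpc_synthesize(seed, residual, coeffs):
    """Synthesis: rebuild the signal sample by sample from prediction + residual."""
    signal = list(seed)
    for e in residual:
        signal.append(lpc_predict(signal, coeffs) + e)
    return signal

# Hypothetical coefficients and excitation, chosen only for illustration.
coeffs = [1.5, -0.7]
excitation = [0.0, 0.1, -0.05, 0.0, 0.02]
signal = lpc_synthesize([1.0, 0.5], excitation, coeffs)
recovered = lpc_residual(signal, coeffs)
```

Because the predictor is a handful of multiply-adds per sample, whatever it explains is work the neural part no longer has to do.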

We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS.

訳：

LPCNetは同じネットワークサイズでWaveRNNよりも大幅に高い品質を達成できること，および3 GFLOPS未満の計算量で高品質のLPCNet音声合成が実現可能であることを示す。

This makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.

訳：

これにより，組み込みシステムや携帯電話などの低電力デバイスにニューラル合成アプリケーションを展開しやすくなる。

73 RawNet（2019）

原文：

RawNet: Fast End-to-End Neural Vocoder

Abstract：

Neural networks based vocoders have recently demonstrated the powerful ability to synthesize high quality speech.

訳：

ニューラルネットワークベースのボコーダーは，最近高品質の音声を合成する強力な能力を実証した。

These models usually generate samples by conditioning on some spectrum features, such as Mel-spectrum.

訳：

これらのモデルは通常，メルスペクトル（Mel-spectrum）などのスペクトル特徴量を条件としてサンプルを生成する。
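For readers unfamiliar with the Mel-spectrum mentioned here: it is a spectrum whose frequency axis is warped to the mel scale, which is roughly linear at low frequencies and logarithmic at high ones. The sketch below uses one widely used conversion formula, 2595 * log10(1 + f / 700); other variants exist, and the 0 to 8000 Hz range and the band count are arbitrary choices for the example.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (one common formula among several)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Band edges equally spaced on the mel scale, returned in Hz.

    n_bands overlapping triangular bands need n_bands + 2 edge points.
    """
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_bands + 2)]

# Assumed analysis range and band count, for illustration only.
edges = mel_band_edges(0.0, 8000.0, 10)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

The band widths in Hz grow with frequency, which is exactly the kind of hand-designed, perception-motivated processing that the following sentence of the abstract refers to.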

However, these features are extracted by using speech analysis module including some processing based on the human knowledge.

訳：

ただし，これらの特徴は人間の知識に基づいた処理を含む音声分析モジュールを使用して抽出される。

In this work, we proposed RawNet, a truly end-to-end neural vocoder, which use a coder network to learn the higher representation of signal, and an autoregressive voder network to generate speech sample by sample.

訳：

本研究では，真にエンドツーエンドなニューラルボコーダーであるRawNetを提案する。RawNetは，コーダー（coder）ネットワークで信号の高次表現を学習し，自己回帰型のボーダー（voder）ネットワークで音声を1サンプルずつ生成する。

The coder and voder together act like an auto-encoder network, and could be jointly trained directly on raw waveform without any human-designed features.

訳：

コーダーとボーダーは合わせてオートエンコーダーネットワークのように機能し，人手で設計された特徴量を一切使わずに，生波形から直接，同時に学習できる。

The experiments on the Copy-Synthesis tasks show that RawNet can achieve the comparative synthesized speech quality with LPCNet, with a smaller model architecture and faster speech generation at the inference step.

訳：

Copy-Synthesisタスクの実験は，RawNetがより小さなモデルアーキテクチャと推論時のより高速な音声生成で，LPCNetに匹敵する合成音声品質を達成できることを示している。

74 CycleGAN-VC（2017）

原文：

Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

Abstract：

We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data.

訳：

我々は，並列データに依存せずに変換元音声から変換先音声へのマッピングを学習できる，並列データフリーのvoice-conversion（VC）手法を提案する。

The proposed method is general purpose, high quality, and parallel-data free and works without any extra data, modules, or alignment procedure.

語彙：

general purpose

訳：

提案手法は汎用かつ高品質で，並列データを必要とせず，追加のデータ，モジュール，アライメント手順なしで機能する。

It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods.

訳：

従来の多くの統計モデルベースのVCメソッドで発生するover-smoothingも回避する。

Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss.

訳：

CycleGAN-VCと呼ぶ我々の手法は，ゲート付きCNNとidentity-mapping lossを備えたcycle-consistent adversarial network（CycleGAN）を用いる。

A CycleGAN learns forward and inverse mappings simultaneously using adversarial and cycle-consistency losses.

訳：

CycleGANはAdversarial LossとCycle Consistency Lossを使用して順方向マッピングと逆方向マッピングを同時に学習する。
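A toy sketch of the cycle-consistency idea: map a source feature to the target domain and back, then penalize the L1 distance to the original (and likewise starting from the target side). The two mappings here are hypothetical scalar scale-and-shift functions standing in for the generators, and the adversarial loss, which needs a discriminator, is omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equal-length feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def make_affine(scale, shift):
    """Build a stand-in 'generator' that applies scale * v + shift to each value."""
    def mapping(features):
        return [scale * v + shift for v in features]
    return mapping

def cycle_consistency_loss(forward, inverse, source_batch, target_batch):
    """L1 error after a round trip in both directions."""
    source_cycle = inverse(forward(source_batch))   # source -> target -> source
    target_cycle = forward(inverse(target_batch))   # target -> source -> target
    return (l1_distance(source_cycle, source_batch)
            + l1_distance(target_cycle, target_batch))

# Hypothetical mappings and unpaired toy data.
forward = make_affine(2.0, 1.0)           # source -> target
exact_inverse = make_affine(0.5, -0.5)    # undoes `forward` exactly
poor_inverse = make_affine(0.5, 0.0)      # forgets the shift

source = [0.0, 1.0, 2.0]
target = [1.0, 3.0, 5.0]

loss_exact = cycle_consistency_loss(forward, exact_inverse, source, target)
loss_poor = cycle_consistency_loss(forward, poor_inverse, source, target)
```

The loss only compares each sample with its own round trip, so the source and target batches never need to be aligned pairs.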

This makes it possible to find an optimal pseudo pair from unpaired data.

訳：

これによりペアになっていないデータから最適な擬似ペアを見つけることができる。

Furthermore, the adversarial loss contributes to reducing over-smoothing of the converted feature sequence.

訳：

さらに，adversarial lossは変換された特徴量系列のover-smoothingの低減に寄与する。

We configure a CycleGAN with gated CNNs and train it with an identity-mapping loss.

訳：

ゲート付きCNNを用いてCycleGANを構成し，identity-mapping lossを用いて学習する。

This allows the mapping function to capture sequential and hierarchical structures while preserving linguistic information.

訳：

これにより，写像関数は言語情報を保持しながら，系列的・階層的な構造を捉えることができる。
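Two small pieces behind these sentences, sketched under assumptions. A gated linear unit (GLU) is a gating mechanism commonly used in gated CNNs: one linear projection is multiplied element-wise by the sigmoid of another. An identity-mapping loss asks the source-to-target mapping to leave an input that is already in the target domain unchanged. Scalar weights replace real convolutions here, and all names and values are illustrative rather than taken from the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def glu(inputs, w, b, v, c):
    """Element-wise gated linear unit: (w*x + b) * sigmoid(v*x + c)."""
    return [(w * x + b) * sigmoid(v * x + c) for x in inputs]

def identity_mapping_loss(source_to_target, target_batch):
    """Mean L1 change when a target-domain input is fed to the source->target mapping."""
    converted = source_to_target(target_batch)
    return sum(abs(o - y) for o, y in zip(converted, target_batch)) / len(target_batch)

# The gate passes large positive inputs almost unchanged and suppresses negative ones.
gated = glu([-2.0, 0.0, 2.0], w=1.0, b=0.0, v=4.0, c=0.0)

# Hypothetical mappings: one keeps target-domain inputs as they are, one alters them.
def keeps_target(ys):
    return list(ys)

def shifts_target(ys):
    return [y + 0.5 for y in ys]

target = [1.0, 3.0, 5.0]
loss_keep = identity_mapping_loss(keeps_target, target)
loss_shift = identity_mapping_loss(shifts_target, target)
```

The gate decides per position how much information flows to the next layer, while the identity term discourages the mapping from altering content it has no reason to change.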

We evaluated our method on a parallel-data-free VC task.

訳：

我々は並列データを用いないVCタスクで提案手法を評価した。

An objective evaluation showed that the converted feature sequence was near natural in terms of global variance and modulation spectra.

訳：

客観的な評価により，変換された特徴シーケンスはグローバル分散と変調スペクトルの点で自然に近いことがわかった。
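Global variance (GV) here refers to the per-dimension variance of a feature sequence over time; over-smoothed conversions tend to show a smaller GV than natural speech. A minimal sketch with made-up two-dimensional feature sequences (the modulation spectrum is not covered):

```python
def global_variance(sequence):
    """Per-dimension variance over time.

    sequence: list of frames, each frame a list of feature values.
    """
    n_frames = len(sequence)
    n_dims = len(sequence[0])
    means = [sum(frame[d] for frame in sequence) / n_frames for d in range(n_dims)]
    return [
        sum((frame[d] - means[d]) ** 2 for frame in sequence) / n_frames
        for d in range(n_dims)
    ]

# Toy data: the second sequence has the same mean but a flattened trajectory.
natural = [[0.0, 1.0], [2.0, -1.0], [0.0, 1.0], [2.0, -1.0]]
over_smoothed = [[0.5, 0.5], [1.5, -0.5], [0.5, 0.5], [1.5, -0.5]]

gv_natural = global_variance(natural)
gv_smoothed = global_variance(over_smoothed)
```

Comparing the GV of converted features with that of natural target speech gives a simple objective check for over-smoothing.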

A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method under advantageous conditions with parallel and twice the amount of data.

語彙：

advantageous

訳：

主観評価では，変換音声の品質が，並列データを用い，かつデータ量が2倍という有利な条件下のガウス混合モデルベースの手法で得られた品質に匹敵することが示された。

75 StarGAN-VC（2018）

原文：

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

Abstract：

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.

訳：

本論文ではStarGANと呼ばれるGANのバリアントを使用して，非並列多対多音声変換を可能にする方法を提案する。

Our method, which we call StarGAN-VC, is noteworthy in that it

(1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training,

(2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network,

(3) is able to generate converted speech signals quickly enough to allow real-time implementations

and (4) requires only several minutes of training examples to generate reasonably realistic-sounding speech.

語彙：

noteworthy

utterances

simultaneously

implementations

訳：

StarGAN-VCと呼ぶ我々の手法は，以下の点で注目に値する。

（1）音声生成器の学習に，並列な発話，書き起こし，時間アライメント手順を必要としない

（2）単一の生成器ネットワークを用いて，異なる属性ドメイン間の多対多のマッピングを同時に学習する

（3）リアルタイム実装が可能なほど高速に変換音声信号を生成できる

（4）それなりに自然に聞こえる音声を生成するのに，数分程度の学習データしか必要としない。

Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than a state-of-the-art method based on variational autoencoding GANs.

訳：

非並列多対多の話者性変換タスクにおける主観評価実験により，提案手法は変分オートエンコーダ型GAN（variational autoencoding GAN）に基づく最新の手法よりも高い音質と話者類似性を得ることが明らかになった。

次回↓

作成中