Paper Abstract 100 Knocks #14 - Ten Parallel Brains

Previous post ↓

ryosuke-okubo.hatenablog.com

66 CTC (2006)

Original paper:

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Abstract:

Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data.

Translation:

In many real-world sequence learning tasks, sequences of labels must be predicted from noisy, unsegmented input data.

In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units.

Vocabulary:

acoustic

sub-word

Translation:

For example, speech recognition transcribes an acoustic signal into words or sub-word units.

Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks.

Translation:

RNNs are powerful sequence learners that appear well suited to such tasks.

However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited.

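Vocabulary:

so far

Translation:

However, because they need pre-segmented training data and post-processing to convert their outputs into label sequences, their applicability has been limited so far.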

This paper presents a novel method for training RNNs to label un-segmented sequences directly, thereby solving both problems.

Vocabulary:

thereby

Translation:

This paper presents a new method for training RNNs to label unsegmented sequences directly, which solves both problems.

An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.

Translation:

An experiment on the TIMIT speech corpus shows its advantages over both a baseline HMM and a hybrid HMM-RNN.
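
The abstract does not spell out how an RNN can label unsegmented input without post-processing. As CTC is commonly described, the network emits one symbol per frame (including a special blank), and a fixed many-to-one rule maps each frame-level path to a label sequence: merge consecutive repeats, then drop blanks. The sketch below shows only that mapping, not the training loss; the blank symbol "-" and the function name ctc_collapse are illustrative choices, not names from the paper.

```python
from itertools import groupby

BLANK = "-"  # illustrative blank symbol (an assumption, not from the abstract)


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to a label sequence.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: remove the blank symbol.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]


# Two different frame-level paths collapse to the same labelling.
print("".join(ctc_collapse("hh-e-ll-lo")))  # hello
print("".join(ctc_collapse("hhe-l-lo")))    # hello
```

Because the blank separates the two "l" runs, a genuinely doubled letter survives the merge step. This is why no hand-made frame-by-frame segmentation is needed: many alignments count as correct for one target string.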

67 WaveNet (2016)

Original paper:

WaveNet: A Generative Model for Raw Audio

Abstract:

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.

Vocabulary:

waveforms

Translation:

This paper introduces WaveNet, a DNN for generating raw audio waveforms.

The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones;

nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio.

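Translation:

The model is fully probabilistic and autoregressive: the predictive distribution for each audio sample is conditioned on all previous samples;

nevertheless, we show that it can be trained efficiently on audio data with tens of thousands of samples per second.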
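
"Autoregressive, conditioned on all previous ones" can be read as the factorisation p(x) = p(x_1) p(x_2 | x_1) ... p(x_T | x_1, ..., x_{T-1}). The sketch below illustrates only the resulting sampling loop. The function toy_predictive_distribution is a hypothetical stand-in for the network, and the four quantisation levels are an arbitrary assumption made to keep the example small.

```python
import random


def toy_predictive_distribution(history, num_levels=4):
    """Hypothetical stand-in for the network: returns p(x_t | x_1..x_{t-1}).

    It slightly favours repeating the previous sample, so the output
    depends on the history, as an autoregressive model's would.
    """
    weights = [1.0] * num_levels
    if history:
        weights[history[-1]] += 2.0
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, num_levels=4, seed=0):
    """Draw samples one at a time, each conditioned on all earlier ones."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_predictive_distribution(samples, num_levels)
        next_sample = rng.choices(range(num_levels), weights=probs)[0]
        samples.append(next_sample)
    return samples


print(generate(16))
```

The loop cannot produce sample t before sample t-1 exists, so generation is inherently sequential even when training is efficient.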

When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin.

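Vocabulary:

concatenative

Mandarin

Translation:

Applied to text-to-speech, it achieves state-of-the-art performance: human listeners rate it as significantly more natural sounding than the best parametric and concatenative systems, for both English and Mandarin.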

A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity.

Translation:

A single WaveNet captures the characteristics of many different speakers with equal fidelity and switches between them by conditioning on the speaker identity.

When trained to model music, we find that it generates novel and often highly realistic musical fragments.

Translation:

When trained to model music, it generates novel and often highly realistic musical fragments.

We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

Vocabulary:

phoneme

Translation:

We also show that it can be used as a discriminative model, with promising results for phoneme recognition.

68 Parallel WaveNet (2017)

Original paper:

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Abstract:

The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system.

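Translation:

The recently developed WaveNet architecture is the current state of the art in realistic speech synthesis and is consistently rated as more natural sounding than any previous system across many different languages.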

However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting.

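Vocabulary:

is ~ suited to

massively parallel computers

Translation:

However, because WaveNet generates audio sequentially, one sample at a time, it is poorly suited to today's massively parallel computers and is therefore hard to deploy in a real-time production setting.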

This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality.

Translation:

This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet without a significant difference in quality.

The resulting system is capable of generating high-fidelity speech samples at more than 20 times faster than real-time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.

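Vocabulary:

capable of

high-fidelity

Translation:

The resulting system can generate high-fidelity speech samples more than 20 times faster than real-time and is deployed online by Google Assistant, where it serves multiple English and Japanese voices.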
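
To get a feel for what "more than 20 times faster than real-time" demands, the arithmetic can be sketched as follows. The 24 kHz sample rate is an assumption chosen for illustration (the abstract does not state one), and both function names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000  # assumed for illustration; not stated in the abstract
REALTIME_FACTOR = 20     # "more than 20 times faster than real-time"


def required_throughput(sample_rate_hz, realtime_factor):
    """Audio samples that must be produced per wall-clock second."""
    return sample_rate_hz * realtime_factor


def per_sample_budget_us(sample_rate_hz, realtime_factor):
    """Microseconds available per sample if samples were made one at a time."""
    return 1e6 / required_throughput(sample_rate_hz, realtime_factor)


print(required_throughput(SAMPLE_RATE_HZ, REALTIME_FACTOR))   # 480000
print(round(per_sample_budget_us(SAMPLE_RATE_HZ, REALTIME_FACTOR), 2))
```

Under these assumptions a strictly sequential generator would have only about two microseconds per sample, which helps explain why the paper moves to a parallel feed-forward network.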

69 SSWS (2018)

Original paper:

Comprehensive evaluation of statistical speech waveform synthesis

Abstract:

Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality.

Translation:

Statistical TTS systems that directly predict the speech waveform have recently been reported to improve synthesis quality.

This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system.

Translation:

This study evaluates Amazon's statistical speech waveform synthesis (SSWS) system.

An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality.

Vocabulary:

in-depth

Translation:

SSWS is evaluated in depth across a number of domains to better understand how consistent its quality is.

The results of this evaluation are validated by repeating the procedure on a separate group of testers.

Translation:

The results are validated by repeating the procedure with a separate group of testers.

Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS.

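Translation:

Finally, the nature of SSWS speech errors is analysed in comparison with hybrid unit selection synthesis to identify the strengths and weaknesses of SSWS.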

Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.

Translation:

A deeper understanding of SSWS makes it possible to better define the focus of future work on improving this new technology.

70 MelNet (2019)

Original paper:

MelNet: A Generative Model for Audio in the Frequency Domain

Abstract:

Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps.

Translation:

Capturing high-level structure in audio waveforms is difficult because a single second of audio spans tens of thousands of timesteps.

While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms.

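Vocabulary:

tractably

Translation:

Long-range dependencies are hard to model directly in the time domain, but we show that they can be modelled more tractably in two-dimensional time-frequency representations such as spectrograms.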
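
As a rough illustration of what a two-dimensional time-frequency representation is, the sketch below computes a plain magnitude spectrogram with a short-time DFT using only the standard library. It is not the mel-scaled representation or the pipeline used in the paper; the frame length, hop size, Hann window, and function name are all assumptions made for the example.

```python
import cmath
import math


def spectrogram(signal, frame_len=16, hop=8):
    """Magnitude short-time DFT.

    Returns a list of time frames; each frame is a list of
    frame_len // 2 + 1 frequency-bin magnitudes.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window reduces leakage between neighbouring bins.
        windowed = [
            x * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / (frame_len - 1)))
            for n, x in enumerate(frame)
        ]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(windowed)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# A pure tone completing 4 cycles per 16-sample frame.
signal = [math.sin(2.0 * math.pi * 4 * n / 16) for n in range(64)]
spec = spectrogram(signal)
print(len(spec), len(spec[0]))  # 7 9
```

Here 64 time-domain samples become 7 frames of 9 bins, with the tone's energy concentrated around bin 4 in every frame. The time axis is much shorter than the waveform's, which is one way to see why long-range structure can be easier to model in this form.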

By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve.

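Vocabulary:

leveraging

in conjunction

Translation:

By exploiting this representational advantage together with a highly expressive probabilistic model and a multiscale generation procedure, we design a model that can generate high-fidelity audio samples capturing structure at timescales that time-domain models have not yet achieved.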

We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis

---showing improvements over previous approaches in both density estimates and human judgments.

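Translation:

We apply the model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis,

and show improvements over previous approaches in both density estimates and human judgments.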

Next post ↓

ryosuke-okubo.hatenablog.com