Summarizing Deep Learning Papers #1: Image-to-Image Translation with Conditional Adversarial Networks (Pix2Pix)

In this series I collect the keywords and sentences I found important in deep learning papers. Post #1 covers Pix2Pix.

Abstract
1. Introduction
2. Related work
3. Method
- 3.1. Objective
- 3.2. Network architectures
4. Experiments
5. Conclusion
Impressions

Abstract

KeyWord

conditional adversarial networks

Sentence

We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems.

Paraphrase: conditional adversarial networks are studied as one general-purpose approach to image-to-image translation.

We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.

Paraphrase: the approach works for synthesizing photos from label maps, reconstructing objects from edge maps, colorizing images, and other tasks.

1. Introduction

[Figure 1]

KeyWord

CNN（Convolutional Neural Network）
GAN（Generative Adversarial Networks）[24]
cGAN（conditional GAN）[41]

Sentence

Our goal in this paper is to develop a common framework for all these problems.

Paraphrase: the aim of the paper is a single framework that covers all of these problems.

If we take a naive approach and ask the CNN to minimize the Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results [43, 62].

This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring.

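Paraphrase: a naive CNN trained to minimize the Euclidean distance between predicted and ground-truth pixels tends to produce blurry results [43, 62].

The Euclidean distance is smallest at the average of all plausible outputs, and that average is what looks blurry.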
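The averaging argument can be checked on a toy case. The sketch below is my own illustration, not something from the paper: it assumes a single pixel whose plausible values are either dark (0.0) or bright (1.0), and scans candidate predictions for the one with the lowest mean squared error.

```python
# Toy illustration (hypothetical data): one pixel that is plausibly dark or bright.
plausible = [0.0, 0.0, 1.0, 1.0]


def mean_squared_error(prediction, targets):
    """Average squared (Euclidean) error of one prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Scan candidate predictions on a grid from 0.00 to 1.00.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mean_squared_error(p, plausible))

# The arithmetic mean of the plausible values, for comparison.
average = sum(plausible) / len(plausible)

print("best prediction:", best)
print("mean of plausible values:", average)
```

The best prediction lands on the mean, a gray value that matches neither plausible answer. Spread over a whole image, this is the blur the quoted sentences describe.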

It would be highly desirable if we could instead specify only a high-level goal, like “make the output indistinguishable from reality”, and then automatically learn a loss function appropriate for satisfying this goal.

Fortunately, this is exactly what is done by the recently proposed Generative Adversarial Networks (GANs) [24, 13, 44, 52, 63].

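Paraphrase: it would be far better to state only a high-level goal, such as "make the output indistinguishable from reality", and have a loss function suited to that goal learned automatically.

That is exactly what the recently proposed Generative Adversarial Networks (GANs) do.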

Our primary contribution is to demonstrate that on a wide variety of problems, conditional GANs produce reasonable results.

Our second contribution is to present a simple framework sufficient to achieve good results, and to analyze the effects of several important architectural choices.

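Paraphrase: the main contribution is showing that conditional GANs give reasonable results across a wide range of problems.

The second contribution is a simple framework that is sufficient for good results, together with an analysis of several important network-architecture choices.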

2. Related work

Omitted.

3. Method

KeyWord

discriminator
generator
L1
L2
dropout
DCGAN[44]
convolution-BatchNorm-ReLu[29]
encoder-decoder network (e.g., Context Encoders [43])
U-Net[50]
PatchGAN
minibatch SGD
Adam solver

Sentence

The discriminator, D, learns to classify between fake (synthesized by the generator) and real {edge, photo} tuples.

The generator, G, learns to fool the discriminator.

Unlike an unconditional GAN, both the generator and discriminator observe the input edge map.（Figure 2）

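Paraphrase: the discriminator (D) learns to tell fake {edge, photo} tuples (synthesized by the generator) from real ones.

The generator (G) learns to fool the discriminator.

Unlike in an unconditional GAN, both networks see the input edge map (Figure 2).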

[Figure 2]

3.1. Objective

Previous approaches have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance [43].

Paraphrase: earlier approaches found it helpful to combine the GAN objective with a more traditional loss such as L2 distance [43].
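For reference, the objective this section builds toward can be written as below. This is a sketch from my reading of the paper rather than a quotation: x is the input image, y the target image, z the noise fed to the generator, and λ the weight that Section 4.2 refers to through Eqn. 4. As far as I can tell, the paper's final objective uses an L1 term rather than the L2 distance mentioned above.

```latex
\begin{align}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big] \\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
\end{align}
```

Setting λ = 0 leaves only the adversarial term, and a large λ pulls the output toward the ground truth pixel by pixel.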

3.2. Network architectures

We adapt our generator and discriminator architectures from those in [44] (i.e., DCGAN).

Both generator and discriminator use modules of the form convolution-BatchNorm-ReLu [29].

Paraphrase: the generator and discriminator architectures are adapted from those in [44].

Both are built from modules of the form convolution-BatchNorm-ReLu [29].

3.2.1 Generator with skips

Many previous solutions [43, 55, 30, 64, 59] to problems in this area have used an encoder-decoder network [26].

~

Such a network requires that all information flow pass through all the layers, including the bottleneck.

Paraphrase: many earlier solutions [43, 55, 30, 64, 59] in this area used an encoder-decoder network [26].

~

In such a network, all information has to pass through every layer, including the bottleneck.

To give the generator a means to circumvent the bottleneck for information like this, we add skip connections, following the general shape of a “U-Net” [50].

Paraphrase: skip connections, following the general shape of a "U-Net" [50], give the generator a way to route such information around the bottleneck.
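To see which layers a U-Net style skip can join, here is a small bookkeeping sketch. The concrete numbers are my assumptions, not values stated in this summary: a 256x256 input, eight downsampling convolutions with 4x4 kernels, stride 2 and padding 1, a decoder that doubles the size at each layer, and skips that pair layer i with layer n - i, where n is the total number of layers.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a strided convolution (assumed hyperparameters)."""
    return (size + 2 * padding - kernel) // stride + 1


def unet_skips(input_size=256, n_down=8):
    """Return per-layer output sizes and the (i, n - i) skip pairs."""
    n = 2 * n_down  # total layers: n_down encoder layers + n_down decoder layers

    # Encoder: each layer halves the spatial size.
    enc = []
    size = input_size
    for _ in range(n_down):
        size = conv_out(size)
        enc.append(size)

    # Decoder: assumed to double the spatial size at each layer.
    dec = [enc[-1] * 2 ** k for k in range(1, n_down + 1)]

    # Layers are numbered 1..n from input to output.
    size_of = {i + 1: s for i, s in enumerate(enc + dec)}

    # The bottleneck (layer n_down) would pair with itself, so it gets no skip.
    pairs = [(i, n - i) for i in range(1, n_down)]
    return size_of, pairs


size_of, pairs = unet_skips()
for i, j in pairs:
    print(f"layer {i} ({size_of[i]}x{size_of[i]}) -> layer {j} ({size_of[j]}x{size_of[j]})")
```

Each pair has the same spatial size on both ends, which is what allows the encoder features to be concatenated with the decoder features instead of being squeezed through the 1x1 bottleneck.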

[Figure 3]

3.2.2 Markovian discriminator (PatchGAN)

Therefore, we design a discriminator architecture – which we term a PatchGAN – that only penalizes structure at the scale of patches.

Paraphrase: the discriminator architecture, called a PatchGAN, penalizes structure only at the scale of patches.
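"The scale of patches" can be made concrete by computing the receptive field of one discriminator output. The layer list below is an assumption on my part (three 4x4 stride-2 convolutions followed by two 4x4 stride-1 convolutions, a stack often used for PatchGAN discriminators), so treat it as a sketch rather than the paper's exact configuration.

```python
def receptive_field(layers):
    """Receptive field of one output unit.

    layers: list of (kernel, stride) tuples ordered from input to output.
    Works backwards from a single output unit, growing the field per layer.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed PatchGAN-style stack: three stride-2 convs, then two stride-1 convs.
patch_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

# Assumed PixelGAN-style stack: only 1x1 convolutions.
pixel_layers = [(1, 1), (1, 1), (1, 1)]

print("PatchGAN receptive field:", receptive_field(patch_layers))
print("PixelGAN receptive field:", receptive_field(pixel_layers))
```

With this stack each output of the discriminator judges a 70x70 patch of the input, while a stack of 1x1 convolutions judges single pixels.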

3.3. Optimization and inference

We use minibatch SGD and apply the Adam solver [32], with a learning rate of 0.0002, and momentum parameters β1 = 0.5, β2 = 0.999.

Paraphrase: training uses minibatch SGD with the Adam solver [32], a learning rate of 0.0002, and momentum parameters β1 = 0.5 and β2 = 0.999.
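To show where these three numbers enter, here is a plain-Python sketch of a single Adam update on one scalar parameter. The learning rate and β values are the ones quoted above. The epsilon of 1e-8 and the toy objective (w - 3)^2 are my own assumptions for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0002, beta1=0.5, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective (hypothetical): minimize f(w) = (w - 3)^2 starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
    print(f"step {t}: w = {w:.6f}")
```

On the first step the bias-corrected ratio has magnitude close to 1, so the parameter moves by roughly the learning rate regardless of how large the gradient is.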

4. Experiments

KeyWord

Amazon Mechanical Turk（AMT）
FCN-8s

Sentence

To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation:

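Paraphrase: to probe how general conditional GANs are, the method is tested on a variety of tasks and datasets, covering both graphics tasks such as photo generation and vision tasks such as semantic segmentation.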

4.1. Evaluation metrics

Evaluating the quality of synthesized images is an open and difficult problem [52].

~

First, we run “real vs. fake” perceptual studies on Amazon Mechanical Turk (AMT).

~

Second, we measure whether or not our synthesized cityscapes are realistic enough that off-the-shelf recognition system can recognize the objects in them.

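Paraphrase: evaluating the quality of synthesized images is an open and difficult problem [52].

~

First, "real vs. fake" perceptual studies are run on Amazon Mechanical Turk (AMT).

~

Second, the authors measure whether the synthesized cityscapes are realistic enough for an off-the-shelf recognition system to recognize the objects in them.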

To this end, we adopt the popular FCN-8s [39] architecture for semantic segmentation, and train it on the cityscapes dataset.

Paraphrase: for this purpose the popular FCN-8s [39] architecture for semantic segmentation is adopted and trained on the Cityscapes dataset.
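The idea, as I understand it, is to segment the generated photo with the pretrained network and compare the predicted labels with the label map the photo was generated from. The sketch below computes two scores commonly used for segmentation, per-pixel accuracy and mean class IoU, on a made-up flattened label map. The data and function names are hypothetical, and which scores the paper actually reports is not stated in this summary.

```python
def per_pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the true label."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)


def mean_class_iou(pred, truth, num_classes):
    """Intersection-over-union per class, averaged over classes that occur."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, truth))
        union = sum(p == c or t == c for p, t in zip(pred, truth))
        if union > 0:  # skip classes absent from both prediction and truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical flattened label maps with three classes.
truth = [0, 0, 1, 1, 2, 2]  # labels the photo was synthesized from
pred = [0, 0, 1, 2, 2, 2]   # labels predicted on the synthesized photo

acc = per_pixel_accuracy(pred, truth)
iou = mean_class_iou(pred, truth, num_classes=3)
print("per-pixel accuracy:", acc)
print("mean class IoU:", iou)
```

A more realistic synthesized image should be easier for the segmentation network to label correctly, so higher scores are read as higher realism.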

4.2. Analysis of the objective function

L1 alone leads to reasonable but blurry results.

The cGAN alone (setting λ = 0 in Eqn. 4) gives much sharper results but introduces visual artifacts on certain applications.

Adding both terms together (with λ = 100) reduces these artifacts.

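Paraphrase: L1 alone gives reasonable but blurry results.

The cGAN alone (λ = 0 in Eqn. 4) gives much sharper results but introduces visual artifacts in some applications.

Adding the two terms together (with λ = 100) reduces these artifacts.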
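The role of λ can be seen by evaluating the generator's side of the objective on toy numbers. This is a sketch under my own assumptions: the adversarial term is written as log(1 - D) following the minimax form of the GAN objective, the L1 term is averaged over pixels, and the discriminator score and pixel values are made up.

```python
import math


def generator_loss(d_fake, output, target, lam=100.0):
    """Adversarial term plus lam times the mean L1 distance to the target.

    d_fake: discriminator's probability that the generated image is real.
    """
    adversarial = math.log(1.0 - d_fake)  # lower when D is fooled (d_fake near 1)
    l1 = sum(abs(o - t) for o, t in zip(output, target)) / len(target)
    return adversarial + lam * l1


# Hypothetical values: a two-pixel output and its ground truth.
output = [0.2, 0.8]
target = [0.0, 1.0]
d_fake = 0.5

loss_gan_only = generator_loss(d_fake, output, target, lam=0.0)
loss_combined = generator_loss(d_fake, output, target, lam=100.0)
print("lambda = 0  :", loss_gan_only)
print("lambda = 100:", loss_combined)
```

With λ = 100 the L1 term dominates the numbers, which fits the observation that it keeps the output close to the ground truth while the adversarial term supplies sharpness. Implementations often train G to maximize log D instead of minimizing log(1 - D), so the adversarial term here should be read as one possible form.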

[Figure 4]

Clearly, it is important, in this case, that the loss measure the quality of the match between input and output, and indeed cGAN performs much better than GAN.

Paraphrase: in this case it clearly matters that the loss measures how well the output matches the input, and cGAN does in fact perform much better than GAN.

[Table 1]

A striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where it does not exist in the input label map.

Paraphrase: a striking effect of conditional GANs is that they produce sharp images, hallucinating spatial structure even where the input label map contains none.

[Figure 7]

4.3. Analysis of the generator architecture

The advantages of the U-Net appear not to be specific to conditional GANs: when both U-Net and encoder-decoder are trained with an L1 loss, the U-Net again achieves the superior results.

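Paraphrase: the advantage of the U-Net does not appear to be specific to conditional GANs. When both U-Net and encoder-decoder are trained with an L1 loss, the U-Net again gives the better results.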

4.4. From PixelGANs to PatchGANs to ImageGANs

The PixelGAN has no effect on spatial sharpness but does increase the colorfulness of the results (quantified in Figure 7).

Paraphrase: the PixelGAN does not affect spatial sharpness, but it does make the results more colorful (quantified in Figure 7).

4.5. Perceptual validation

[Figure]

4.6. Semantic segmentation

To our knowledge, this is the first demonstration of GANs successfully generating “labels”, which are nearly discrete, rather than “images”, with their continuous valued variation.

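Paraphrase: as far as the authors know, this is the first demonstration of GANs successfully generating "labels", which are nearly discrete, as opposed to "images", which vary continuously.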

4.7. Community-driven Research

Omitted.

5. Conclusion

Omitted.

Impressions

Many tools appear in the paper only as bare terms, so it is easier to read if you already know what each one is used for.

十の並列した脳

Studying anything and everything; new posts planned every Monday and Thursday.
