Paper Abstract 100-Knock Drill #18 - 十の並列した脳

Previous post ↓

ryosuke-okubo.hatenablog.com

Entries 86 to 100 cover reinforcement learning.

Reference:

https://qiita.com/shionhonda/items/ec05aade07b5bea78081

86 Deep Q-Network（2013）

Original paper:

Playing Atari with Deep Reinforcement Learning

Abstract:

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.

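Vocabulary:

reinforcement learning

Translation:

We introduce the first deep learning model that successfully learns control policies directly from high-dimensional sensory input by means of reinforcement learning.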

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

Translation:

The model is a CNN trained with a variant of Q-learning; it takes raw pixels as input and outputs a value function that estimates future rewards.
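
As a rough illustration of the "variant of Q-learning" mentioned above, the sketch below computes the standard one-step Q-learning regression target, r + gamma * max over a' of Q(s', a'), from a plain list of estimated action values. The function name `q_learning_target`, the discount factor `gamma`, and the `done` flag are assumptions made for this example and do not come from the abstract; in the paper the action values are produced by the CNN rather than supplied by hand.

```python
def q_learning_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    next_q_values: estimated action values for the next state (hypothetical input).
    done: True if the episode ended, in which case no future reward is added.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical action-value estimates for the next state (three actions).
next_q = [0.5, 2.0, 1.0]
target = q_learning_target(reward=1.0, next_q_values=next_q, gamma=0.9)
terminal_target = q_learning_target(reward=1.0, next_q_values=next_q, gamma=0.9, done=True)
print(target, terminal_target)
```

The network's prediction for the action actually taken would then be regressed toward this target; that training step is omitted here.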

We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm.

Translation:

We apply the method to seven Atari 2600 games from the Arcade Learning Environment without adjusting the architecture or the learning algorithm.

We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Translation:

We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

87 Double Deep Q Network（2015）

Original paper:

Deep Reinforcement Learning with Double Q-learning

Abstract:

The popular Q-learning algorithm is known to overestimate action values under certain conditions.

Vocabulary:

overestimate

Translation:

The widely used Q-learning algorithm is known to overestimate action values under certain conditions.
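
One way to see where such overestimation can come from: Q-learning takes a maximum over estimated action values, and the maximum of noisy estimates tends to sit above the true maximum. The simulation below is a minimal sketch under assumed conditions (every true action value is 0, estimates carry independent zero-mean Gaussian noise); the function name and all parameters are hypothetical and chosen only for illustration.

```python
import random


def max_estimate_bias(num_actions=5, noise_std=1.0, trials=10_000, seed=0):
    """Average of max_a Q_hat(a) when every true action value is exactly 0.

    Any positive result is pure overestimation, because the true maximum is 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Noisy estimates of action values whose true value is 0.
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)
    return total / trials


bias = max_estimate_bias()
single_action_bias = max_estimate_bias(num_actions=1)
print(f"true max: 0.0, average estimated max with 5 actions: {bias:.2f}")
print(f"average estimated max with 1 action: {single_action_bias:.2f}")
```

With a single action there is nothing to maximize over, so the average stays near 0; with several actions the average of the maximum is clearly positive even though no action is actually better than another.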

It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented.

Translation:

Until now it was not known whether such overestimations are common in practice, whether they hurt performance, and whether they can be prevented in general.

In this paper, we answer all these questions affirmatively.

Vocabulary:

affirmatively

Translation:

In this paper, we answer all of these questions in the affirmative.

In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain.

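Vocabulary:

substantial

Translation:

In particular, we first show that the recent DQN algorithm, which combines Q-learning with a DNN, produces substantial overestimations in some games in the Atari 2600 domain.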

We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation.

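Translation:

We then show that the idea behind the Double Q-learning algorithm, originally introduced in a tabular setting, can be generalized to work with large-scale function approximation.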

We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.

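Vocabulary:

hypothesized

Translation:

We propose a specific adaptation of the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but also performs much better on several games.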
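
The abstract does not spell out the adaptation. In the paper body, the action for the next state is selected with the online network and evaluated with the target network. The sketch below contrasts that target with the plain DQN target. The function names and the hand-written value lists are assumptions for illustration; real implementations obtain both lists from neural networks.

```python
def dqn_target(reward, next_q_target, gamma=0.99):
    """Plain DQN target: the target network both selects and evaluates the action."""
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Double DQN target: the online network selects, the target network evaluates."""
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical action values for the next state from the two networks.
next_q_online = [1.0, 2.0, 0.5]
next_q_target = [0.3, 0.1, 0.9]

plain = dqn_target(1.0, next_q_target, gamma=0.9)
double = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)
print(plain, double)
```

Because the evaluated action is not necessarily the one the target network rates highest, the Double DQN target computed this way can never exceed the plain DQN target for the same inputs.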

88 Rainbow（2017）

Original paper:

Rainbow: Combining Improvements in Deep Reinforcement Learning

Abstract:

The deep reinforcement learning community has made several independent improvements to the DQN algorithm.

Translation:

The deep reinforcement learning community has made several independent improvements to the DQN algorithm.

However, it is unclear which of these extensions are complementary and can be fruitfully combined.

Vocabulary:

fruitfully

Translation:

However, it is unclear which of these extensions are complementary and can be combined to good effect.

This paper examines six extensions to the DQN algorithm and empirically studies their combination.

Translation:

This paper examines six extensions to the DQN algorithm and empirically studies their combination.

Our experiments show that the combination provides state-of-the-art performance on the Atari 2600 benchmark, both in terms of data efficiency and final performance.

Translation:

Our experiments show that the combination achieves state-of-the-art performance on the Atari 2600 benchmark in terms of both data efficiency and final performance.

We also provide results from a detailed ablation study that shows the contribution of each component to overall performance.

Vocabulary:

overall

Translation:

We also report the results of a detailed ablation study showing how much each component contributes to overall performance.

89 Dueling Network（2015）

Original paper:

Dueling Network Architectures for Deep Reinforcement Learning

Abstract:

In recent years there have been many successes of using deep representations in reinforcement learning.

Translation:

In recent years there have been many successes in using deep representations in reinforcement learning.

Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders.

Vocabulary:

conventional

Translation:

Even so, many of these applications use conventional architectures such as convolutional networks, LSTMs, or autoencoders.

In this paper, we present a new neural network architecture for model-free reinforcement learning.

Translation:

In this paper, we present a new neural network architecture for model-free reinforcement learning.

Our dueling network represents two separate estimators:

one for the state value function and one for the state-dependent action advantage function.

Translation:

The dueling network represents two separate estimators:

one for the state value function and one for the state-dependent action advantage function.
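
To make the two estimators concrete, the sketch below combines a state value V(s) and per-action advantages A(s, a) into action values. The abstract does not say how the two streams are merged; the mean-subtracted form used here, Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'), follows the paper body. The function name and the example numbers are hypothetical, and in the real architecture both inputs come from two heads of one network.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) using the mean-subtracted form.

    Subtracting the mean advantage keeps the average of the Q-values equal to V(s).
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


# Hypothetical outputs of the value stream and the advantage stream.
q_values = dueling_q_values(2.0, [1.0, -1.0, 0.0])
shifted = dueling_q_values(5.0, [1.0, 2.0, 3.0])
print(q_values, shifted)
```

In the second call the advantages are all positive, yet the resulting Q-values are centred on the state value 5.0; only the differences between advantages affect which action looks best.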

The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm.

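Vocabulary:

underlying

Translation:

The main benefit of this factoring is that it generalizes learning across actions without requiring any change to the underlying reinforcement learning algorithm.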

Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions.

Vocabulary:

in the presence of

Translation:

Our results show that this architecture leads to better policy evaluation when many actions have similar values.

Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.

Translation:

Moreover, the dueling architecture enables our RL agent to outperform the state of the art in the Atari 2600 domain.

90 Gorila（2015）

Original paper:

Massively Parallel Methods for Deep Reinforcement Learning

Abstract:

We present the first massively distributed architecture for deep reinforcement learning.

Vocabulary:

massively

Translation:

We present the first massively distributed architecture for deep reinforcement learning.

This architecture uses four main components:

parallel actors that generate new behaviour;

parallel learners that are trained from stored experience;

a distributed neural network to represent the value function or behaviour policy;

and a distributed store of experience.

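Vocabulary:

behaviour

Translation:

This architecture uses four main components:

parallel actors that generate new behaviour;

parallel learners that are trained from stored experience;

a distributed neural network that represents the value function or behaviour policy;

and a distributed store of experience.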
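
The four components listed above can be sketched in a few lines. This is a heavily simplified, single-process illustration and not the paper's implementation: actors and learners take turns instead of running in parallel, the "distributed neural network" is stood in for by a tiny Q-table held in a hypothetical `ParameterServer` class, and the environment is an invented one-step task in which action 1 pays reward 1 and action 0 pays nothing. All class and function names are assumptions.

```python
import random
from collections import deque

NUM_ACTIONS = 2  # toy task: action 1 gives reward 1.0, action 0 gives 0.0


class ParameterServer:
    """Stand-in for the distributed network: holds shared parameters, applies gradients."""

    def __init__(self, lr=0.1):
        self.q = [[0.0] * NUM_ACTIONS]  # one state, NUM_ACTIONS actions
        self.lr = lr
        self.updates = 0

    def get_params(self):
        return [row[:] for row in self.q]  # a copy, like a parameter sync

    def apply_gradient(self, state, action, grad):
        self.q[state][action] -= self.lr * grad
        self.updates += 1


class ReplayStore:
    """Stand-in for the distributed store of experience."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, rng):
        return rng.choice(self.buffer)


def actor_step(params, rng, epsilon=0.5):
    """Actor: act epsilon-greedily with its copy of the parameters, emit a transition."""
    state = 0
    if rng.random() < epsilon:
        action = rng.randrange(NUM_ACTIONS)
    else:
        action = max(range(NUM_ACTIONS), key=lambda a: params[state][a])
    reward = 1.0 if action == 1 else 0.0
    return (state, action, reward)


def learner_step(params, transition):
    """Learner: gradient of 0.5 * (Q - target)^2; episodes last one step, so target = reward."""
    state, action, reward = transition
    return state, action, params[state][action] - reward


def run(num_actors=4, num_learners=2, rounds=200, seed=0):
    rng = random.Random(seed)
    server, store = ParameterServer(), ReplayStore()
    for _ in range(rounds):
        for _ in range(num_actors):  # actors generate new behaviour
            store.add(actor_step(server.get_params(), rng))
        for _ in range(num_learners):  # learners train from stored experience
            grad = learner_step(server.get_params(), store.sample(rng))
            server.apply_gradient(*grad)
    return server


server = run()
print(server.q, server.updates)
```

The point of the sketch is the data flow: actors only read parameters and write experience, learners only read experience and send gradients, and the parameter server is the single place where the shared values change.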

We used our architecture to implement the Deep Q-Network algorithm (DQN).

Translation:

We used this architecture to implement the Deep Q-Network algorithm (DQN).

Our distributed algorithm was applied to 49 games from Atari 2600 games from the Arcade Learning Environment, using identical hyperparameters.

Vocabulary:

identical

Translation:

Our distributed algorithm was applied to 49 Atari 2600 games from the Arcade Learning Environment, using identical hyperparameters.

Our performance surpassed non-distributed DQN in 41 of the 49 games and also reduced the wall-time required to achieve these results by an order of magnitude on most games.

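Vocabulary:

wall-time

order of magnitude

Translation:

Our performance surpassed non-distributed DQN in 41 of the 49 games, and on most games it also cut the wall-time needed to reach these results by an order of magnitude.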

Next post ↓

ryosuke-okubo.hatenablog.com