100 Paper Abstracts Drill #19 - Ten Parallel Brains

Previous post ↓

ryosuke-okubo.hatenablog.com

91 Ape-X（2018）

Original：

Distributed Prioritized Experience Replay

Abstract：

We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible.

Vocabulary：

at scale

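Translation：

We propose a distributed architecture for large-scale deep reinforcement learning, which lets agents learn effectively from far more data than was previously possible.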

The algorithm decouples acting from learning:

the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory;

the learner replays samples of experience and updates the neural network.

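Translation：

The algorithm separates acting from learning:

each actor interacts with its own instance of the environment, choosing actions according to a shared neural network, and stores the resulting experience in a shared experience replay memory;

the learner replays sampled experience and updates the neural network.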
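The split between acting and learning can be sketched as follows. This is a minimal single-process illustration, not the paper's implementation: the toy environment, the `act` policy, and the "update" that only counts steps are all hypothetical stand-ins, and the actors run one after another here rather than in parallel.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer, the reward is -|state|."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = self.rng.randint(-5, 5)

    def step(self, action):
        self.state += action
        return self.state, -abs(self.state)

def act(params, state, rng, epsilon=0.1):
    """Policy read from the shared parameters: move toward zero, with some exploration."""
    if rng.random() < epsilon:
        return rng.choice([-1, 1])
    return -1 if state * params["gain"] > 0 else 1

def run_actor(env, params, replay, steps, rng):
    """Actor: only generates experience; it never updates the parameters."""
    for _ in range(steps):
        state = env.state
        action = act(params, state, rng)
        next_state, reward = env.step(action)
        replay.append((state, action, reward, next_state))

def run_learner(params, replay, batch_size, rng):
    """Learner: only consumes experience and updates the shared parameters."""
    batch = rng.sample(list(replay), min(batch_size, len(replay)))
    mean_reward = sum(r for _, _, r, _ in batch) / len(batch)
    params["updates"] += 1  # stand-in for a real gradient step
    return mean_reward

rng = random.Random(0)
shared_params = {"gain": 1.0, "updates": 0}
shared_replay = deque(maxlen=1000)
envs = [ToyEnv(seed) for seed in range(4)]  # one environment instance per actor

for _ in range(5):
    for env in envs:
        run_actor(env, shared_params, shared_replay, steps=10, rng=rng)
    run_learner(shared_params, shared_replay, batch_size=32, rng=rng)
```

The point of the structure is that `run_actor` and `run_learner` touch only the shared parameters and the shared replay memory, so either side could be scaled out independently.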

The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors.

Translation：

The architecture relies on prioritized experience replay so that it focuses only on the most significant data generated by the actors.
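As a rough illustration of what "prioritized" means here, the sketch below samples replay indices with probability proportional to a power of each item's priority. The abstract does not give a formula; this is the proportional variant commonly described for prioritized experience replay, and the exponent `alpha=0.6` and the example priorities are assumptions. Importance-sampling corrections are left out.

```python
import random

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Draw replay indices (with replacement) according to the priorities."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)

# Hypothetical priorities, e.g. absolute TD errors of four stored transitions.
priorities = [0.1, 0.1, 2.0, 0.5]
probs = sampling_probabilities(priorities)
batch = sample_indices(priorities, batch_size=1000, rng=random.Random(0))
```

The transition with the largest priority (index 2) dominates the batch, which is the sense in which the learner concentrates on the most significant data.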

Our architecture substantially improves the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.

Vocabulary：

a fraction of

wall-clock

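Translation：

Our architecture substantially advances the state of the art on the Arcade Learning Environment, reaching better final performance in a small fraction of the wall-clock training time.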

92 R2D2（2019）

Original：

Recurrent Experience Replay in Distributed Reinforcement Learning

Abstract：

Building on the recent successes of distributed training of RL agents, in this paper we investigate the training of RNN-based RL agents from distributed prioritized experience replay.

Translation：

Building on recent successes in the distributed training of RL agents, this paper investigates how to train RNN-based RL agents from distributed prioritized experience replay.

We study the effects of parameter lag resulting in representational drift and recurrent state staleness and empirically derive an improved training strategy.

Vocabulary：

representational

staleness

derive

Translation：

We study how parameter lag leads to representational drift and stale recurrent states, and we empirically derive an improved training strategy.
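To make "recurrent state staleness" concrete, here is a toy sketch. The abstract does not describe the training strategy itself; the "burn-in" idea shown here (unrolling the current network over the first part of a replayed sequence before training on the rest) is my understanding of the paper, and the scalar RNN, weights, and inputs are all hypothetical.

```python
import math

def unroll(w, u, h, inputs):
    """Run a scalar toy RNN h <- tanh(w * h + u * x) over the inputs."""
    for x in inputs:
        h = math.tanh(w * h + u * x)
    return h

inputs = [0.5, -0.2, 0.8, 0.1, -0.6, 0.9, 0.3, -0.4, 0.7, 0.2]
old_w, new_w, u = 0.3, 0.5, 1.0  # the actor used old_w; the learner now holds new_w
burn_in_start = 5                # the replayed sequence begins at this step

# State the learner would ideally use: current parameters over the full history.
target = unroll(new_w, u, 0.0, inputs)

# Option 1: start training from a zero state.
error_zero = abs(0.0 - target)

# Option 2: use the state the actor recorded with its older parameters (stale).
error_stale = abs(unroll(old_w, u, 0.0, inputs) - target)

# Option 3: take the stale state stored at the start of the replayed sequence,
# then "burn in" by unrolling the current parameters over the replayed steps.
stored = unroll(old_w, u, 0.0, inputs[:burn_in_start])
refreshed = unroll(new_w, u, stored, inputs[burn_in_start:])
error_burn_in = abs(refreshed - target)
```

In this toy setting the burned-in state ends up much closer to the ideal state than either the zero state or the stale one, because the influence of the outdated starting state shrinks with every unrolled step.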

Using a single network architecture and fixed set of hyperparameters, the resulting agent, Recurrent Replay Distributed DQN, quadruples the previous state of the art on Atari-57, and matches the state of the art on DMLab-30.

Vocabulary：

quadruples

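Translation：

With a single network architecture and a fixed set of hyperparameters, the resulting agent, Recurrent Replay Distributed DQN, achieves four times the previous state-of-the-art score on Atari-57 and is on par with the state of the art on DMLab-30.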

It is the first agent to exceed human-level performance in 52 of the 57 Atari games.

Translation：

It is the first agent to surpass human-level performance in 52 of the 57 Atari games.

93 A3C（2016）

Original：

Asynchronous Methods for Deep Reinforcement Learning

Abstract：

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers.

Vocabulary：

asynchronous

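Translation：

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent to optimize deep neural network controllers.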

We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers.

Vocabulary：

stabilizing effect

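Translation：

We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners stabilize training, so that all four methods can successfully train neural network controllers.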
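The idea of several workers updating one set of shared parameters without waiting for each other can be sketched with threads. This is only an illustration of asynchronous gradient descent on a toy quadratic loss; the target values, learning rate, and worker count are assumptions, and no reinforcement learning is involved.

```python
import threading

TARGET = [3.0, -2.0]
params = [0.0, 0.0]  # shared parameters, updated in place by every worker

def worker(steps, lr):
    for _ in range(steps):
        # Read a (possibly slightly stale) snapshot of the shared parameters.
        snapshot = list(params)
        # Gradient of the quadratic loss 0.5 * ||params - TARGET||^2.
        grad = [p - t for p, t in zip(snapshot, TARGET)]
        # Apply the update to the shared parameters without taking a lock.
        for i, g in enumerate(grad):
            params[i] -= lr * g

threads = [threading.Thread(target=worker, args=(200, 0.05)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker computes its gradient from its own snapshot and writes straight into `params`, so an update may be based on slightly outdated values; on this simple loss the parameters still end up close to `TARGET`.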

The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU.

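Translation：

The best-performing method, an asynchronous variant of actor-critic, surpasses the current state of the art on the Atari domain while training for half the time, on a single multi-core CPU rather than a GPU.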

Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

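Translation：

Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems, and also on a new task of navigating random 3D mazes from visual input.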

94 DDPG（2015）

Original：

Continuous control with deep reinforcement learning

Abstract：

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.

Translation：

We adapt the ideas behind the success of Deep Q-Learning to the continuous action domain.

We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces.

Translation：

We present a model-free actor-critic algorithm, based on the deterministic policy gradient, that can operate over continuous action spaces.
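For reference, the deterministic policy gradient that the sentence above refers to is usually written as below. The abstract itself gives no formula, so the notation is an assumption: $\mu(s \mid \theta^\mu)$ is the deterministic actor, $Q(s, a \mid \theta^Q)$ the critic, and $\rho^\beta$ the state distribution of the behavior policy.

```latex
\nabla_{\theta^\mu} J \approx
\mathbb{E}_{s \sim \rho^\beta}\left[
  \nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s \mid \theta^\mu)}
  \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)
\right]
```

Because the actor outputs an action directly, the gradient flows through the critic's action input, which is what lets the method work in continuous action spaces without maximizing over actions.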

Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving.

Vocabulary：

dexterous

manipulation

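Translation：

Using the same learning algorithm, network architecture, and hyperparameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion, and car driving.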

Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives.

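Translation：

Our algorithm finds policies whose performance is competitive with those found by a planning algorithm that has full access to the domain's dynamics and the derivatives of those dynamics.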

We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end:

directly from raw pixel inputs.

Translation：

We further show that, for many of the tasks, the algorithm can learn policies end-to-end:

directly from raw pixel inputs.

95 TRPO（2015）

Original：

Trust Region Policy Optimization

Abstract：

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement.

Translation：

We describe an iterative procedure for optimizing policies with guaranteed monotonic improvement.

By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO).

Translation：

By making several approximations to the theoretically justified procedure, we develop a practical algorithm called Trust Region Policy Optimization (TRPO).
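The practical algorithm is commonly summarized as a constrained optimization problem like the one below. The abstract does not state it, so treat the notation as an assumption: $\pi_{\theta_{\text{old}}}$ is the current policy, $A$ its advantage function, and $\delta$ the trust-region size.

```latex
\max_{\theta} \;
\mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}\left[
  \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}
  \, A^{\pi_{\theta_{\text{old}}}}(s, a)
\right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\left[
  D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)
\right] \le \delta
```

The KL constraint keeps each new policy close to the previous one, which is where the name "trust region" comes from.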

This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks.

Translation：

This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks.

Our experiments demonstrate its robust performance on a wide variety of tasks:

learning simulated robotic swimming, hopping, and walking gaits;

and playing Atari games using images of the screen as input.

Translation：

Our experiments demonstrate its robust performance on a wide variety of tasks:

learning simulated robotic swimming, hopping, and walking gaits;

and playing Atari games using images of the screen as input.