Paper Abstract 100-Knock Drill #20 - Ten Parallel Brains

Previous post ↓

ryosuke-okubo.hatenablog.com

96 PPO（2017）

Original:

Proximal Policy Optimization Algorithms

Abstract：

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.

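Translation:

We propose a new family of policy gradient methods for reinforcement learning that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function by stochastic gradient ascent.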

Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates.

Translation:

Standard policy gradient methods perform one gradient update per data sample, whereas we propose a new objective function that allows multiple epochs of minibatch updates.

The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).

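Translation:

The new methods, which we call proximal policy optimization (PPO), keep some of the benefits of trust region policy optimization (TRPO), yet they are far simpler to implement, more general, and empirically better in sample complexity.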

Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.

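Translation:

In our experiments we test PPO on a collection of benchmark tasks, such as simulated robotic locomotion and Atari game playing, and show that PPO outperforms other online policy gradient methods and, overall, strikes a favorable balance among sample complexity, simplicity, and wall-clock time.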
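
The abstract does not spell out the "surrogate" objective. As a rough illustration, the best-known variant in the PPO paper clips the probability ratio between the new and the old policy. The sketch below computes that clipped objective for one minibatch in plain Python. The function name, the sample numbers, and the default `eps=0.2` are assumptions made for illustration, not values taken from the abstract.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a minibatch.

    ratios: pi_new(a|s) / pi_old(a|s) for each sample (must be non-empty).
    advantages: advantage estimate for each sample.
    eps: clipping range (illustrative default, an assumption).
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        # Taking the minimum keeps the pessimistic (lower) estimate.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


# Made-up minibatch of three samples.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
objective = clipped_surrogate(ratios, advantages)
print(objective)
```

Once a sample's ratio leaves the range `[1 - eps, 1 + eps]` in the direction that would raise the objective, its term stops changing, so the incentive to move the policy further on that sample disappears. That is one way to read the claim that the same sampled data can be reused for several epochs of minibatch updates.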

97 ACER（2016）

Original:

Sample Efficient Actor-Critic with Experience Replay

Abstract：

This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems.

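Translation:

This paper presents an actor-critic deep reinforcement learning agent with experience replay. The agent is stable and sample efficient, and it performs remarkably well in challenging environments, including the 57-game discrete Atari domain and several continuous control problems.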

To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.

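Translation:

To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.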
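
To make "truncated importance sampling with bias correction" a little more concrete, here is a minimal sketch of the weight split as I understand it from the ACER paper: the importance ratio `rho` is capped at a threshold `c`, and a second coefficient accounts for the part that the cap cut off. The function names and the default `c=10.0` are assumptions for illustration, and the sketch covers only the weights, not the full gradient estimator.

```python
def truncated_weight(rho, c=10.0):
    """Importance ratio capped at c, which bounds the variance of the main term."""
    return min(c, rho)


def correction_coefficient(rho, c=10.0):
    """Coefficient of the bias-correction term: max(0, 1 - c / rho).

    It is zero whenever rho <= c, so the correction only acts on
    samples whose ratio was actually truncated. rho must be positive.
    """
    return max(0.0, 1.0 - c / rho)


# rho = pi(a|x) / mu(a|x): target policy over behavior policy (made-up values).
for rho in (0.5, 25.0):
    print(rho, truncated_weight(rho), correction_coefficient(rho))
```

For any positive `rho`, `truncated_weight(rho) + rho * correction_coefficient(rho)` recovers `rho`, which is the sense in which the second term restores what truncation removed.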

98 UNREAL（2016）

Original:

Reinforcement Learning with Unsupervised Auxiliary Tasks

Abstract：

Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward.

Translation:

Deep reinforcement learning agents have achieved state-of-the-art results by directly maximizing cumulative reward.

However, environments contain a much wider variety of possible training signals.

Translation:

However, environments contain a far wider variety of possible training signals.

In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning.

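Translation:

In this paper we introduce an agent that, through reinforcement learning, simultaneously maximizes many other pseudo-reward functions as well.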

All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards.

Vocabulary:

in the absence of: without; when something is not present

Translation:

All of these tasks share a common representation that, as in unsupervised learning, keeps developing even when there are no extrinsic rewards.
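
One simple way to picture "maximising many pseudo-reward functions simultaneously" over a shared representation is a single training loss that adds weighted auxiliary losses to the main RL loss. The sketch below assumes exactly that weighted-sum form. The task names, loss values, and weights are hypothetical and are not taken from the abstract.

```python
def combined_loss(base_loss, auxiliary_losses, weights):
    """Main RL loss plus a weighted sum of auxiliary task losses.

    auxiliary_losses: dict mapping task name -> loss value.
    weights: dict mapping task name -> weight (hypothetical values).
    """
    total = base_loss
    for name, loss in auxiliary_losses.items():
        total += weights[name] * loss
    return total


# Hypothetical numbers for a single update step.
aux = {"pixel_control": 0.5, "reward_prediction": 2.0}
lam = {"pixel_control": 0.1, "reward_prediction": 1.0}
print(combined_loss(1.0, aux, lam))
```

Because every term is computed on top of the same representation, minimizing this one quantity updates the shared features even on steps where the extrinsic reward gives no signal.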

We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task.

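Translation:

We also introduce a new mechanism for focusing this representation on extrinsic rewards, so that learning can adapt quickly to the most relevant aspects of the actual task.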

Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.

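Translation:

Our agent significantly outperforms the previous state of the art on Atari, averaging 880% of expert human performance. On a challenging suite of first-person, three-dimensional Labyrinth tasks, it learns 10× faster on average and averages 87% of expert human performance.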

99 NAC（2008）

Original:

Natural Actor-Critic

Abstract：

This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic.

Translation:

This paper investigates a new model-free reinforcement learning architecture, the Natural Actor-Critic.

The actor updates are based on stochastic policy gradients employing Amari’s natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression.

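Translation:

The actor updates are based on stochastic policy gradients that use Amari's natural gradient approach, while the critic obtains both the natural policy gradient and the additional parameters of a value function at the same time by linear regression.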

We show that actor improvements with natural policy gradients are particularly appealing as these are independent of coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients.

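Translation:

We show that improving the actor with natural policy gradients is particularly appealing, because these gradients do not depend on the coordinate frame of the chosen policy representation and can be estimated more efficiently than regular policy gradients.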
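
As a small illustration of what separates a natural gradient from a regular one: the natural gradient rescales the ordinary gradient by the inverse of the Fisher information matrix. The sketch below does this for a two-parameter policy by solving the 2x2 system directly. Note that this direct solve is a stand-in chosen for brevity; the paper's critic obtains the natural gradient by linear regression instead. The function name and all numbers are made up.

```python
def natural_gradient(fisher, grad):
    """Return F^{-1} g for a 2x2 Fisher matrix F and a gradient g."""
    (a, b), (c, d) = fisher
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("Fisher matrix is singular")
    g0, g1 = grad
    # Closed-form inverse of a 2x2 matrix applied to the gradient.
    return [(d * g0 - b * g1) / det, (-c * g0 + a * g1) / det]


# Made-up example: the two parameters live on very different scales.
fisher = [[4.0, 0.0], [0.0, 0.25]]
grad = [2.0, 0.5]
print(natural_gradient(fisher, grad))
```

In this example the regular gradient favors the first parameter, while the natural gradient favors the second, because the Fisher matrix indicates that the policy is far less sensitive to the second parameter. That rescaling is what makes the update direction independent of how the policy happens to be parameterized.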

The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation.

Translation:

The critic uses a special basis function parameterization motivated by policy-gradient-compatible function approximation.

We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke’s Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms.

Translation:

We show that several well-known reinforcement learning methods, such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning, are in fact Natural Actor-Critic algorithms.

Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.

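Translation:

Empirical evaluations show the effectiveness of our techniques compared with previous methods and demonstrate that they can be applied to learning control on an anthropomorphic robot arm.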

100 AlphaStar（2019）

Original:

AlphaStar: An Evolutionary Computation Perspective

Abstract：

In January 2019, DeepMind revealed AlphaStar to the world -- the first artificial intelligence (AI) system to beat a professional player at the game of StarCraft II -- representing a milestone in the progress of AI.

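Translation:

In January 2019, DeepMind revealed AlphaStar to the world. As the first artificial intelligence (AI) system to beat a professional player at StarCraft II, it marks a milestone in the progress of AI.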

AlphaStar draws on many areas of AI research, including deep learning, reinforcement learning, game theory, and evolutionary computation (EC).

Translation:

AlphaStar draws on many areas of AI research, including deep learning, reinforcement learning, game theory, and evolutionary computation (EC).

In this paper we analyze AlphaStar primarily through the lens of EC, presenting a new look at the system and relating it to many concepts in the field.

Translation:

In this paper we analyze AlphaStar mainly from the perspective of EC, offering a new view of the system and relating it to many concepts in the field.