論文Abstract100本ノック#17 - 十の並列した脳

前回↓

ryosuke-okubo.hatenablog.com

81 Santa（2015）

原文：

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Abstract：

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods;

however, this connection is not well studied.

訳：

確率的勾配マルコフ連鎖モンテカルロ（SG-MCMC）法は，広く使われている確率的最適化手法のベイズ版に相当する；

しかし，この関係は十分に研究されていない。

We explore this relationship by applying simulated annealing to an SGMCMC algorithm.

語彙：

simulated annealing

訳：

我々は焼きなまし法をSGMCMCアルゴリズムに適用してこの関係を調査する。

Furthermore, we extend recent SG-MCMC methods with two key components:

i) adaptive preconditioners (as in ADAgrad or RMSprop),

and ii) adaptive element-wise momentum weights.

訳：

さらに，最近のSG-MCMCメソッドを2つの主要コンポーネントで拡張する：

i）適応的な前処理行列（ADAgradやRMSpropと同様のもの）

ii）適応的な要素ごとの運動量の重み

The zero-temperature limit gives a novel stochastic optimization method with adaptive element-wise momentum weights, while conventional optimization methods only have a shared, static momentum weight.

訳：

ゼロ温度極限をとると適応的な要素ごとの運動量の重みを持つ新しい確率的最適化手法が得られるが，従来の最適化手法は共有された静的な運動量の重みしか持たない。

Under certain assumptions, our theoretical analysis suggests the proposed simulated annealing approach converges close to the global optima.

語彙：

assumptions

訳：

特定の仮定の下で，我々の理論的分析は提案された焼きなまし法によるアプローチが大域的最適解の近くに収束することを示唆する。

Experiments on several deep neural network models show state-of-the-art results compared to related stochastic optimization algorithms.

訳：

いくつかのDNNモデルの実験では，関連する確率的最適化アルゴリズムと比較して最高水準（state-of-the-art）の結果が示されている。
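The idea of annealing an SG-MCMC sampler into an optimizer can be sketched with a toy example. The code below is not the Santa algorithm from the paper: it is a minimal annealed SGLD-style update with an RMSprop-like diagonal preconditioner, and it omits the adaptive element-wise momentum as well as the correction term that a position-dependent preconditioner formally requires. The function name, the schedule `inv_temp = t ** beta_rate`, and all default values are assumptions made for illustration. As the inverse temperature grows, the injected noise shrinks, so the sampler gradually behaves like a preconditioned gradient method (the zero-temperature limit).

```python
import math
import random


def annealed_preconditioned_sgld(grad, theta, steps=2000, lr=0.01,
                                 beta_rate=1.0, decay=0.99, eps=1e-2, seed=0):
    """Toy annealed SGLD with an RMSprop-style diagonal preconditioner.

    Illustrative sketch only, not the Santa algorithm.
    grad  : function mapping a list of parameters to a list of gradients
    theta : initial parameters (list of floats)
    """
    rng = random.Random(seed)
    theta = list(theta)
    v = [0.0] * len(theta)  # running average of squared gradients
    for t in range(1, steps + 1):
        inv_temp = t ** beta_rate  # assumed annealing schedule (grows with t)
        g = grad(theta)
        for i in range(len(theta)):
            v[i] = decay * v[i] + (1.0 - decay) * g[i] ** 2
            precond = 1.0 / (math.sqrt(v[i]) + eps)
            # Langevin noise, scaled down as the temperature 1/inv_temp drops
            noise_std = math.sqrt(2.0 * lr * precond / inv_temp)
            theta[i] += -lr * precond * g[i] + rng.gauss(0.0, 1.0) * noise_std
    return theta


# Usage: minimize f(x) = 0.5 * (x - 3)^2, whose gradient is x - 3.
result = annealed_preconditioned_sgld(lambda th: [th[0] - 3.0], [0.0])
print(result)
```

Early iterations explore with large noise; late iterations are dominated by the gradient term, which is the behavior the abstract describes as bridging sampling and optimization.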

82 GD by GD（2016）

原文：

Learning to learn by gradient descent by gradient descent

Abstract：

The move from hand-designed features to learned features in machine learning has been wildly successful.

訳：

機械学習において手作業で設計された特徴から学習された特徴への移行は，大成功を収めている。

In spite of this, optimization algorithms are still designed by hand.

語彙：

In spite of this

訳：

それにもかかわらず，最適化アルゴリズムは依然として手作業で設計されている。

In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way.

訳：

本論文では，最適化アルゴリズムの設計を学習問題としてどのように定式化できるかを示し，それによってアルゴリズムが関心のある問題の構造を自動的に活用することを学習できるようにする。

Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure.

訳：

LSTMによって実装された我々の学習済みアルゴリズムは，訓練に用いたタスクにおいて汎用的な手設計の競合手法よりも優れており，同様の構造を持つ新しいタスクにもうまく汎化する。

We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.

訳：

我々はこれを単純な凸問題，ニューラルネットワークの学習，ニューラルアートによる画像のスタイリングなど，いくつかのタスクで実証する。
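To make "casting optimizer design as a learning problem" concrete, here is a deliberately tiny sketch. The paper parameterizes the optimizer with an LSTM; the code below instead uses a single learnable scalar `phi` (a step size), which is an assumption made purely to keep the example short. The optimizer `theta <- theta - phi * f'(theta)` is unrolled for a few steps on random 1-D quadratics `f(theta) = 0.5 * a * theta^2`, and `phi` itself is then updated by gradient descent on the final loss, so gradient descent is learned by gradient descent. All names and constants are hypothetical.

```python
import random


def unrolled_loss_and_grad(phi, a, theta0, unroll=5):
    """Unroll `unroll` steps of theta <- theta - phi * f'(theta)
    on f(theta) = 0.5 * a * theta^2.

    Returns (final loss, d(final loss)/d(phi)), where the derivative
    is tracked in forward mode alongside the optimization trajectory.
    """
    theta, dtheta = theta0, 0.0  # dtheta = d(theta)/d(phi)
    for _ in range(unroll):
        g = a * theta  # f'(theta)
        # derivative of the update rule with respect to phi
        dtheta = dtheta - g - phi * a * dtheta
        theta = theta - phi * g
    loss = 0.5 * a * theta ** 2
    return loss, a * theta * dtheta


def meta_train(phi=0.01, meta_lr=0.02, iterations=200, seed=0):
    """Learn the step size phi by gradient descent on the unrolled loss."""
    rng = random.Random(seed)
    for _ in range(iterations):
        a = rng.uniform(0.5, 1.5)         # random task: curvature
        theta0 = rng.uniform(-1.0, 1.0)   # random task: starting point
        _, dphi = unrolled_loss_and_grad(phi, a, theta0)
        phi -= meta_lr * dphi
    return phi


learned_phi = meta_train()
print(learned_phi)
```

The learned step size is tuned to the family of problems it was trained on, which mirrors, in miniature, the abstract's claim that learned optimizers exploit the structure of the problems of interest.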

83 AdaSecant（2017）

原文：

A Robust Adaptive Stochastic Gradient Method for Deep Learning

Abstract：

Stochastic gradient algorithms are the main focus of large-scale optimization problems and led to important successes in the recent advancement of the deep learning algorithms.

訳：

確率的勾配アルゴリズムは大規模な最適化問題の主な焦点であり，ディープラーニングアルゴリズムの最近の進歩において重要な成功をもたらした。

The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients.

訳：

SGDの収束は学習率の慎重な選択と勾配の確率的推定におけるノイズの量に依存する。

In this paper, we propose an adaptive learning rate algorithm, which utilizes stochastic curvature information of the loss function for automatically tuning the learning rates.

訳：

本論文では，学習率を自動的に調整するために損失関数の確率的曲率情報を利用する，適応学習率アルゴリズムを提案する。

The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients.

訳：

損失関数の要素ごとの曲率に関する情報は確率的一次勾配の局所統計から推定される。

We further propose a new variance reduction technique to speed up the convergence.

訳：

我々はさらに収束を高速化する新しい分散低減手法を提案する。

In our experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.

訳：

DNNを使用した実験では，一般的な確率的勾配アルゴリズムと比較して優れたパフォーマンスが得られた。
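The statement that element-wise curvature can be estimated from first-order gradients can be illustrated with a finite-difference (secant) estimate. The sketch below is not the AdaSecant algorithm: it uses exact gradients instead of running statistics of stochastic gradients, and it leaves out the variance reduction step. The function name and the test problem are assumptions. For each coordinate, the curvature is approximated by the change in the gradient divided by the change in the parameter.

```python
def secant_curvature(grad, theta, delta, eps=1e-12):
    """Element-wise secant estimate of curvature.

    h_i ~= (g_i(theta + delta) - g_i(theta)) / delta_i
    grad  : function mapping a list of parameters to a list of gradients
    theta : current parameters
    delta : parameter displacement (for example, the previous update)
    """
    g0 = grad(theta)
    g1 = grad([t + d for t, d in zip(theta, delta)])
    return [(b - a) / (d if abs(d) > eps else eps)
            for a, b, d in zip(g0, g1, delta)]


# Usage on f(x, y) = 0.5 * (2 x^2 + 10 y^2), gradient (2x, 10y).
grad = lambda th: [2.0 * th[0], 10.0 * th[1]]
theta = [1.0, 1.0]
curvature = secant_curvature(grad, theta, [0.1, 0.1])

# Use 1 / curvature as a per-coordinate learning rate.
g = grad(theta)
theta_next = [t - gi / h for t, gi, h in zip(theta, g, curvature)]
print(curvature, theta_next)
```

On this diagonal quadratic the secant estimate recovers the true curvature of each coordinate, so the per-coordinate step lands on the minimum; with noisy gradients the estimate would need averaging, which is where the statistics mentioned in the abstract come in.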

84 AMSGrad（2019）

原文：

On the Convergence of Adam and Beyond

Abstract：

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks such as RMSProp, Adam, Adadelta, Nadam are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients.

訳：

RMSProp，Adam，Adadelta，Nadamなど，ディープネットワークの学習で成功を収めてきた最近のいくつかの確率的最適化手法は，過去の勾配の2乗の指数移動平均の平方根でスケーリングした勾配更新に基づいている。

In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings).

訳：

多くの応用，例えば出力空間が大きい学習において，これらのアルゴリズムが最適解（非凸設定では臨界点）に収束しないことが経験的に観察されている。

We show that one cause for such failures is the exponential moving average used in the algorithms.

訳：

このような失敗の原因の1つは，アルゴリズムで使用される指数移動平均であることを示す。

We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of Adam algorithm.

訳：

我々はAdamが最適解に収束しない単純な凸最適化設定の明示的な例を示し，Adamアルゴリズムの従来の解析に含まれる具体的な問題点を説明する。

Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with `long-term memory' of past gradients, and propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.

訳：

我々の分析はこのようなアルゴリズムに過去の勾配の「長期記憶」を与えることで収束の問題を修正できることを示唆し，収束の問題を修正するだけでなくしばしば経験的パフォーマンスの改善にもつながるAdamアルゴリズムの新しいバリアントを提案する。
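The "long-term memory" fix can be shown in a few lines. The sketch below follows the commonly described AMSGrad update: it keeps the element-wise maximum of the second-moment estimate so that the effective step size never increases. It assumes a constant learning rate and no bias correction for simplicity; the function name, the state layout, and the default values are choices made for this example.

```python
import math


def amsgrad_step(theta, g, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style update on lists of floats.

    state = (m, v, v_hat): first moment, second moment, and the running
    maximum of the second moment (the 'long-term memory').
    """
    m, v, v_hat = state
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Difference from Adam: never let the denominator shrink.
    v_hat = [max(a, b) for a, b in zip(v_hat, v)]
    theta = [t - lr * mi / (math.sqrt(vh) + eps)
             for t, mi, vh in zip(theta, m, v_hat)]
    return theta, (m, v, v_hat)


# Usage: minimize f(x) = 0.5 * x^2, whose gradient is x.
theta = [1.0]
state = ([0.0], [0.0], [0.0])
for _ in range(500):
    theta, state = amsgrad_step(theta, [theta[0]], state)
print(theta)
```

In plain Adam the denominator uses `v` directly, so a run of small gradients can shrink it and inflate the step size; taking the maximum prevents that.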

85 AdaBound＆AMSBound（2019）

原文：

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Abstract：

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates.

訳：

AdaGrad，RMSprop，Adamなどの適応型最適化手法が，学習率に対する要素ごとのスケーリング項を用いて迅速な学習プロセスを実現するために提案されてきた。

Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates.

語彙：

unstable

訳：

広く使われているものの，SGDと比較して汎化性能が劣ること，さらには不安定で極端な学習率のために収束しないことさえあることが観察されている。

Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.

語彙：

tackle

訳：

最近の研究でこの問題に取り組むためにAMSGradなどのいくつかのアルゴリズムが提案されたが，既存の方法を大幅に改善することはできなかった。

In our paper, we demonstrate that extreme learning rates can lead to poor performance.

訳：

我々の論文では，極端な学習率がパフォーマンスの低下につながる可能性があることを示す。

We provide new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence.

訳：

我々はAdamおよびAMSGradの新しいバリアントとして，それぞれAdaBoundおよびAMSBoundを提案する。これらは学習率に動的な境界を設けることで適応的手法からSGDへの段階的かつ滑らかな移行を実現する。また，収束の理論的証明も与える。

We further conduct experiments on various popular tasks and models, which is often insufficient in previous work.

語彙：

insufficient

訳：

我々はさらに，さまざまな一般的なタスクとモデルで実験を行う。こうした実験は先行研究ではしばしば不十分であった。

Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time.

訳：

実験結果は，新しいバリアントが適応法とSGDの間の汎化ギャップを解消し，同時に学習の初期段階で高い学習速度を維持できることを示す。
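The dynamic bound on the learning rate can be sketched as a clipping step added to an Adam-like update. The code below is a simplified illustration, not a reference implementation: it omits bias correction and any additional step-size decay, and the specific bound schedule (both bounds converging to `final_lr` at a rate set by `gamma`) together with every default value is an assumption chosen for this example. Early in training the bounds are loose, so the update behaves like an adaptive method; later they tighten around `final_lr`, so every coordinate uses nearly the same step size, as in SGD with momentum.

```python
import math


def adabound_bounds(t, final_lr=0.1, gamma=0.001):
    """Assumed bound schedule for step t >= 1; both bounds tend to final_lr."""
    lower = final_lr * (1.0 - 1.0 / (gamma * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * t))
    return lower, upper


def adabound_step(theta, g, state, t, lr=0.001, final_lr=0.1, gamma=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBound-style update on lists of floats (t starts at 1)."""
    m, v = state
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    lower, upper = adabound_bounds(t, final_lr, gamma)
    new_theta = []
    for th, mi, vi in zip(theta, m, v):
        # Adam-style per-coordinate step size, clipped into [lower, upper]
        step_size = lr / (math.sqrt(vi) + eps)
        step_size = min(max(step_size, lower), upper)
        new_theta.append(th - step_size * mi)
    return new_theta, (m, v)


# Usage: minimize f(x) = 0.5 * x^2, whose gradient is x.
theta = [1.0]
state = ([0.0], [0.0])
for t in range(1, 2001):
    theta, state = adabound_step(theta, [theta[0]], state, t)
print(theta)
```

Replacing `v` with its running maximum before the clipping step would give the corresponding AMSBound-style variant.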