Paper Abstract 100 Knocks, Extra Edition: Papers Presented at the 55th Computer Vision Study Group @ Kanto

This post covers the papers presented at the 55th Computer Vision Study Group @ Kanto, worked through in the same style as always.

1 Probabilistic Face Embeddings

原文：

Probabilistic Face Embeddings

Abstract：

Embedding methods have achieved success in face recognition by comparing facial features in a latent semantic space.

訳：

埋め込み方法は，潜在的な意味空間で顔の特徴を比較することによって顔認識で成功を収めている。

However, in a fully unconstrained face setting, the facial features learned by the embedding model could be ambiguous or may not even be present in the input face, leading to noisy representations.

語彙：

unconstrained

ambiguous

leading to

訳：

ただし，完全に制約のない顔の設定では，埋め込みモデルによって学習された顔の特徴があいまいになるか入力顔に存在しない場合があり，ノイズの多い表現につながる。

We propose Probabilistic Face Embeddings (PFEs), which represent each face image as a Gaussian distribution in the latent space.

訳：

我々はProbabilistic Face Embeddings（PFEs）を提案する，これは各顔画像を潜在空間のガウス分布として表す。

The mean of the distribution estimates the most likely feature values while the variance shows the uncertainty in the feature values.

訳：

分布の平均は最も可能性の高い特徴値を推定し，分散は特徴値の不確実性を示す。

Probabilistic solutions can then be naturally derived for matching and fusing PFEs using the uncertainty information.

訳：

その結果，不確実性情報を使用して，PFEsをマッチングおよび融合するための確率的な解法を自然に導出できる。
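
As a rough illustration of the matching and fusion step described above, here is a minimal Python sketch. The abstract gives no formulas, so everything in it is an assumption: each embedding is treated as a diagonal Gaussian (a list of means and a list of variances), matching uses the log-likelihood that two embeddings share one latent code, and fusion uses a precision-weighted average. The names `mutual_likelihood_score` and `fuse` are hypothetical.

```python
import math


def mutual_likelihood_score(mu1, var1, mu2, var2):
    """Log-likelihood that two diagonal-Gaussian embeddings share one latent code.

    Assumption: both embeddings are independent Gaussians per dimension,
    so the difference of their means has variance var1 + var2.
    """
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = v1 + v2
        total += (m1 - m2) ** 2 / v + math.log(v)
    return -0.5 * total - 0.5 * len(mu1) * math.log(2.0 * math.pi)


def fuse(mu1, var1, mu2, var2):
    """Precision-weighted fusion of two embeddings of the same identity."""
    mu, var = [], []
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        # The more certain embedding (smaller variance) gets the larger weight.
        mu.append((m1 * v2 + m2 * v1) / (v1 + v2))
        var.append(v1 * v2 / (v1 + v2))
    return mu, var


# Hypothetical toy embeddings: one sharp face and one blurry face.
sharp = ([1.0, 0.0], [0.01, 0.01])
blurry = ([1.0, 0.0], [1.0, 1.0])
score_sharp = mutual_likelihood_score(*sharp, *sharp)
score_blurry = mutual_likelihood_score(*sharp, *blurry)
fused_mu, fused_var = fuse(*sharp, *blurry)
```

In this sketch a pair with identical means still scores lower when one side is uncertain, and the fused variance is always smaller than either input variance, which matches the intuition that the variance carries the uncertainty.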

Empirical evaluation on different baseline models, training datasets and benchmarks show that the proposed method can improve the face recognition performance of deterministic embeddings by converting them into PFEs.

訳：

さまざまなベースラインモデル，トレーニングデータセット，およびベンチマークに関する経験的評価により，提案された方法はPFEに変換することによって決定論的埋め込みの顔認識パフォーマンスを改善できることが示されている。

The uncertainties estimated by PFEs also serve as good indicators of the potential matching accuracy, which are important for a risk-controlled recognition system.

語彙：

indicators

訳：

PFEによって推定される不確実性は潜在的な照合精度の優れた指標としても機能する，これはリスク管理された認識システムにとって重要である。

2 Copy-and-Paste Networks for Deep Video Inpainting

原文：

Copy-and-Paste Networks for Deep Video Inpainting

Abstract：

We present a novel deep learning based algorithm for video inpainting.

語彙：

inpainting

訳：

ビデオ修復のための新しいディープラーニングベースのアルゴリズムを紹介する。

Video inpainting is a process of completing corrupted or missing regions in videos.

訳：

ビデオ修復はビデオ内の破損または欠落した領域を埋めるプロセスである。

Video inpainting has additional challenges compared to image inpainting due to the extra temporal information as well as the need for maintaining the temporal coherency.

訳：

ビデオ修復には余分な時間情報と時間的一貫性を維持する必要があるため，画像修復に比べて追加の課題がある。

We propose a novel DNN-based framework called the Copy-and-Paste Networks for video inpainting that takes advantage of additional information in other frames of the video.

訳：

我々はビデオの他のフレームの追加情報を利用するビデオ修復のためのCopy-and-Paste Networksと呼ばれる新しいDNNベースのフレームワークを提案する。

The network is trained to copy corresponding contents in reference frames and paste them to fill the holes in the target frame.

語彙：

corresponding

訳：

このネットワークは参照フレームの対応するコンテンツをコピーして貼り付け，ターゲットフレームの穴を埋めるように学習されている。

Our network also includes an alignment network that computes affine matrices between frames for the alignment, enabling the network to take information from more distant frames for robustness.

訳：

我々のネットワークには位置合わせのためにフレーム間のアフィン行列を計算するアライメントネットワークも含まれており，これにより頑健性のために，より離れたフレームからも情報を取得できる。
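
To make the "align, then copy and paste" idea concrete, here is a small hand-written Python sketch. It is not the paper's network: the affine matrix is assumed to be known rather than predicted, frames are plain nested lists of grey values, and sampling is nearest-neighbour. The names `apply_affine` and `copy_and_paste` are hypothetical.

```python
def apply_affine(A, x, y):
    """Map a target-frame pixel (x, y) into the reference frame with a 2x3 affine matrix."""
    rx = A[0][0] * x + A[0][1] * y + A[0][2]
    ry = A[1][0] * x + A[1][1] * y + A[1][2]
    return rx, ry


def copy_and_paste(target, hole_mask, reference, A):
    """Fill the hole pixels of `target` with the aligned pixels of `reference`."""
    out = [row[:] for row in target]
    for y in range(len(target)):
        for x in range(len(target[0])):
            if not hole_mask[y][x]:
                continue  # keep pixels that are already valid
            rx, ry = apply_affine(A, x, y)
            rx, ry = round(rx), round(ry)  # nearest-neighbour sampling
            if 0 <= ry < len(reference) and 0 <= rx < len(reference[0]):
                out[y][x] = reference[ry][rx]
    return out


# Toy example: the reference frame shows the same scene shifted one pixel to the right.
target = [[1, 0, 3], [4, 5, 6]]          # the 0 marks a missing pixel
hole_mask = [[False, True, False], [False, False, False]]
reference = [[9, 1, 2, 3], [9, 4, 5, 6]]
shift_right = [[1, 0, 1], [0, 1, 0]]     # x_ref = x + 1, y_ref = y
filled = copy_and_paste(target, hole_mask, reference, shift_right)
```

Hole pixels that map outside the reference frame are left untouched in this sketch; a real system would need another reference frame or a fallback for them.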

Our method produces visually pleasing and temporally coherent results while running faster than the state-of-the-art optimization-based method.

語彙：

pleasing

訳：

我々の方法は最先端の最適化ベースの方法よりも高速に実行しながら，視覚的に心地よく時間的に一貫した結果を生成する。

In addition, we extend our framework for enhancing over/under exposed frames in videos.

訳：

さらに，ビデオ内の露出オーバー／露出アンダーのフレームを強化（補正）するためにフレームワークを拡張する。

Using this enhancement technique, we were able to significantly improve the lane detection accuracy on road videos.

訳：

この強化技術を使用して，道路ビデオの車線検出精度を大幅に改善することができた。

3 Learning Meshes for Dense Visual SLAM

原文：

Learning Meshes for Dense Visual SLAM

Abstract：

Estimating motion and surrounding geometry of a moving camera remains a challenging inference problem.

語彙：

inference

訳：

移動中のカメラの動きと周囲の形状を推定することは依然として困難な推論問題である。

From an information theoretic point of view, estimates should get better as more information is included, such as is done in dense SLAM, but this is strongly dependent on the validity of the underlying models.

語彙：

estimates

underlying

訳：

情報理論の観点からは，dense SLAMで行われるように，より多くの情報が含まれるほど推定は良くなるはずだが，これは基礎となるモデルの妥当性に強く依存する。

In the present paper, we use triangular meshes as both compact and dense geometry representation.

語彙：

geometry

訳：

本論文では，コンパクトで高密度な幾何学表現として三角形メッシュを使用する。

To allow for simple and fast usage, we propose a view-based formulation for which we predict the in-plane vertex coordinates directly from images and then employ the remaining vertex depth components as free variables.

語彙：

coordinates

訳：

シンプルで高速な使用を可能にするために，我々は画像から直接面内の頂点座標を予測し，残りの頂点深度コンポーネントを自由変数として使用する，ビューベースの定式化を提案する。

Flexible and continuous integration of information is achieved through the use of a residual based inference technique.

訳：

情報の柔軟で継続的な統合は残差ベースの推論手法を使用して実現される。

This so-called factor graph encodes all information as mapping from free variables to residuals, the squared sum of which is minimised during inference.

語彙：

so-called

factor graph

訳：

このいわゆる因子グラフはすべての情報を自由変数から残差へのマッピングとしてエンコードする，残差の2乗和は推論中に最小化される。
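
The idea of mapping free variables to residuals and minimising their squared sum can be sketched with a tiny Gauss-Newton loop. This is only an illustration under strong assumptions: there is a single free variable (think of one vertex depth), the residuals are hand-written rather than learned, and the Jacobian is approximated numerically. The function name `gauss_newton` is hypothetical.

```python
def gauss_newton(residual_fns, x0, iterations=20, eps=1e-6):
    """Minimise the sum of squared residuals over a single free variable x."""
    x = x0
    for _ in range(iterations):
        r = [f(x) for f in residual_fns]
        # Numerical Jacobian: derivative of each residual with respect to x.
        J = [(f(x + eps) - f(x - eps)) / (2.0 * eps) for f in residual_fns]
        JTJ = sum(j * j for j in J)
        if JTJ == 0.0:
            break  # no residual depends on x, so nothing can be improved
        x -= sum(j * ri for j, ri in zip(J, r)) / JTJ
    return x


# Two hypothetical factors on one vertex depth d:
# a direct depth observation (d - 2.0) and an inverse-depth observation (1/d - 0.5).
factors = [lambda d: d - 2.0, lambda d: 1.0 / d - 0.5]
depth = gauss_newton(factors, x0=1.0)
```

Each factor only has to return a residual, so new sources of information can be added by appending another function to the list, which is the flexibility the abstract attributes to the residual-based formulation.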

We propose the use of different types of learnable residuals, which are trained end-to-end to increase their suitability as information bearing models and to enable accurate and reliable estimation.

訳：

我々はさまざまなタイプの学習可能な残差の使用を提案する，これは情報を含むモデルとしての適合性を高め，正確で信頼できる推定を可能にするためにエンドツーエンドで学習される。

Detailed evaluation of all components is provided on both synthetic and real data which confirms the practicability of the presented approach.

語彙：

practicability

訳：

すべてのコンポーネントの詳細な評価を合成データと実データの両方で行い，提示されたアプローチの実用性を確認している。

4 Learning Single Camera Depth Estimation using Dual-Pixels

原文：

Learning Single Camera Depth Estimation using Dual-Pixels

Abstract：

Deep learning techniques have enabled rapid progress in monocular depth estimation, but their quality is limited by the ill-posed nature of the problem and the scarcity of high quality datasets.

語彙：

monocular

scarcity

訳：

深層学習技術により単眼深度推定の急速な進歩が可能になったが，その品質は問題の不良設定性（ill-posed）と高品質なデータセットの不足により制限されている。

We estimate depth from a single camera by leveraging the dual-pixel auto-focus hardware that is increasingly common on modern camera sensors.

訳：

我々は最新のカメラセンサーでますます一般的になっているdual-pixel auto-focusハードウェアを活用して，1台のカメラから深度を推定する。

Classic stereo algorithms and prior learning-based depth estimation techniques underperform when applied on this dual-pixel data, the former due to too-strong assumptions about RGB image matching, and the latter due to not leveraging the understanding of optics of dual-pixel image formation.

語彙：

underperform

assumptions

訳：

古典的なステレオアルゴリズムと従来の学習ベースの深度推定手法は，このデュアルピクセルデータに適用すると十分な性能が出ない。前者はRGB画像マッチングに関する仮定が強すぎるためであり，後者はデュアルピクセル画像形成の光学的な理解を活用していないためである。

To allow learning based methods to work well on dual-pixel imagery, we identify an inherent ambiguity in the depth estimated from dual-pixel cues, and develop an approach to estimate depth up to this ambiguity.

訳：

学習ベースの方法がデュアルピクセル画像でうまく機能するように，我々はデュアルピクセルの手がかりから推定される深度に内在するあいまいさを特定し，このあいまいさを除いて（up to this ambiguity）深度を推定するアプローチを開発する。
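
"Estimating depth up to an ambiguity" can be illustrated with an error measure that ignores the ambiguous part. The abstract does not state the form of the ambiguity, so this sketch assumes it is an unknown affine transform (scale `a` and offset `b`) of inverse depth, and fits that transform by least squares before measuring the error. The names `fit_affine` and `affine_invariant_rmse` are hypothetical.

```python
def fit_affine(pred, target):
    """Least-squares a, b such that a * pred + b best matches target.

    Assumes `pred` is not constant (otherwise the scale is undefined).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    a = cov / var_p
    b = mean_t - a * mean_p
    return a, b


def affine_invariant_rmse(pred_inv_depth, true_inv_depth):
    """RMSE after removing the best-fitting affine transform of the prediction."""
    a, b = fit_affine(pred_inv_depth, true_inv_depth)
    sq = [(a * p + b - t) ** 2 for p, t in zip(pred_inv_depth, true_inv_depth)]
    return (sum(sq) / len(sq)) ** 0.5


# A prediction that is wrong only by an affine transform gets (near) zero error.
true_inv = [0.5, 0.25, 0.2, 0.1]
pred_inv = [2.0 * t + 3.0 for t in true_inv]
error = affine_invariant_rmse(pred_inv, true_inv)
```

Under this assumption a model is not penalised for the part of the answer that the dual-pixel cues cannot determine, only for the part they can.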

Using our approach, existing monocular depth estimation techniques can be effectively applied to dual-pixel data, and much smaller models can be constructed that still infer high quality depth.

訳：

このアプローチを使用すると，既存の単眼深度推定技術をデュアルピクセルデータに効果的に適用でき，それでも高品質な深度を推定できる，はるかに小さなモデルを構築できる。

To demonstrate this, we capture a large dataset of in-the-wild 5-viewpoint RGB images paired with corresponding dual-pixel data, and show how view supervision with this data can be used to learn depth up to the unknown ambiguities.

訳：

これを示すために，我々は実環境（in-the-wild）で撮影した5視点のRGB画像と，対応するデュアルピクセルデータをペアにした大規模なデータセットを収集し，このデータによるビュー間の教師信号（view supervision）を用いて，未知のあいまいさを除いて深度を学習する方法を示す。

On our new task, our model is 30% more accurate than any prior work on learning-based monocular or stereoscopic depth estimation.

語彙：

stereoscopic

訳：

この新しいタスクにおいて，我々のモデルは学習ベースの単眼または立体視の深度推定に関するどの先行研究よりも30％精度が高い。

5 YOLACT: Real-time Instance Segmentation

原文：

YOLACT: Real-time Instance Segmentation

Abstract：

We present a simple, fully-convolutional model for real-time instance segmentation that achieves 29.8 mAP on MS COCO at 33 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach.

訳：

我々はreal-time instance segmentationのためのシンプルな完全畳み込み（fully-convolutional）モデルを提示する。このモデルは単一のTitan Xpで評価して33 fpsで動作しつつMS COCOで29.8 mAPを達成し，これは従来の競合アプローチのどれよりも大幅に高速である。

Moreover, we obtain this result after training on only one GPU.

訳：

さらに，我々は1つのGPUのみで学習した後にこの結果を取得する。

We accomplish this by breaking instance segmentation into two parallel subtasks:

(1) generating a set of prototype masks

and (2) predicting per-instance mask coefficients.

語彙：

coefficients

訳：

これはインスタンスのセグメンテーションを2つの並列サブタスクに分割することで実現する：

（1）プロトタイプマスクセットの生成

（2）インスタンスごとのマスク係数の予測。

Then we produce instance masks by linearly combining the prototypes with the mask coefficients.

訳：

次に，プロトタイプをマスク係数と線形結合することによりインスタンスマスクを作成する。
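
The linear combination step can be written in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: prototypes are plain nested lists, the coefficients are given rather than predicted, and the final sigmoid that squashes the result into the range 0 to 1 is my addition since the abstract only mentions the linear combination. The name `assemble_mask` is hypothetical.

```python
import math


def assemble_mask(prototypes, coefficients):
    """Linearly combine k prototype masks (each h x w) with k coefficients.

    A sigmoid is applied at the end (an assumption of this sketch) so that
    every pixel becomes a soft mask value between 0 and 1.
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    combined = [[0.0] * w for _ in range(h)]
    for proto, c in zip(prototypes, coefficients):
        for y in range(h):
            for x in range(w):
                combined[y][x] += c * proto[y][x]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in combined]


# Two toy prototypes: one fires on the left column, one on the top row.
left_column = [[1.0, 0.0], [1.0, 0.0]]
top_row = [[1.0, 1.0], [0.0, 0.0]]
# A positive and a negative coefficient select "left but not top".
mask = assemble_mask([left_column, top_row], [6.0, -6.0])
```

Because the prototypes are shared by all instances and only the small coefficient vector differs per instance, the per-instance cost is essentially this weighted sum.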

We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free.

語彙：

exhibits

訳：

このプロセスは再プーリング（repooling）に依存しないため，このアプローチは非常に高品質なマスクを生成し，追加コストなしで時間的な安定性を示すことがわかった。

Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional.

訳：

さらに，プロトタイプの創発的な振る舞いを分析し，完全畳み込みであるにもかかわらず，プロトタイプが並進に依存する（translation variant）形でインスタンスの位置を特定することを自ら学習することを示す。