論文Abstract100本ノック#8 - 十の並列した脳

36 YOLO（2015）
37 SSD（2015）
38 Mask R-CNN（2017）
39 RetinaNet（2017）
40 M2Det（2018）

36 YOLO（2015）

原文：

You Only Look Once: Unified, Real-Time Object Detection

Abstract：

We present YOLO, a new approach to object detection.

訳：

物体検出の新しいアプローチであるYOLOを紹介する。

Prior work on object detection repurposes classifiers to perform detection.

語彙：

repurposes

訳：

物体検出に関する先行研究では，検出を行うために分類器を転用している。

Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

訳：

代わりに，我々は物体検出を，空間的に分離されたバウンディングボックスとそれに関連するクラス確率への回帰問題として定式化する。

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

訳：

単一のニューラルネットワークが，1回の評価で画像全体からバウンディングボックスとクラス確率を直接予測する。
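
A minimal sketch of how such a single-pass prediction could be decoded. The grid size S = 7, B = 2 boxes per cell, and C = 20 classes follow the configuration reported in the body of the YOLO paper for PASCAL VOC, not the abstract. The channel layout, the `decode_yolo_output` helper, and the 0.2 score threshold are assumptions made for illustration only.

```python
import numpy as np

# Assumed values (YOLO paper, PASCAL VOC setup); not stated in the abstract.
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def decode_yolo_output(pred, score_threshold=0.2):
    """Turn an S x S x (B*5 + C) tensor into a list of detections.

    Assumed layout per cell: B blocks of (x, y, w, h, confidence) followed by
    C conditional class probabilities. x, y are offsets inside the grid cell;
    w, h are relative to the whole image.
    """
    assert pred.shape == (S, S, B * 5 + C)
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # Class-specific score = box confidence * conditional class probability.
                scores = conf * class_probs
                cls = int(np.argmax(scores))
                if scores[cls] < score_threshold:
                    continue
                # Convert the cell-relative center to image-relative coordinates.
                cx = (col + x) / S
                cy = (row + y) / S
                detections.append((float(cx), float(cy), float(w), float(h),
                                   cls, float(scores[cls])))
    return detections


pred = np.zeros((S, S, B * 5 + C))
pred[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]  # one confident box in cell (row 3, col 4)
pred[3, 4, B * 5 + 11] = 1.0                 # hypothetical class index 11
detections = decode_yolo_output(pred)
print(detections)
```

Boxes and class probabilities all come out of one tensor, so the whole detector is a single forward pass followed by cheap post-processing.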

Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

訳：

検出パイプライン全体が単一のネットワークであるため，検出性能に基づいてエンドツーエンドで直接最適化できる。

Our unified architecture is extremely fast.

訳：

統合アーキテクチャは非常に高速である。

Our base YOLO model processes images in real-time at 45 frames per second.

訳：

基本的なYOLOモデルは毎秒45フレームでリアルタイムに画像を処理する。

A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.

語彙：

astounding

訳：

ネットワークの小型バージョンであるFast YOLOは，他のリアルタイム検出器の2倍のmAPを達成しながら，毎秒155フレームという驚異的な速度で処理する。

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists.

語彙：

far less

false detections

訳：

最先端の検出システムと比較して，YOLOは位置推定の誤り（localization error）が多いが，何も存在しない場所で誤検出をする可能性ははるかに低い。
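
Localization quality is usually measured with Intersection over Union (IoU) between a predicted box and a ground-truth box. The sketch below is a generic IoU helper, not code from the paper. The corner-based box format and the example boxes are assumptions. A common convention (for example in PASCAL VOC evaluation) treats a detection as correct when IoU is at least 0.5.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero when boxes are disjoint.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
shifted = (5, 0, 15, 10)        # right class, poorly placed box
background = (20, 20, 30, 30)   # detection where nothing exists

print(iou(ground_truth, shifted))     # overlap 50 over union 150
print(iou(ground_truth, background))  # no overlap at all
```

A box like `shifted` illustrates a localization error (some overlap, but too little), while `background` illustrates a false detection on empty background.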

Finally, YOLO learns very general representations of objects.

訳：

最後に，YOLOは物体の非常に一般的な表現を学習する。

It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.

語彙：

artwork

訳：

PicassoデータセットとPeople-Artデータセットの両方で，自然画像から芸術作品へ汎化させた場合，DPMやR-CNNを含む他のすべての検出手法を大幅に上回る。

37 SSD（2015）

原文：

SSD: Single Shot MultiBox Detector

Abstract：

We present a method for detecting objects in images using a single deep neural network.

訳：

単一のDNNを使用して画像内の物体を検出する方法を示す。

Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location.

語彙：

discretizes

aspect ratios

訳：

SSDと名付けた我々のアプローチは，バウンディングボックスの出力空間を，特徴マップの位置ごとに異なるアスペクト比とスケールを持つデフォルトボックスの集合に離散化する。
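
A rough sketch of what "default boxes per feature map location" can look like. The width and height formulas (w = s·√ar, h = s/√ar) and the cell-centered positions follow the body of the SSD paper. The function name, the 4×4 map size, the scale 0.3, and the aspect ratios are illustrative assumptions, and the paper's extra box for aspect ratio 1 is omitted.

```python
import math


def default_boxes(fmap_size, scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h) in normalized [0, 1] image coordinates."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            # Each box is centered on its feature map cell.
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(fmap_size=4, scale=0.3, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))   # 4 * 4 locations * 3 aspect ratios
print(boxes[0])
```

Every location gets the same small set of shapes, which is what turns the continuous space of possible boxes into a fixed, discrete set.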

At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape.

語彙：

presence

adjustments

訳：

予測時に，ネットワークは各デフォルトボックス内の各物体カテゴリの存在のスコアを生成し，物体の形状によりよく一致するようにボックスの調整を生成する。
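
The "adjustments" can be read as offsets relative to each default box. Below is a hedged sketch of the decoding step, inverting the offset encoding described in the SSD paper (center shift scaled by box size, log-scale width and height). Real implementations often add scaling constants ("variances"), which are left out here. `decode_box` and the sample values are hypothetical.

```python
import math


def decode_box(default_box, offsets):
    """Apply predicted offsets (t_cx, t_cy, t_w, t_h) to a default box (cx, cy, w, h)."""
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w       # shift the center in units of box width
    cy = d_cy + t_cy * d_h       # shift the center in units of box height
    w = d_w * math.exp(t_w)      # scale width multiplicatively
    h = d_h * math.exp(t_h)      # scale height multiplicatively
    return (cx, cy, w, h)


default_box = (0.5, 0.5, 0.2, 0.4)
unchanged = decode_box(default_box, (0.0, 0.0, 0.0, 0.0))
adjusted = decode_box(default_box, (0.5, -0.25, math.log(2.0), 0.0))
print(unchanged)
print(adjusted)
```

Zero offsets return the default box itself, so the network only has to learn small corrections around a reasonable starting shape.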

Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.

語彙：

resolutions

handle

訳：

さらに，ネットワークは異なる解像度を持つ複数の特徴マップからの予測を組み合わせて，さまざまなサイズの物体を自然に扱う。
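
One way to see how different feature maps cover different object sizes: the SSD paper spaces the default box scale linearly between a minimum and a maximum across the m feature maps. The sketch below assumes the values s_min = 0.2 and s_max = 0.9 given in the paper body. The function name and the choice of six maps are illustrative.

```python
def ssd_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced default box scales, one per feature map (finest map first)."""
    if num_maps < 2:
        raise ValueError("need at least two feature maps to interpolate scales")
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


scales = ssd_scales(6)
print([round(s, 2) for s in scales])
```

High-resolution maps get small scales and handle small objects, while coarse maps get large scales and handle large objects.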

Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network.

語彙：

relative

subsequent

encapsulates

訳：

SSDモデルは物体候補領域（object proposal）を必要とする手法に比べて単純である。候補領域の生成と，それに続くピクセルまたは特徴のリサンプリング段階を完全に排除し，すべての計算を単一のネットワークにカプセル化するためである。

This makes SSD easy to train and straightforward to integrate into systems that require a detection component.

語彙：

straightforward

訳：

これによりSSDの学習が容易になり検出要素を必要とするシステムに簡単に統合できる。

Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference.

語彙：

confirm

utilize

訳：

PASCAL VOC，MS COCO，およびILSVRCデータセットの実験結果はSSDが追加の物体提案ステップを利用する方法に匹敵する精度を持ちはるかに高速であると同時に，学習と推論の両方に統一されたフレームワークを提供することを確認する。

For 300×300 input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for 500×500 input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model.

訳：

300×300入力の場合，SSDはNvidia Titan X上で58 FPSで動作しつつVOC2007 testで72.1% mAPを達成し，500×500入力の場合は75.1% mAPを達成して，同等の最先端Faster R-CNNモデルを上回る。

Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size.

訳：

他のシングルステージ方式と比較して，SSDは入力画像サイズが小さくても精度がはるかに高くなる。

Code is available at https://github.com/weiliu89/caffe/tree/ssd.

訳：

コードはhttps://github.com/weiliu89/caffe/tree/ssdで入手できる。

38 Mask R-CNN（2017）

原文：

Mask R-CNN

Abstract：

We present a conceptually simple, flexible, and general framework for object instance segmentation.

語彙：

conceptually

訳：

物体インスタンスセグメンテーションのための，概念的にシンプルで柔軟かつ汎用的なフレームワークを示す。

Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.

訳：

我々のアプローチは画像内の物体を効率的に検出すると同時にインスタンスごとに高品質のセグメンテーションマスクを生成する。

The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

訳：

Mask R-CNNと呼ばれる方法は，バウンディングボックス認識の既存のブランチと並行してオブジェクトマスクを予測するためのブランチを追加することによりFaster R-CNNを拡張する。
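
To make the "parallel branch" concrete, here is a hedged sketch of the loss side only. The body of the Mask R-CNN paper describes a multi-task loss L = L_cls + L_box + L_mask, where L_mask is an average per-pixel binary cross-entropy evaluated only on the mask of the ground-truth class. The function names, array shapes, and toy inputs below are assumptions for illustration, and no network is modeled.

```python
import numpy as np


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class mask.

    mask_logits: (K, m, m) array, one m x m mask per class.
    gt_mask:     (m, m) array of 0/1 values.
    """
    logits = mask_logits[gt_class]          # masks of the other classes are ignored
    probs = 1.0 / (1.0 + np.exp(-logits))   # per-pixel sigmoid
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)
    bce = -(gt_mask * np.log(probs) + (1 - gt_mask) * np.log(1 - probs))
    return float(bce.mean())


def multitask_loss(l_cls, l_box, l_mask):
    """Sum of the classification, box regression, and mask losses."""
    return l_cls + l_box + l_mask


mask_logits = np.zeros((3, 2, 2))            # K = 3 classes, 2 x 2 masks (toy sizes)
gt_mask = np.array([[1, 0], [0, 1]])
l_mask = mask_loss(mask_logits, gt_mask, gt_class=1)
total = multitask_loss(l_cls=0.5, l_box=0.25, l_mask=l_mask)
print(l_mask, total)
```

Because the mask term is simply added to the existing detection losses, the mask branch can be trained alongside the box branch without changing it.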

Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.

訳：

Mask R-CNNは学習が簡単で，Faster R-CNNにわずかなオーバーヘッドを追加するだけであり，5 fpsで動作する。

Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework.

訳：

さらに，Mask R-CNNは他のタスクに簡単に一般化でき，たとえば同じフレームワークで人間の姿勢を推定できる。

We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection.

訳：

インスタンスセグメンテーション，バウンディングボックスによる物体検出，人物のキーポイント検出を含む，COCOチャレンジ群の3つのトラックすべてで最高の結果を示す。

Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners.

語彙：

bells and whistles

訳：

特別な工夫（bells and whistles）なしでも，Mask R-CNNはCOCO 2016チャレンジの優勝モデルを含む既存のすべての単一モデルエントリを，すべてのタスクで上回る。

We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

訳：

シンプルで効果的なアプローチが強固なベースラインとして機能し，インスタンスレベルの認識に関する今後の研究を容易にすることを期待する。

Code has been made available at: https://github.com/facebookresearch/Detectron

訳：

コードは次の場所から入手できる。

https://github.com/facebookresearch/Detectron

39 RetinaNet（2017）

原文：

Focal Loss for Dense Object Detection

Abstract：

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations.

語彙：

popularized

訳：

これまでの最高精度の物体検出器は，R-CNNによって広められた2段階アプローチに基づいており，候補となる物体位置の疎な集合に分類器が適用される。

In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far.

訳：

対照的に，物体が存在しうる位置を規則的かつ高密度にサンプリングして適用される1段検出器は，より高速かつシンプルになる可能性があるが，これまでのところ2段検出器の精度には及んでいない。

In this paper, we investigate why this is the case.

訳：

本論文では，なぜそうなのかを調査する。

We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause.

語彙：

encountered

訳：

高密度検出器の学習中に発生する極度の前景と背景のクラスの不均衡が中心的な原因であることを発見した。

We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples.

訳：

よく分類された例に割り当てられた損失の重みを小さくするように，標準クロスエントロピー損失を再形成することによってこのクラスの不均衡に対処することを提案する。
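
The reshaped loss is defined in the body of the paper as FL(p_t) = -α_t (1 - p_t)^γ log(p_t). A minimal NumPy sketch under that definition follows. The defaults γ = 2 and α = 0.25 are the values the paper reports as working well, while the function names and the sample probabilities are illustrative assumptions.

```python
import numpy as np


def cross_entropy(p, y):
    """Standard binary cross-entropy per example: -log(p_t)."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)   # probability assigned to the true class
    return -np.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-example focal loss for labels y in {0, 1} and foreground probabilities p."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks toward 0 for well-classified examples (p_t near 1).
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


p = np.array([0.9, 0.1])   # one easy and one hard positive (hypothetical values)
y = np.array([1, 1])
ce = cross_entropy(p, y)
fl = focal_loss(p, y)
print(ce)
print(fl)
print(fl / ce)             # the easy example is down-weighted far more strongly
```

With γ = 0 the modulating factor disappears and the loss falls back to an α-weighted cross-entropy, which is a convenient sanity check.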

Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.

語彙：

Focal Loss

vast number of

overwhelming

訳：

我々の新しいFocal Lossは，難しい例の疎な集合に学習を集中させ，膨大な数の簡単な負例が学習中に検出器を圧倒するのを防ぐ。
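
A small worked example of why easy negatives can overwhelm training. The counts and probabilities are invented for illustration (100,000 easy examples with p_t = 0.99 against 100 hard examples with p_t = 0.30). The α weighting is left out so that only the effect of the (1 - p_t)^γ factor with γ = 2 is visible.

```python
import math


def ce(p_t):
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_t)


def fl(p_t, gamma=2.0):
    """Focal loss without alpha-balancing: (1 - p_t)^gamma * CE."""
    return (1.0 - p_t) ** gamma * ce(p_t)


# Hypothetical batch composition, chosen only to illustrate the imbalance.
n_easy, p_easy = 100_000, 0.99
n_hard, p_hard = 100, 0.30

ce_easy_total = n_easy * ce(p_easy)
ce_hard_total = n_hard * ce(p_hard)
fl_easy_total = n_easy * fl(p_easy)
fl_hard_total = n_hard * fl(p_hard)

ce_easy_share = ce_easy_total / (ce_easy_total + ce_hard_total)
fl_easy_share = fl_easy_total / (fl_easy_total + fl_hard_total)

print(f"easy share of total CE loss: {ce_easy_share:.3f}")
print(f"easy share of total FL loss: {fl_easy_share:.4f}")
```

Under cross-entropy the easy examples account for most of the summed loss in this toy setting, whereas under the focal loss almost all of it comes from the hard examples.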

To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet.

訳：

損失の有効性を評価するために，RetinaNetと呼ばれる単純な高密度検出器を設計および学習する。

Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.

語彙：

surpassing

訳：

我々の結果は，focal lossで学習した場合，RetinaNetが既存のすべての最先端の2段検出器の精度を上回りながら，従来の1段検出器の速度に匹敵できることを示す。

Code is at:

https://github.com/facebookresearch/Detectron

訳：

コードは次の場所にある：

https://github.com/facebookresearch/Detectron

40 M2Det（2018）

原文：

M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

Abstract：

Feature pyramids are widely exploited by both the state-of-the-art one-stage object detectors (e.g., DSSD, RetinaNet, RefineDet) and the two-stage object detectors (e.g., Mask R-CNN, DetNet) to alleviate the problem arising from scale variation across object instances.

訳：

Feature pyramidsは，物体インスタンス間のスケールのばらつきから生じる問題を軽減するため，最先端の1段物体検出器（DSSD，RetinaNet，RefineDetなど）と2段物体検出器（Mask R-CNN，DetNetなど）の両方で広く活用されている。

Although these object detectors with feature pyramids achieve encouraging results, they have some limitations due to that they only simply construct the feature pyramid according to the inherent multi-scale, pyramidal architecture of the backbones which are actually designed for object classification task.

訳：

feature pyramidsを備えたこれらの物体検出器は有望な結果を達成しているが，本来は物体分類タスク用に設計されたバックボーンに固有のマルチスケールなピラミッド型アーキテクチャに従って，単純にfeature pyramidを構築しているだけであるため，いくつかの制限がある。

Newly, in this work, we present a method called Multi-Level Feature Pyramid Network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales.

訳：

本研究では新たに，異なるスケールの物体を検出するためのより効果的なfeature pyramidsを構築する，Multi-Level Feature Pyramid Network（MLFPN）と呼ばれる手法を提示する。

First, we fuse multi-level features (i.e. multiple layers) extracted by backbone as the base feature.

訳：

最初に，バックボーンによって抽出されたマルチレベルの特徴（つまり複数のレイヤー）を融合し，ベース特徴とする。

Second, we feed the base feature into a block of alternating joint Thinned U-shape Modules and Feature Fusion Modules and exploit the decoder layers of each u-shape module as the features for detecting objects.

訳：

次に，ベース特徴をThinned U-shape ModulesとFeature Fusion Modulesが交互に連結されたブロックに入力し，各u-shape moduleのデコーダレイヤーを物体検出用の特徴として利用する。

Finally, we gather up the decoder layers with equivalent scales (sizes) to develop a feature pyramid for object detection, in which every feature map consists of the layers (features) from multiple levels.

訳：

最後に，同等のスケール（サイズ）を持つデコーダレイヤーを集めて物体検出用のfeature pyramidを構築する。このピラミッドでは，各特徴マップが複数レベルのレイヤー（特徴）で構成される。
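
The final gathering step can be pictured as grouping decoder outputs by scale and stacking them along the channel axis. The sketch below assumes channel-wise concatenation and uses toy sizes (3 levels, 4 channels, three scales). It does not reproduce the paper's configuration, and any further processing the paper applies after this step is omitted. `build_multilevel_pyramid` is a hypothetical helper.

```python
import numpy as np


def build_multilevel_pyramid(level_outputs):
    """Group decoder feature maps by scale and concatenate them along channels.

    level_outputs[level][scale] is a (C, H, W) array; all levels share the same
    list of scales. Returns one (num_levels * C, H, W) array per scale.
    """
    num_scales = len(level_outputs[0])
    pyramid = []
    for s in range(num_scales):
        same_scale = [maps[s] for maps in level_outputs]  # one map per level
        pyramid.append(np.concatenate(same_scale, axis=0))
    return pyramid


# Toy configuration (assumed for illustration only).
num_levels, channels, sizes = 3, 4, [8, 4, 2]
level_outputs = [
    [np.full((channels, size, size), float(level)) for size in sizes]
    for level in range(num_levels)
]
pyramid = build_multilevel_pyramid(level_outputs)
print([fmap.shape for fmap in pyramid])
```

Each resulting map mixes shallow and deep levels at one spatial size, which is the sense in which every feature map "consists of the layers from multiple levels".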

To evaluate the effectiveness of the proposed MLFPN, we design and train a powerful end-to-end one-stage object detector we call M2Det by integrating it into the architecture of SSD, which gets better detection performance than state-of-the-art one-stage detectors.

訳：

提案するMLFPNの有効性を評価するために，MLFPNをSSDのアーキテクチャに統合し，M2Detと呼ぶ強力なエンドツーエンドの1段物体検出器を設計および学習する。M2Detは最先端の1段検出器よりも優れた検出性能を達成する。

Specifically, on MS-COCO benchmark, M2Det achieves AP of 41.0 at speed of 11.8 FPS with single-scale inference strategy and AP of 44.2 with multi-scale inference strategy, which is the new state-of-the-art results among one-stage detectors.

訳：

具体的には，MS-COCOベンチマークにおいて，M2Detはシングルスケール推論戦略で11.8 FPSの速度で41.0のAPを達成し，マルチスケール推論戦略で44.2のAPを達成した。これは1段検出器の中で新たな最高水準（state-of-the-art）の結果である。