100 Paper Abstracts Drill (100本ノック) #7 - 十の並列した脳

After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection.

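Translation:

After reviewing existing edge- and gradient-based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors greatly outperform existing feature sets for human detection.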

We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results.

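Vocabulary:

influence on

fine-scale

binning

coarse

overlapping

Translation:

We study how each stage of the computation affects performance and conclude that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results.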
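The four stages named in this sentence can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' implementation: the cell size (4 pixels), the 9 unsigned orientation bins, the 2x2-cell blocks, and the function names `cell_histograms` and `hog_descriptor` are all assumptions chosen for the example.

```python
import math

def cell_histograms(img, cell=4, bins=9):
    """Magnitude-weighted orientation histograms, one per cell (unsigned, 0-180 degrees)."""
    h, w = len(img), len(img[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell):
        for x in range(cols * cell):
            # Fine-scale gradients: centered differences, clamped at the border.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            # Orientation binning: fold the angle into [0, 180) and pick a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = int(angle / (180.0 / bins)) % bins
            # Coarse spatial binning: every pixel votes into its cell.
            hists[y // cell][x // cell][b] += mag
    return hists

def hog_descriptor(img, cell=4, bins=9, block=2, eps=1e-6):
    """Concatenate L2-normalized histograms from overlapping blocks (stride of one cell)."""
    hists = cell_histograms(img, cell, bins)
    rows, cols = len(hists), len(hists[0])
    feats = []
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            v = [x for dr in range(block) for dc in range(block)
                 for x in hists[r + dr][c + dc]]
            # Local contrast normalization within the block.
            norm = math.sqrt(sum(x * x for x in v) + eps)
            feats.extend(x / norm for x in v)
    return feats

# 8x8 image with a vertical step edge: dark left half, bright right half.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
descriptor = hog_descriptor(image)
print(len(descriptor))  # 36 = 1 block x 4 cells x 9 bins
```

Because the blocks advance by one cell, each cell in a larger image contributes to several blocks, each time normalized against different neighbors. That is the "overlapping" part of the sentence.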

The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

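Vocabulary:

near-perfect

original

Translation:

Because the new approach separates the original MIT pedestrian database almost perfectly, we introduce a more challenging dataset of over 1800 annotated human images with a wide range of pose variations and backgrounds.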

32 R-CNN (2013)

Original paper:

Rich feature hierarchies for accurate object detection and semantic segmentation

Abstract:

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years.

Vocabulary:

canonical

plateaued

Translation:

Object detection performance, as measured on the standard PASCAL VOC dataset, has leveled off over the last few years.

The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context.

Vocabulary:

best-performing

Translation:

The top-performing methods are complex ensemble systems, typically combining several low-level image features with high-level context.

In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012---achieving a mAP of 53.3%.

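Translation:

In this paper, we propose a simple, scalable detection algorithm that raises mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, reaching a mAP of 53.3%.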
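The "30% relative" wording is easy to misread as 30 percentage points. A small arithmetic sketch shows what the sentence implies about the earlier result. The helper name `relative_gain` is made up for this note, and the bound is derived only from the two numbers in the abstract.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old

new_map = 53.3   # mAP (%) reported in the abstract
min_gain = 0.30  # "more than 30% relative"

# A relative gain above 30% means the previous best was below new / 1.3.
upper_bound_old = new_map / (1.0 + min_gain)
print(round(upper_bound_old, 1))  # 41.0
```

So the previous best must have been below about 41% mAP. The improvement is roughly 12 points in absolute terms, not 30.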

Our approach combines two key insights:

(1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects

and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost.

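Vocabulary:

high-capacity

proposals

scarce

pre-training

Translation:

Our approach combines two key insights:

(1) high-capacity CNNs can be applied to bottom-up region proposals to localize and segment objects

(2) when labeled training data is scarce, supervised pre-training on an auxiliary task followed by domain-specific fine-tuning gives a large performance boost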

Since we combine region proposals with CNNs, we call our method R-CNN:

Regions with CNN features.

Translation:

Because we combine region proposals with CNNs, we call the method R-CNN (Regions with CNN features).

We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture.

Translation:

We also compare R-CNN with OverFeat, a recently proposed sliding-window detector built on a similar CNN architecture.

We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset.

Translation:

We find that R-CNN beats OverFeat by a wide margin on the 200-class ILSVRC2013 detection dataset.

Source code for the complete system is available at http://www.rossgirshick.info/ .

Translation:

The source code for the complete system can be obtained from http://www.rossgirshick.info/ .

33 SPP-net (2014)

Original paper:

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Abstract:

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image.

Translation:

Existing CNNs need an input image of a fixed size (for example, 224x224).

This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale.

Vocabulary:

artificial

sub-images

arbitrary

Translation:

This requirement is "artificial" and may lower recognition accuracy for images or sub-images of arbitrary size/scale.

In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement.

Vocabulary:

equip

Translation:

Here we equip the networks with another pooling strategy, "spatial pyramid pooling", to remove this requirement.

The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale.

Vocabulary:

regardless

Translation:

The new network structure, called SPP-net, can produce a fixed-length representation no matter what the image size/scale is.
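A fixed-length output from a variable-size input can be sketched as follows: each pyramid level splits the feature map into an n x n grid and keeps one pooled value per bin, so the output length depends only on the levels. This is a single-channel illustration in plain Python. The levels (1, 2, 4), the use of max pooling, and the names `ceil_div` and `spp` are assumptions for the example, not the configuration used in the paper.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def spp(fmap, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map into n x n bins for each pyramid level."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            # Bin edges use floor/ceil so that no bin is ever empty.
            ys = range((i * h) // n, ceil_div((i + 1) * h, n))
            for j in range(n):
                xs = range((j * w) // n, ceil_div((j + 1) * w, n))
                out.append(max(fmap[y][x] for y in ys for x in xs))
    return out

# Two feature maps of different sizes give vectors of the same length.
small = [[float(y * 5 + x) for x in range(5)] for y in range(7)]
large = [[float(y * 13 + x) for x in range(13)] for y in range(10)]
print(len(spp(small)), len(spp(large)))  # 21 21  (1 + 4 + 16 bins)
```

With C channels the same pooling would run per channel, giving 21 x C values. That fixed length is what a fully connected layer needs.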

Pyramid pooling is also robust to object deformations.

Translation:

Pyramid pooling is robust to object deformation as well.

With these advantages, SPP-net should in general improve all CNN-based image classification methods.

Translation:

Given these advantages, SPP-net should generally improve every CNN-based image classification method.

On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs.

Vocabulary:

despite

Translation:

On the ImageNet 2012 dataset, we demonstrate that SPP-net raises the accuracy of a variety of CNN architectures even though their designs differ.

On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning.

Translation:

On the Pascal VOC 2007 and Caltech101 datasets, SPP-net reaches state-of-the-art classification results with a single full-image representation and without fine-tuning.

The power of SPP-net is also significant in object detection.

Translation:

SPP-net's strength also shows clearly in object detection.

Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors.

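Vocabulary:

entire

pool

Translation:

With SPP-net, we compute the feature maps from the entire image only once, then pool features within arbitrary regions (sub-images) to generate fixed-length representations for training the detectors.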
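The "compute once, pool many times" idea can be sketched like this. A single feature map stands in for the convolutional output of the whole image, and every candidate box is projected onto it and pooled into a fixed grid. The stride of 16, the 2x2 grid, the box format (x0, y0, x1, y1) in image pixels, and the name `pool_region` are all assumptions for illustration; boxes are assumed to lie inside the image.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def pool_region(fmap, box, stride=16, grid=2):
    """Max-pool the feature-map window under an image-space box into grid x grid bins."""
    x0, y0, x1, y1 = box
    # Project image coordinates onto the feature map (stride is an assumed value).
    fx0, fy0 = x0 // stride, y0 // stride
    fx1 = min(len(fmap[0]), max(fx0 + 1, ceil_div(x1, stride)))
    fy1 = min(len(fmap), max(fy0 + 1, ceil_div(y1, stride)))
    h, w = fy1 - fy0, fx1 - fx0
    pooled = []
    for i in range(grid):
        ys = range(fy0 + (i * h) // grid, fy0 + ceil_div((i + 1) * h, grid))
        for j in range(grid):
            xs = range(fx0 + (j * w) // grid, fx0 + ceil_div((j + 1) * w, grid))
            pooled.append(max(fmap[y][x] for y in ys for x in xs))
    return pooled

# Stand-in for the conv features of a 128x128 image, computed a single time.
feature_map = [[float(y * 8 + x) for x in range(8)] for y in range(8)]

# Every proposal reuses the same feature map; no convolution is rerun per box.
proposals = [(0, 0, 64, 64), (32, 16, 128, 112)]
features = [pool_region(feature_map, box) for box in proposals]
print(features[0])  # [9.0, 11.0, 25.0, 27.0]
```

Both boxes give a 4-value vector even though they cover different areas, so the same detector can be trained on all of them.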

This method avoids repeatedly computing the convolutional features.

Vocabulary:

avoids

Translation:

This removes the need to compute the convolutional features over and over.

In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007.

Translation:

When processing test images, our method is 24-102x faster than R-CNN while reaching better or comparable accuracy on Pascal VOC 2007.

In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams.

Translation:

In ILSVRC 2014, our methods ranked #2 in object detection and #3 in image classification out of all 38 teams.

This manuscript also introduces the improvement made for this competition.

Vocabulary:

manuscript

Translation:

This manuscript also describes the improvements made for this competition.

34 Fast R-CNN (2015)

Original paper:

Fast R-CNN

Abstract:

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection.

Translation:

In this paper we propose a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection.

Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks.

Vocabulary:

previous work

Translation:

Fast R-CNN builds on previous work and uses deep convolutional networks to classify object proposals efficiently.

Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy.

Translation:

Compared with previous work, Fast R-CNN adopts several innovations that speed up training and testing while also raising detection accuracy.

Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012.

Translation:

Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, runs 213x faster at test time, and reaches a higher mAP on PASCAL VOC 2012.

Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate.

Translation:

Compared with SPPnet, Fast R-CNN trains VGG16 3x faster, runs tests 10x faster, and is more accurate.

Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

Translation:

Fast R-CNN is implemented in Python and C++ (using Caffe), and it is released under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

35 Faster R-CNN (2015)

Original paper:

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Abstract:

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.

Vocabulary:

hypothesize

Translation:

State-of-the-art object detection networks rely on region proposal algorithms to hypothesize where objects are.

Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck.

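Translation:

Advances such as SPPnet and Fast R-CNN have shortened the running time of these detection networks, which exposes region proposal computation as the bottleneck.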

In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.

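Vocabulary:

cost-free

Translation:

Here we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, making region proposals nearly cost-free.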

An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position.

Vocabulary:

simultaneously

Translation:

An RPN is a fully convolutional network that predicts object bounds and objectness scores at each position at the same time.
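The abstract does not say what "at each position" is measured against. The body of the Faster R-CNN paper uses reference boxes called anchors, placed at every feature-map position, and the network predicts an objectness score and box offsets for each one. Below is a rough sketch of the anchor layout only (no network). The stride of 16 and the three scales and three aspect ratios are illustrative defaults here, and the name `make_anchors` is made up for this note.

```python
import math

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x0, y0, x1, y1) reference boxes for every feature-map position."""
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            # Center of this feature-map cell in image coordinates.
            cx, cy = (fx + 0.5) * stride, (fy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r is height / width; the area stays at s * s.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(2, 3)
print(len(anchors))  # 54 = 2 x 3 positions x 9 anchors
```

With k anchors per position, the network outputs k objectness scores and 4k box coordinates at each position. Because all of it is convolutional, the cost is shared with the detection features.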

The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.

Translation:

The RPN is trained end-to-end to generate high-quality region proposals, which Fast R-CNN then uses for detection.

We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look.

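Vocabulary:

merge

terminology

unified

Translation:

We further merge the RPN and Fast R-CNN into a single network by sharing their convolutional features. In the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look.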

For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.

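Translation:

With the very deep VGG-16 model, our detection system runs at 5fps (including all steps) on a GPU while reaching state-of-the-art object detection accuracy on the PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.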