1 Introduction
No. | Survey title | References | Year | Venue | Content
---|---|---|---|---|---
1 | Monocular pedestrian detection: survey and experiments | Enzweiler and Gavrila (2009) | 2009 | PAMI | An evaluation of three pedestrian detectors
2 | Survey of pedestrian detection for advanced driver assistance systems | Geronimo et al. (2010) | 2010 | PAMI | A survey of pedestrian detection for advanced driver assistance systems
3 | Pedestrian detection: an evaluation of the state of the art | Dollar et al. (2012) | 2012 | PAMI | A thorough and detailed evaluation of pedestrian detectors in monocular images
4 | Detecting faces in images: a survey | Yang et al. (2002) | 2002 | PAMI | First survey of face detection from a single image
5 | A survey on face detection in the wild: past, present and future | Zafeiriou et al. (2015) | 2015 | CVIU | A survey of face detection in the wild since 2000
6 | On road vehicle detection: a review | Sun et al. (2006) | 2006 | PAMI | A review of vision-based on-road vehicle detection systems
7 | Text detection and recognition in imagery: a survey | Ye and Doermann (2015) | 2015 | PAMI | A survey of text detection and recognition in color imagery
8 | Toward category level object recognition | Ponce et al. (2007) | 2007 | Book | Representative papers on object categorization, detection, and segmentation
9 | The evolution of object categorization and the challenge of image abstraction | Dickinson et al. (2009) | 2009 | Book | A trace of the evolution of object categorization over four decades
10 | Context based object categorization: a critical survey | Galleguillos and Belongie (2010) | 2010 | CVIU | A review of contextual information for object categorization
11 | 50 years of object recognition: directions forward | Andreopoulos and Tsotsos (2013) | 2013 | CVIU | A review of the evolution of object recognition systems over five decades
12 | Visual object recognition | Grauman and Leibe (2011) | 2011 | Tutorial | Instance and category object recognition techniques
13 | Object class detection: a survey | Zhang et al. (2013) | 2013 | ACM CS | A survey of generic object detection methods before 2011
14 | Feature representation for statistical learning based object detection: a review | Li et al. (2015b) | 2015 | PR | Feature representation methods in statistical learning based object detection, including handcrafted and deep learning based features
15 | Salient object detection: a survey | Borji et al. (2014) | 2014 | arXiv | A survey of salient object detection
16 | Representation learning: a review and new perspectives | Bengio et al. (2013) | 2013 | PAMI | Unsupervised feature learning and deep learning, probabilistic models, autoencoders, manifold learning, and deep networks
17 | Deep learning | LeCun et al. (2015) | 2015 | Nature | An introduction to deep learning and its applications
18 | A survey on deep learning in medical image analysis | Litjens et al. (2017) | 2017 | MIA | A survey of deep learning for image classification, object detection, segmentation and registration in medical image analysis
19 | Recent advances in convolutional neural networks | Gu et al. (2018) | 2017 | PR | A broad survey of the recent advances in CNNs and their applications in computer vision, speech and natural language processing
20 | Tutorial: tools for efficient object detection | − | 2015 | ICCV15 | A short course for object detection only covering recent milestones
21 | Tutorial: deep learning for objects and scenes | − | 2017 | CVPR17 | A high level summary of recent work on deep learning for visual recognition of objects and scenes
22 | Tutorial: instance level recognition | − | 2017 | ICCV17 | A short course on recent advances in instance level recognition, including object detection, instance segmentation and human pose prediction
23 | Tutorial: visual recognition and beyond | − | 2018 | CVPR18 | A tutorial on methods and principles behind image classification, object detection, instance segmentation, and semantic segmentation
24 | Deep learning for generic object detection | Ours | 2019 | VISI | A comprehensive survey of deep learning for generic object detection
1.1 Comparison with Previous Reviews
1.2 Scope
2 Generic Object Detection
2.1 The Problem
2.2 Main Challenges
2.2.1 Accuracy Related Challenges
2.2.2 Efficiency and Scalability Related Challenges
2.3 Progress in the Past 2 Decades
3 A Brief Introduction to Deep Learning
Dataset name | Total images | Categories | Images per category | Objects per image | Image size | Started year | Highlights |
---|---|---|---|---|---|---|---|
PASCAL VOC (2012) (Everingham et al. 2015) | 11,540 | 20 | 303–4087 | 2.4 | \(470\times 380\) | 2005 | Covers only 20 categories that are common in everyday life; Large number of training images; Close to real-world applications; Significantly larger intraclass variations; Objects in scene context; Multiple objects in one image; Contains many difficult samples |
ImageNet (Russakovsky et al. 2015) | 14 million+ | 21,841 | − | 1.5 | \(500\times 400\) | 2009 | Large number of object categories; More instances and more categories of objects per image; More challenging than PASCAL VOC; Backbone of the ILSVRC challenge; Images are object-centric |
MS COCO (Lin et al. 2014) | 328,000+ | 91 | − | 7.3 | \(640\times 480\) | 2014 | Even closer to real world scenarios; Each image contains more instances of objects and richer object annotation information; Contains object segmentation annotation data that is not available in the ImageNet dataset |
Places (Zhou et al. 2017a) | 10 million\(+\) | 434 | − | − | \(256\times 256\) | 2014 | The largest labeled dataset for scene recognition; Four subsets (Places365-Standard, Places365-Challenge, Places205 and Places88) as benchmarks |
Open Images (Kuznetsova et al. 2018) | 9 million\(+\) | 6000\(+\) | − | 8.3 | Varied | 2017 | Annotated with image level labels, object bounding boxes and visual relationships; Open Images V5 supports large scale object detection, object instance segmentation and visual relationship detection |
4 Datasets and Performance Evaluation
4.1 Datasets
Challenge | Object classes | Images (Train) | Images (Val) | Images (Test) | Annotated objects (Train) | Annotated objects (Val) | Images (Train\(+\)Val) | Boxes (Train\(+\)Val) | Boxes/Image (Train\(+\)Val)
---|---|---|---|---|---|---|---|---|---
PASCAL VOC object detection challenge | | | | | | | | |
VOC07 | 20 | 2501 | 2510 | 4952 | 6301 (7844) | 6307 (7818) | 5011 | 12,608 | 2.5
VOC08 | 20 | 2111 | 2221 | 4133 | 5082 (6337) | 5281 (6347) | 4332 | 10,364 | 2.4
VOC09 | 20 | 3473 | 3581 | 6650 | 8505 (9760) | 8713 (9779) | 7054 | 17,218 | 2.3
VOC10 | 20 | 4998 | 5105 | 9637 | 11,577 (13,339) | 11,797 (13,352) | 10,103 | 23,374 | 2.4
VOC11 | 20 | 5717 | 5823 | 10,994 | 13,609 (15,774) | 13,841 (15,787) | 11,540 | 27,450 | 2.4
VOC12 | 20 | 5717 | 5823 | 10,991 | 13,609 (15,774) | 13,841 (15,787) | 11,540 | 27,450 | 2.4
ILSVRC object detection challenge | | | | | | | | |
ILSVRC13 | 200 | 395,909 | 20,121 | 40,152 | 345,854 | 55,502 | 416,030 | 401,356 | 1.0
ILSVRC14 | 200 | 456,567 | 20,121 | 40,152 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
ILSVRC15 | 200 | 456,567 | 20,121 | 51,294 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
ILSVRC16 | 200 | 456,567 | 20,121 | 60,000 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
ILSVRC17 | 200 | 456,567 | 20,121 | 65,500 | 478,807 | 55,502 | 476,668 | 534,309 | 1.1
MS COCO object detection challenge | | | | | | | | |
MS COCO15 | 80 | 82,783 | 40,504 | 81,434 | 604,907 | 291,875 | 123,287 | 896,782 | 7.3
MS COCO16 | 80 | 82,783 | 40,504 | 81,434 | 604,907 | 291,875 | 123,287 | 896,782 | 7.3
MS COCO17 | 80 | 118,287 | 5000 | 40,670 | 860,001 | 36,781 | 123,287 | 896,782 | 7.3
MS COCO18 | 80 | 118,287 | 5000 | 40,670 | 860,001 | 36,781 | 123,287 | 896,782 | 7.3
OICOD18 | 500 | 1,643,042 | 100,000 | 99,999 | 11,498,734 | 696,410 | 1,743,042 | 12,195,144 | 7.0
4.2 Evaluation Criteria
-
The predicted category c equals the ground truth label \(c_g\).
-
The overlap ratio IOU (Intersection Over Union) (Everingham et al. 2010; Russakovsky et al. 2015) between the predicted BB \(b\) and the ground truth \(b^g\) is not smaller than a predefined threshold \(\varepsilon \):$$\begin{aligned} \text {IOU}(b,b^g)=\frac{{ area}\,(b\cap b^g)}{{ area}\,(b\cup b^g)}, \end{aligned}$$(4)where \(\cap \) and \(\cup \) denote intersection and union, respectively. A typical value of \(\varepsilon \) is 0.5.
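Both conditions follow directly from Eq. (4). The sketch below is a minimal illustration (hypothetical helper names; boxes given as `(x1, y1, x2, y2)` corner coordinates), not code from any surveyed detector:

```python
def iou(b, bg):
    """Intersection over union (Eq. 4) of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    # Corners of the intersection rectangle.
    ix1, iy1 = max(b[0], bg[0]), max(b[1], bg[1])
    ix2, iy2 = min(b[2], bg[2]), min(b[3], bg[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    area_bg = (bg[2] - bg[0]) * (bg[3] - bg[1])
    union = area_b + area_bg - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(c, c_g, b, b_g, epsilon=0.5):
    """A detection is correct iff the class matches and IOU >= epsilon."""
    return c == c_g and iou(b, b_g) >= epsilon
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` has a unit-square intersection and a union of 7, giving 1/7.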
Metric | Meaning | Definition and description
---|---|---
TP | True positive | A true positive detection, per Fig. 10
FP | False positive | A false positive detection, per Fig. 10
\(\beta \) | Confidence threshold | A confidence threshold for computing \(P(\beta )\) and \(R(\beta )\)
\(\varepsilon \) | IOU threshold | VOC: typically 0.5. ILSVRC: \(\min (0.5,\frac{wh}{(w+10)(h+10)})\), where \(w\times h\) is the size of a GT box. MS COCO: ten IOU thresholds \(\varepsilon \in \{0.5:0.05:0.95\}\)
\(P(\beta )\) | Precision | The fraction of correct detections among the detections returned by the detector with confidence of at least \(\beta \)
\(R(\beta )\) | Recall | The fraction of all \(N_c\) objects detected by the detector with confidence of at least \(\beta \)
AP | Average Precision | Computed over the different levels of recall achieved by varying the confidence \(\beta \)
mAP | mean Average Precision | VOC: AP at a single IOU, averaged over all classes. ILSVRC: AP at a modified IOU, averaged over all classes. MS COCO: \(AP_{\textit{coco}}\), mAP averaged over ten IOUs \(\{0.5:0.05:0.95\}\); \(AP^{\text {IOU}=0.5}_{{ coco}}\), mAP at \(\hbox {IOU}=0.50\) (PASCAL VOC metric); \(AP^{\text {IOU}=0.75}_{{ coco}}\), mAP at \(\hbox {IOU}=0.75\) (strict metric); \(AP^{\text {small}}_{{ coco}}\), mAP for small objects of area smaller than \(32^2\); \(AP^{\text {medium}}_{{ coco}}\), mAP for objects of area between \(32^2\) and \(96^2\); \(AP^{\text {large}}_{{ coco}}\), mAP for large objects of area bigger than \(96^2\)
AR | Average Recall | The maximum recall given a fixed number of detections per image, averaged over all categories and IOU thresholds. MS COCO: \(AR^{\text {max}=1}_{{ coco}}\), AR given 1 detection per image; \(AR^{\text {max}=10}_{{ coco}}\), AR given 10 detections per image; \(AR^{\text {max}=100}_{{ coco}}\), AR given 100 detections per image; \(AR^{\text {small}}_{{ coco}}\), AR for small objects of area smaller than \(32^2\); \(AR^{\text {medium}}_{{ coco}}\), AR for objects of area between \(32^2\) and \(96^2\); \(AR^{\text {large}}_{{ coco}}\), AR for large objects of area bigger than \(96^2\)
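To make the relationship between \(P(\beta )\), \(R(\beta )\) and AP concrete, the sketch below ranks detections by confidence, sweeps the threshold, and integrates precision over recall. It is a simplified illustration with hypothetical input names; it assumes the TP/FP flags have already been assigned by IOU matching, and uses all-point interpolation (the COCO and post-2010 VOC style; the earlier VOC metric used 11-point interpolation):

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class from detection confidences and TP/FP flags.

    scores[i] is the confidence of detection i, is_tp[i] says whether it
    matched a ground-truth box (class correct and IOU >= epsilon), and
    num_gt is the total number of ground-truth objects of this class.
    """
    # Rank detections by decreasing confidence (sweeping beta downward).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))   # P(beta) at this threshold
        recalls.append(tp / num_gt)         # R(beta) at this threshold
    # Interpolate: make precision non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Integrate the precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

mAP then averages this quantity over all classes, and \(AP_{\textit{coco}}\) additionally averages over the ten IOU thresholds.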
5 Detection Frameworks
5.1 Region Based (Two Stage) Frameworks
5.2 Unified (One Stage) Frameworks
6 Object Representation
No. | DCNN architecture | #Paras (\(\times 10^6\)) | #Layers (CONV+FC) | Test error (Top 5) | First used in | Highlights
---|---|---|---|---|---|---
1 | AlexNet (Krizhevsky et al. 2012b) | 57 | \(5+2\) | \(15.3\%\) | Girshick et al. (2014) | The first DCNN found effective for ImageNet classification; the historical turning point from handcrafted features to CNNs; won the ILSVRC2012 image classification competition
2 | ZFNet (fast) (Zeiler and Fergus 2014) | 58 | \(5+2\) | \(14.8\%\) | He et al. (2014) | Similar to AlexNet, differing in convolution stride, filter size, and number of filters in some layers
3 | OverFeat (Sermanet et al. 2014) | 140 | \(6+2\) | \(13.6\%\) | Sermanet et al. (2014) | Similar to AlexNet, differing in convolution stride, filter size, and number of filters in some layers
4 | VGGNet (Simonyan and Zisserman 2015) | 134 | \(13+2\) | \(6.8\%\) | Girshick (2015) | Increases network depth significantly by stacking \(3\times 3\) convolution filters and deepening the network step by step
5 | GoogLeNet (Szegedy et al. 2015) | 6 | 22 | \(6.7\%\) | Szegedy et al. (2015) | Uses the Inception module, which runs multiple branches of convolutional layers with different filter sizes and concatenates the feature maps they produce; the first inclusion of the bottleneck structure and global average pooling
6 | Inception v2 (Ioffe and Szegedy 2015) | 12 | 31 | \(4.8\%\) | Howard et al. (2017) | Faster training through the introduction of batch normalization
7 | Inception v3 (Szegedy et al. 2016) | 22 | 47 | \(3.6\%\) | − | Inclusion of separable convolution and spatial resolution reduction
8 | YOLONet (Redmon et al. 2016) | 64 | \(24+1\) | − | Redmon et al. (2016) | A network inspired by GoogLeNet, used in the YOLO detector
9 | ResNet50 (He et al. 2016) | 23.4 | 49 | \(3.6\%\) (ResNets) | He et al. (2016) | With identity mapping, substantially deeper networks can be learned
10 | ResNet101 (He et al. 2016) | 42 | 100 | − | He et al. (2016) | Requires fewer parameters than VGG by using the global average pooling and bottleneck introduced in GoogLeNet
11 | InceptionResNet v1 (Szegedy et al. 2017) | 21 | 87 | \(3.1\%\) (Ensemble) | − | Combines identity mapping and the Inception module, with computational cost similar to Inception v3 but faster training
12 | InceptionResNet v2 (Szegedy et al. 2017) | 30 | 95 | − | Huang et al. (2017b) | A costlier residual version of Inception, with significantly improved recognition performance
13 | Inception v4 (Szegedy et al. 2017) | 41 | 75 | − | − | An Inception variant without residual connections, with roughly the same recognition performance as InceptionResNet v2 but significantly slower
14 | ResNeXt (Xie et al. 2017) | 23 | 49 | \(3.0\%\) | Xie et al. (2017) | Repeats a building block that aggregates a set of transformations with the same topology
15 | DenseNet201 (Huang et al. 2017a) | 18 | 200 | − | Zhou et al. (2018b) | Connects each layer to every other layer in a feed-forward fashion; alleviates the vanishing gradient problem, encourages feature reuse, and reduces the number of parameters
16 | DarkNet (Redmon and Farhadi 2017) | 20 | 19 | − | Redmon and Farhadi (2017) | Similar to VGGNet, but with significantly fewer parameters
17 | MobileNet (Howard et al. 2017) | 3.2 | \(27+1\) | − | Howard et al. (2017) | Lightweight deep CNNs using depthwise separable convolutions
18 | SE ResNet (Hu et al. 2018b) | 26 | 50 | \(2.3\%\) (SENets) | Hu et al. (2018b) | Channel-wise attention via a novel Squeeze and Excitation block; complementary to existing backbone CNNs
6.1 Popular CNN Architectures
6.2 Methods For Improving Object Representation
6.2.1 Handling of Object Scale Variations
Group | Detector name | Region proposal | Backbone DCNN | Pipeline used | mAP@IoU\(=\)0.5 (VOC07) | mAP@IoU\(=\)0.5 (VOC12) | mAP@IoU\(=\)0.5 (COCO) | mAP (COCO) | Published in | Highlights
---|---|---|---|---|---|---|---|---|---|---
(1) Single detection with multilayer features | ION (Bell et al. 2016) | SS+EB MCG+RPN | VGG16 | Fast RCNN | 79.4 (07+12) | 76.4 (07+12) | 55.7 | 33.1 | CVPR16 | Use features from multiple layers; use spatial recurrent neural networks for modeling contextual information; the Best Student Entry and the 3rd overall in the COCO detection challenge 2015 |
HyperNet (Kong et al. 2016) | RPN | VGG16 | Faster RCNN | 76.3 (07+12) | 71.4 (07T+12) | − | − | CVPR16 | Use features from multiple layers for both region proposal and region classification | |
PVANet (Kim et al. 2016) | RPN | PVANet | Faster RCNN | 84.9 (07+12+CO) | 84.2 (07T+12+CO) | − | − | NIPSW16 | ||
(2) Detection at multiple layers | SDP+CRC (Yang et al. 2016b) | EB | VGG16 | Fast RCNN | 69.4 (07) | − | − | − | CVPR16 | Use features in multiple layers to reject easy negatives via CRC, and then classify remaining proposals using SDP |
MSCNN (Cai et al. 2016) | RPN | VGG | Faster RCNN | Only Tested on KITTI | ECCV16 | Region proposal and classification are performed at multiple layers; includes feature upsampling; end to end learning | ||||
MPN (Zagoruyko et al. 2016) | SharpMask (Pinheiro et al. 2016) | VGG16 | Fast RCNN | − | − | 51.9 | 33.2 | BMVC16 | Concatenate features from different convolutional layers and features of different contextual regions; loss function for multiple overlap thresholds; ranked 2nd in both the COCO15 detection and segmentation challenges | |
DSOD (Shen et al. 2017) | Free | DenseNet | SSD | 77.7 (07+12) | 72.2 (07T+12) | 47.3 | 29.3 | ICCV17 | Concatenate feature sequentially, like DenseNet. Train from scratch on the target dataset without pre-training | |
RFBNet (Liu et al. 2018b) | Free | VGG16 | SSD | 82.2 (07+12) | 81.2 (07T+12) | 55.7 | 34.4 | ECCV18 | Propose a multi-branch convolutional block similar to Inception (Szegedy et al. 2015), but using dilated convolution | |
(3) Combination of (1) and (2) | DSSD (Fu et al. 2017) | Free | ResNet101 | SSD | 81.5 (07+12) | 80.0 (07T+12) | 53.3 | 33.2 | 2017 | Use Conv-Deconv, as shown in Fig. 17c1, c2 |
FPN (Lin et al. 2017a) | RPN | ResNet101 | Faster RCNN | − | − | 59.1 | 36.2 | CVPR17 | Use Conv-Deconv, as shown in Fig. 17a1, a2; Widely used in detectors | |
TDM (Shrivastava et al. 2017) | RPN | ResNet101 VGG16 | Faster RCNN | − | − | 57.7 | 36.8 | CVPR17 | Use Conv-Deconv, as shown in Fig. 17b2 | |
RON (Kong et al. 2017) | RPN | VGG16 | Faster RCNN | 81.3 (07+12+CO) | 80.7 (07T+12+CO) | 49.5 | 27.4 | CVPR17 | Use Conv-deconv, as shown in Fig. 17d2; Add the objectness prior to significantly reduce object search space | |
ZIP (Li et al. 2018a) | RPN | Inceptionv2 | Faster RCNN | 79.8 (07+12) | − | − | − | IJCV18 | Use Conv-Deconv, as shown in Fig. 17f1. Propose a map attention decision (MAD) unit for features from different layers | |
STDN (Zhou et al. 2018b) | Free | DenseNet169 | SSD | 80.9 (07+12) | − | 51.0 | 31.8 | CVPR18 | A new scale transfer module, which resizes features of different scales to the same scale in parallel | |
RefineDet (Zhang et al. 2018a) | RPN | VGG16 ResNet101 | Faster RCNN | 83.8 (07+12) | 83.5 (07T+12) | 62.9 | 41.8 | CVPR18 | Use cascade to obtain better and less anchors. Use Conv-deconv, as shown in Fig. 17e2 to improve features | |
PANet (Liu et al. 2018c) | RPN | ResNeXt101 +FPN | Mask RCNN | − | − | 67.2 | 47.4 | CVPR18 | Shown in Fig. 17g. Based on FPN, add another bottom-up path to pass information between lower and topmost layers; adaptive feature pooling. Ranked 1st and 2nd in COCO 2017 tasks | |
DetNet (Li et al. 2018b) | RPN | DetNet59+FPN | Faster RCNN | − | − | 61.7 | 40.2 | ECCV18 | Introduces dilated convolution into the ResNet backbone to maintain high resolution in deeper layers; Shown in Fig. 17i | |
FPR (Kong et al. 2018) | − | VGG16 ResNet101 | SSD | 82.4 (07+12) | 81.1 (07T+12) | 54.3 | 34.6 | ECCV18 | Fuse task oriented features across different spatial locations and scales, globally and locally; Shown in Fig. 17h | |
M2Det (Zhao et al. 2019) | − | VGG16 ResNet101 | SSD | − | − | 64.6 | 44.2 | AAAI19 | Shown in Fig. 17j; a newly designed top-down path learns a set of multilevel features, recombined to construct a feature pyramid for object detection |
(4) Model geometric transforms | DeepIDNet (Ouyang et al. 2015) | SS+ EB | AlexNet ZFNet OverFeat GoogLeNet | RCNN | 69.0 (07) | − | − | 25.6 | CVPR15 | Introduce a deformation constrained pooling layer, jointly learned with convolutional layers in existing DCNNs. Utilize the following modules that are not trained end to end: cascade, context modeling, model averaging, and bounding box location refinement in the multistage detection pipeline |
DCN (Dai et al. 2017) | RPN | ResNet101 IRN | RFCN | 82.6 (07+12) | − | 58.0 | 37.5 | CVPR17 | Design deformable convolution and deformable RoI pooling modules that can replace plain convolution in existing DCNNs | |
DPFCN (Mordan et al. 2018) | AttractioNet (Gidaris and Komodakis 2016) | ResNet | RFCN | 83.3 (07+12) | 81.2 (07T+12) | 59.1 | 39.1 | IJCV18 | Design a deformable part based RoI pooling layer to explicitly select discriminative regions around object proposals |
6.3 Handling of Other Intraclass Variations
-
Geometric transformations,
-
Occlusions, and
-
Image degradations.
7 Context Modeling
Group | Detector name | Region proposal | Backbone DCNN | Pipeline used | mAP@IoU\(=\)0.5 (VOC07) | mAP@IoU\(=\)0.5 (VOC12) | mAP (COCO) | Published in | Highlights
---|---|---|---|---|---|---|---|---|---
Global context | SegDeepM (Zhu et al. 2015) | SS+CMPC | VGG16 | RCNN | VOC10 | VOC12 | − | CVPR15 | Additional features extracted from an enlarged object proposal as context information |
DeepIDNet (Ouyang et al. 2015) | SS+EB | AlexNet ZFNet | RCNN | 69.0 (07) | − | − | CVPR15 | Use image classification scores as global contextual information to refine the detection scores of each object proposal | |
ION (Bell et al. 2016) | SS+EB | VGG16 | Fast RCNN | 80.1 | 77.9 | 33.1 | CVPR16 | The contextual information outside the region of interest is integrated using spatial recurrent neural networks | |
CPF (Shrivastava and Gupta 2016) | RPN | VGG16 | Faster RCNN | 76.4 (07+12) | 72.6 (07T+12) | − | ECCV16 | Use semantic segmentation to provide top-down feedback | |
Local context | MRCNN (Gidaris and Komodakis 2015) | SS | VGG16 | SPPNet | 78.2 (07+12) | 73.9 (07+12) | − | ICCV15 | Extract features from multiple regions surrounding or inside the object proposals. Integrate the semantic segmentation-aware features |
GBDNet | CRAFT (Yang et al. 2016a) | Inception v2 ResNet269 PolyNet (Zhang et al. 2017) | Fast RCNN | 77.2 (07+12) | − | 27.0 | ECCV16 TPAMI18 | A GBDNet module to learn the relations of multiscale contextualized regions surrounding an object proposal; GBDNet passes messages among features from different context regions through convolution between neighboring support regions in two directions | |
ACCNN (Li et al. 2017b) | SS | VGG16 | Fast RCNN | 72.0 (07+12) | 70.6 (07T+12) | − | TMM17 | Use LSTM to capture global context. Concatenate features from multi-scale contextual regions surrounding an object proposal. The global and local context features are concatenated for recognition | |
CoupleNet (Zhu et al. 2017a) | RPN | ResNet101 | RFCN | 82.7 (07+12) | 80.4 (07T+12) | 34.4 | ICCV17 | Concatenate features from multiscale contextual regions surrounding an object proposal. Features of different contextual regions are then combined by convolution and element-wise sum | |
SMN (Chen and Gupta 2017) | RPN | VGG16 | Faster RCNN | 70.0 (07) | − | − | ICCV17 | Model object-object relationships efficiently through a spatial memory network. Learn the functionality of NMS automatically | |
ORN (Hu et al. 2018a) | RPN | ResNet101 +DCN | Faster RCNN | − | − | 39.0 | CVPR18 | Model the relations of a set of object proposals through the interactions between their appearance features and geometry; learn the functionality of NMS automatically |
SIN (Liu et al. 2018d) | RPN | VGG16 | Faster RCNN | 76.0 (07+12) | 73.1 (07T+12) | 23.2 | CVPR18 | Formulate object detection as graph-structured inference, where objects are graph nodes and relationships the edges |
7.1 Global Context
7.2 Local Context
8 Detection Proposal Methods
Proposer name | Backbone network | Detector tested | Recall@IoU\(=\)0.5 (VOC07) | Recall@IoU\(=\)0.7 (VOC07) | Recall@IoU\(=\)0.9 (VOC07) | mAP (VOC07) | mAP (VOC12) | mAP (COCO) | Published in | Highlights
---|---|---|---|---|---|---|---|---|---|---
Bounding box object proposal methods | ||||||||||
MultiBox1 (Erhan et al. 2014) | AlexNet | RCNN | − | − | − | 29.0 (10) (12) | − | − | CVPR14 | Learns a class agnostic regressor on a small set of 800 predefined anchor boxes. Do not share features for detection |
DeepBox (Kuo et al. 2015) | VGG16 | Fast RCNN | 0.96 (1000) | 0.84 (1000) | 0.15 (1000) | − | − | 37.8 (500) (IoU@0.5) | ICCV15 | Use a lightweight CNN to learn to rerank proposals generated by EdgeBox. Can run at 0.26s per image. Do not share features for detection |
RPN (Ren et al. 2015) | VGG16 | Faster RCNN | 0.97 (300) 0.98 (1000) | 0.79 (300) 0.84 (1000) | 0.04 (300) 0.04 (1000) | 73.2 (300) (07+12) | 70.4 (300) (07++12) | 21.9 (300) | NIPS15 | The first to generate object proposals by sharing full-image convolutional features with detection; the most widely used object proposal method; significant improvements in detection speed |
DeepProposal (Ghodrati et al. 2015) | VGG16 | Fast RCNN | 0.74 (100) 0.92 (1000) | 0.58 (100) 0.80 (1000) | 0.12 (100) 0.16 (1000) | 53.2 (100) (07) | − | − | ICCV15 | Generate proposals inside a DCNN in a multiscale manner. Share features with the detection network |
CRAFT (Yang et al. 2016a) | VGG16 | Faster RCNN | 0.98 (300) | 0.90 (300) | 0.13 (300) | 75.7 (07+12) | 71.3 (12) | − | CVPR16 | Introduced a classification network (i.e. two class Fast RCNN) cascade that comes after the RPN. Not sharing features extracted for detection |
AZNet (Lu et al. 2016) | VGG16 | Fast RCNN | 0.91 (300) | 0.71 (300) | 0.11 (300) | 70.4 (07) | − | 22.3 | CVPR16 | Use coarse-to-fine search: start from large regions, then recursively search for subregions that may contain objects. Adaptively guide computational resources to focus on likely subregions |
ZIP (Li et al. 2018a) | Inception v2 | Faster RCNN | 0.85 (300) COCO | 0.74 (300) COCO | 0.35 (300) COCO | 79.8 (07+12) | − | − | IJCV18 | Generate proposals using conv-deconv network with multilayers; Proposed a map attention decision (MAD) unit to assign the weights for features from different layers |
DeNet (Tychsen-Smith and Petersson 2017) | ResNet101 | Fast RCNN | 0.82 (300) | 0.74 (300) | 0.48 (300) | 77.1 (07+12) | 73.9 (07++12) | 33.8 | ICCV17 | Much faster than Faster RCNN; introduces bounding box corner estimation to predict object proposals efficiently, replacing RPN; does not require predefined anchors |
Proposer name | Backbone network | Detector tested | Box proposals (AR, COCO) | Segment proposals (AR, COCO) | Published in | Highlights
---|---|---|---|---|---|---
Segment proposal methods | ||||||||||
DeepMask (Pinheiro et al. 2015) | VGG16 | Fast RCNN | 0.33 (100), 0.48 (1000) | 0.26 (100), 0.37 (1000) | NIPS15 | First to generate object mask proposals with DCNN; Slow inference time; Need segmentation annotations for training; Not sharing features with detection network; Achieved mAP of \(69.9\%\) (500) with Fast RCNN | ||||
InstanceFCN (Dai et al. 2016a) | VGG16 | − | − | 0.32 (100), 0.39 (1000) | ECCV16 | |||||
SharpMask (Pinheiro et al. 2016) | MPN (Zagoruyko et al. 2016) | Fast RCNN | 0.39 (100), 0.53 (1000) | 0.30 (100), 0.39 (1000) | ECCV16 | Leverages features at multiple convolutional layers by introducing a top-down refinement module. Does not share features with detection network. Needs segmentation annotations for training | ||||
FastMask (Hu et al. 2017) | ResNet39 | − | 0.43 (100), 0.57 (1000) | 0.32 (100), 0.41 (1000) | CVPR17 | Generates instance segment proposals efficiently in one-shot manner similar to SSD (Liu et al. 2016). Uses multiscale convolutional features. Uses segmentation annotations for training |
9 Other Issues
Detector name | Region proposal | Backbone DCNN | Pipeline used | VOC07 results | VOC12 results | COCO results | Published in | Highlights
---|---|---|---|---|---|---|---|---|
MegDet (Peng et al. 2018) | RPN | ResNet50+FPN | FasterRCNN | − | − | 52.5 | CVPR18 | Allow training with much larger minibatch size than before by introducing cross GPU batch normalization; Can finish the COCO training in 4 hours on 128 GPUs and achieved improved accuracy; Won COCO2017 detection challenge |
SNIP (Singh et al. 2018b) | RPN | RFCN | − | − | 48.3 | CVPR18 | A new multiscale training scheme. Empirically examined the effect of up-sampling for small object detection. During training, only select objects that fit the scale of features as positive samples | |
SNIPER (Singh et al. 2018b) | RPN | ResNet101+DCN | Faster RCNN | − | − | 47.6 | 2018 | An efficient multiscale training strategy. Process context regions around ground-truth instances at the appropriate scale |
OHEM (Shrivastava et al. 2016) | SS | VGG16 | Fast RCNN | 78.9 (07+12) | 76.3 (07++12) | 22.4 | CVPR16 | A simple and effective Online Hard Example Mining algorithm to improve training of region based detectors |
FactorNet (Ouyang et al. 2016) | SS | GoogLeNet | RCNN | − | − | − | CVPR16 | Identify the imbalance in the number of samples across object categories; propose a divide-and-conquer feature learning scheme
Chained Cascade (Ouyang et al. 2017) | SS CRAFT | VGG Inceptionv2 | Fast RCNN, Faster RCNN | 80.4 (07+12) (SS+VGG) | − | − | ICCV17 | Jointly learn DCNN and multiple stages of cascaded classifiers. Boost detection accuracy on PASCAL VOC 2007 and ImageNet for both Fast RCNN and Faster RCNN using different region proposal methods
Cascade RCNN (Cai and Vasconcelos 2018) | RPN | VGG ResNet101 +FPN | Faster RCNN | − | − | 42.8 | CVPR18 | Jointly learn DCNN and multiple stages of cascaded classifiers, which are learned using different localization accuracy for selecting positive samples. Stack bounding box regression at multiple stages |
RetinaNet (Lin et al. 2017b) | − | ResNet101 +FPN | RetinaNet | − | − | 39.1 | ICCV17 | Propose a novel Focal Loss which focuses training on hard examples. Handles well the problem of imbalance of positive and negative samples when training a one-stage detector |
10 Discussion and Conclusion
10.1 State of the Art Performance
-
Innovations such as multilayer feature combination (Lin et al. 2017a; Shrivastava et al. 2017; Fu et al. 2017), deformable convolutional networks (Dai et al. 2017), deformable RoI pooling (Ouyang et al. 2015; Dai et al. 2017), heavier heads (Ren et al. 2016; Peng et al. 2018), and lighter heads (Li et al. 2018c);
-
Different detection proposal methods and different numbers of object proposals;
Detector name | RP | Backbone DCNN | Input ImgSize | VOC07 results | VOC12 results | Speed (FPS) | Published in | Source code | Highlights and Disadvantages |
---|---|---|---|---|---|---|---|---|---|
RCNN (Girshick et al. 2014) | SS | AlexNet | Fixed | 58.5 (07) | 53.3 (12) | \(<0.1\) | CVPR14 | Caffe Matlab | Highlights: First to integrate CNN with RP methods; Dramatic performance improvement over previous state of the art
Disadvantages: Multistage pipeline trained sequentially (external RP computation, CNN finetuning, passing each warped RP through the CNN, SVM and BBR training); Training is expensive in space and time; Testing is slow | | | | | | | | |
SPPNet (He et al. 2014) | SS | ZFNet | Arbitrary | 60.9 (07) | − | \(<1\) | ECCV14 | Caffe Matlab | Highlights: First to introduce SPP into CNN architecture; Enable convolutional feature sharing; Accelerate RCNN evaluation by orders of magnitude without sacrificing performance; Faster than OverFeat |
Disadvantages: Inherit disadvantages of RCNN; Does not result in much training speedup; Fine-tuning not able to update the CONV layers before SPP layer | |||||||||
Fast RCNN (Girshick 2015) | SS | AlexNet VGGM VGG16 | Arbitrary | 70.0 (VGG) (07+12) | 68.4 (VGG) (07++12) | \(<1\) | ICCV15 | Caffe Python | Highlights: First to enable end-to-end detector training (ignoring RP generation); Design a RoI pooling layer; Much faster and more accurate than SPPNet; No disk storage required for feature caching |
Disadvantages: External RP computation is exposed as the new bottleneck; Still too slow for real time applications | |||||||||
Faster RCNN (Ren et al. 2015) | RPN | ZFNet VGG | Arbitrary | 73.2 (VGG) (07+12) | 70.4 (VGG) (07++12) | \(<5\) | NIPS15 | Caffe Matlab Python | Highlights: Propose RPN for generating nearly cost-free and high quality RPs instead of selective search; Introduce translation invariant and multiscale anchor boxes as references in RPN; Unify RPN and Fast RCNN into a single network by sharing CONV layers; An order of magnitude faster than Fast RCNN without performance loss; Can run testing at 5 FPS with VGG16
Disadvantages: Training is complex, not a streamlined process; Still falls short of real time | |||||||||
RCNN\(\ominus \)R (Lenc and Vedaldi 2015) | New | ZFNet +SPP | Arbitrary | 59.7 (07) | − | \(<5\) | BMVC15 | − | Highlights: Replace selective search with static RPs; Prove the possibility of building integrated, simpler and faster detectors that rely exclusively on CNN |
Disadvantages: Falls short of real time; Decreased accuracy from poor RPs | |||||||||
RFCN (Dai et al. 2016c) | RPN | ResNet101 | Arbitrary | 80.5 (07+12) 83.6 (07+12+CO) | 77.6 (07++12) 82.0 (07++12+CO) | \(<10\) | NIPS16 | Caffe Matlab | Highlights: Fully convolutional detection network; Design a set of position sensitive score maps using a bank of specialized CONV layers; Faster than Faster RCNN without sacrificing much accuracy |
Disadvantages: Training is not a streamlined process; Still falls short of real time | |||||||||
Mask RCNN (He et al. 2017) | RPN | ResNet101 ResNeXt101 | Arbitrary | 50.3 (ResNeXt101) (COCO Result) | \(<5\) | ICCV17 | Caffe Matlab Python | Highlights: A simple, flexible, and effective framework for object instance segmentation; Extends Faster RCNN by adding another branch for predicting an object mask in parallel with the existing branch for BB prediction; Feature Pyramid Network (FPN) is utilized; Outstanding performance | |
Disadvantages: Falls short of real time applications | |||||||||
OverFeat (Sermanet et al. 2014) | − | AlexNet like | Arbitrary | − | − | \(<0.1\) | ICLR14 | c++ | Highlights: Convolutional feature sharing; Multiscale image pyramid CNN feature extraction; Won the ISLVRC2013 localization competition; Significantly faster than RCNN |
Disadvantages: Multi-stage pipeline sequentially trained; Single bounding box regressor; Cannot handle multiple object instances of the same class; Too slow for real time applications | |||||||||
YOLO (Redmon et al. 2016) | − | GoogLeNet like | Fixed | 66.4 (07+12) | 57.9 (07++12) | \(<25\) (VGG) | CVPR16 | DarkNet | Highlights: First efficient unified detector; Drop RP process completely; Elegant and efficient detection framework; Significantly faster than previous detectors; YOLO runs at 45 FPS, Fast YOLO at 155 FPS; |
Disadvantages: Accuracy falls far behind state of the art detectors; Struggle to localize small objects | |||||||||
YOLOv2 (Redmon and Farhadi 2017) | − | DarkNet | Fixed | 78.6 (07+12) | 73.5 (07++12) | \(<50\) | CVPR17 | DarkNet | Highlights: Propose a faster DarkNet19; Use a number of existing strategies to improve both speed and accuracy; Achieve high accuracy and high speed; YOLO9000 can detect over 9000 object categories in real time |
Disadvantages: Not good at detecting small objects | |||||||||
SSD (Liu et al. 2016) | − | VGG16 | Fixed | 76.8 (07+12) 81.5 (07+12+CO) | 74.9 (07++12) 80.0 (07++12+CO) | \(<60\) | ECCV16 | Caffe Python | Highlights: First accurate and efficient unified detector; Effectively combine ideas from RPN and YOLO to perform detection at multi-scale CONV layers; Faster and significantly more accurate than YOLO; Can run at 59 FPS; |
Disadvantages: Not good at detecting small objects |
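The RoI pooling layer highlighted for Fast RCNN above is what lets a single fixed-size classification head process region proposals of arbitrary size: each RoI is divided into a fixed grid of bins and max-pooled per bin. A minimal NumPy sketch, with illustrative function names and a toy feature map (not taken from any detector's codebase):

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of interest to a fixed spatial size.

    feature_map: (H, W, C) array of CONV features.
    roi: (x0, y0, x1, y1) box in feature-map coordinates.
    output_size: (h, w) of the pooled output, the same for every RoI.
    Assumes the RoI is at least output_size in both dimensions
    (otherwise a pooling bin would be empty).
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    h_out, w_out = output_size
    # Split the region into an h_out x w_out grid of roughly equal bins
    # and take the channel-wise max inside each bin.
    y_edges = np.linspace(0, region.shape[0], h_out + 1).astype(int)
    x_edges = np.linspace(0, region.shape[1], w_out + 1).astype(int)
    out = np.zeros((h_out, w_out, feature_map.shape[2]))
    for i in range(h_out):
        for j in range(w_out):
            bin_ = region[y_edges[i]:y_edges[i + 1],
                          x_edges[j]:x_edges[j + 1], :]
            out[i, j] = bin_.max(axis=(0, 1))
    return out

# Two RoIs of different sizes pool to the same fixed shape.
fmap = np.random.rand(16, 16, 8)
pooled_a = roi_max_pool(fmap, (0, 0, 8, 8))
pooled_b = roi_max_pool(fmap, (3, 2, 13, 15))
print(pooled_a.shape, pooled_b.shape)  # (2, 2, 8) (2, 2, 8)
```

In Fast RCNN the pooled features then feed the fully connected layers shared by the classification and BBR heads, which is what removes the fixed-input-size constraint of RCNN.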
10.2 Summary and Discussion
- When large computational cost is allowed, two-stage detectors generally produce higher detection accuracy than one-stage detectors, as evidenced by the fact that the winning approaches in prominent detection challenges are predominantly based on two-stage frameworks, whose structure is more flexible and better suited for region based classification. The most widely used frameworks are Faster RCNN (Ren et al. 2015), RFCN (Dai et al. 2016c) and Mask RCNN (He et al. 2017).
- One-stage detectors like YOLO (Redmon et al. 2016) and SSD (Liu et al. 2016) are generally faster than two-stage ones, because they avoid preprocessing algorithms, use lightweight backbone networks, perform prediction with fewer candidate regions, and make the classification subnetwork fully convolutional. However, two-stage detectors can run in real time once similar techniques are introduced. In any event, whether one stage or two, the most time-consuming step is the feature extractor (backbone network) (Law and Deng 2018; Ren et al. 2015).
- Using image pyramids: they are simple and effective, helping to enlarge small objects and to shrink large ones. Although computationally expensive, they are nevertheless commonly used during inference for better accuracy.
- Using features from convolutional layers of different resolutions: in early work like SSD (Liu et al. 2016), predictions are performed independently at each layer, and no information from other layers is combined or merged. It is now quite standard to combine features from different layers, e.g. in FPN (Lin et al. 2017a).
- Using anchor boxes of different scales and aspect ratios: drawbacks include the large number of anchor parameters, and the fact that the scales and aspect ratios of anchor boxes are usually determined heuristically.
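The multiscale, multi-aspect-ratio anchor boxes discussed above (e.g. the references used by the RPN in Faster RCNN) can be sketched as follows; the function names, base size, and the specific scales and ratios are illustrative assumptions rather than the settings of any particular implementation:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes centered at the origin.

    For each (scale, ratio) pair, produce a box of area
    (base_size * scale)^2 reshaped so that height/width == ratio.
    Returns (len(scales) * len(ratios), 4) boxes as (x0, y0, x1, y1).
    """
    anchors = []
    for scale in scales:
        side = base_size * scale
        for ratio in ratios:
            # Keep area == side^2 while setting h/w = ratio.
            w = side / np.sqrt(ratio)
            h = side * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Tile the base anchors over every feature-map location."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (anchors[None] + shifts).reshape(-1, 4)

base = make_anchors()                    # 3 scales x 3 ratios = 9 anchors
all_anchors = shift_anchors(base, 4, 4)  # 4*4 locations x 9 = 144 anchors
print(base.shape, all_anchors.shape)     # (9, 4) (144, 4)
```

The sketch makes the drawback noted above concrete: the base size, scales, ratios, and stride are all hand-chosen hyperparameters, and the anchor count grows multiplicatively with each added scale or ratio.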
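Combining features from layers of different resolutions, as in FPN, follows a top-down pathway with lateral connections: coarse, semantically strong maps are upsampled and added to finer maps after a 1x1 projection. In the self-contained NumPy sketch below, random projection matrices stand in for the learned 1x1 convolutions, purely for illustration; all names are assumptions, not FPN's actual API:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(c_maps, out_channels=4):
    """Top-down pathway with lateral connections, FPN-style.

    c_maps: backbone feature maps ordered fine to coarse, each (H, W, C_i)
    with H and W halving at every level. Returns merged maps in the
    same fine-to-coarse order, all with out_channels channels.
    """
    rng = np.random.default_rng(0)
    # "1x1 conv" lateral projections to a common channel count
    # (random matrices here instead of learned weights).
    laterals = [c @ rng.standard_normal((c.shape[-1], out_channels))
                for c in c_maps]
    p_maps = [laterals[-1]]              # start from the coarsest level
    for lat in reversed(laterals[:-1]):
        # Upsample the coarser merged map and add the lateral features.
        p_maps.append(lat + upsample2x(p_maps[-1]))
    return p_maps[::-1]                  # back to fine-to-coarse order

c2 = np.random.rand(8, 8, 16)   # fine, shallow layer
c3 = np.random.rand(4, 4, 32)
c4 = np.random.rand(2, 2, 64)   # coarse, deep layer
p2, p3, p4 = fpn_merge([c2, c3, c4])
print(p2.shape, p3.shape, p4.shape)  # (8, 8, 4) (4, 4, 4) (2, 2, 4)
```

Each merged level thus carries both fine spatial detail from the shallow layer and semantic context propagated down from the deep layer, which is why predicting from these merged maps helps with small objects.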
10.3 Research Directions
- Working in an open world: being robust to any number of environmental changes, and being able to evolve or adapt.
- Object detection under constrained conditions: learning from weakly labeled data or from only a few bounding box annotations, detection on wearable devices, detecting unseen object categories, etc.
- Object detection in other modalities: video, RGB-D images, 3D point clouds, lidar, remotely sensed imagery, etc.