
International Journal of Computer Vision

Publication date: 17-07-2018

End-to-End Learning of Latent Deformable Part-Based Representations for Object Detection

Authors: Taylor Mordan, Nicolas Thome, Gilles Henaff, Matthieu Cord

Published in: International Journal of Computer Vision | Issue 11-12/2019


Abstract

Object detection methods usually represent objects through rectangular bounding boxes from which they extract features, regardless of the objects' actual shapes. In this paper, we apply deformations to regions in order to learn representations that better fit objects. We introduce DP-FCN, a deep model implementing this idea by learning to align parts to discriminative elements of objects in a latent way, i.e., without part annotations. This approach has two main assets: it builds invariance to local transformations, thus improving recognition, and it brings geometric information that describes objects more finely, leading to more accurate localization. We further develop both features in a new model named DP-FCN2.0 by explicitly learning interactions between parts. Alignment is done with an in-network joint optimization of all parts based on a CRF with custom potentials, and deformations influence localization through a bilinear product. We validate our models on the PASCAL VOC and MS COCO datasets and show significant gains. DP-FCN2.0 achieves state-of-the-art results of 83.3% and 81.2% on VOC 2007 and VOC 2012, respectively, using VOC data only.
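To make the idea of aligning latent parts under a deformation penalty more concrete, here is a minimal sketch of DPM-style deformable part pooling for a single region. It is not the authors' implementation: the function names `deformable_part_pool` and `pool_roi`, the quadratic deformation cost weighted by `lam`, the brute-force search within a `max_disp` window, and the averaging of part scores into one region score are all assumptions made for illustration. Each part keeps the displacement that maximizes its response minus the deformation cost, and those displacements are the kind of geometric information the abstract says can improve localization.

```python
from typing import List, Tuple

Grid = List[List[float]]  # one part's score map, indexed [row][col]


def deformable_part_pool(
    score_map: Grid, anchor: Tuple[int, int], max_disp: int, lam: float
) -> Tuple[float, Tuple[int, int]]:
    """Best penalized response of one part and the displacement achieving it.

    Hypothetical helper: assumes a quadratic deformation cost lam * (dy^2 + dx^2)
    and an exhaustive search within +/- max_disp cells of the anchor.
    """
    h, w = len(score_map), len(score_map[0])
    ay, ax = anchor
    best_score, best_disp = float("-inf"), (0, 0)
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            y, x = ay + dy, ax + dx
            if not (0 <= y < h and 0 <= x < w):
                continue  # skip displacements that leave the map
            score = score_map[y][x] - lam * (dy * dy + dx * dx)
            if score > best_score:  # strict: ties keep the first candidate found
                best_score, best_disp = score, (dy, dx)
    return best_score, best_disp


def pool_roi(
    part_maps: List[Grid], anchors: List[Tuple[int, int]], max_disp: int, lam: float
) -> Tuple[float, List[Tuple[int, int]]]:
    """Align every part independently, then average their scores (an assumption)."""
    results = [deformable_part_pool(m, a, max_disp, lam) for m, a in zip(part_maps, anchors)]
    roi_score = sum(s for s, _ in results) / len(results)
    displacements = [d for _, d in results]  # geometric cues, e.g. for box refinement
    return roi_score, displacements


if __name__ == "__main__":
    part_map = [[0.0] * 5 for _ in range(5)]
    part_map[1][3] = 2.0  # a strong response one cell up and to the right of (2, 2)
    print(deformable_part_pool(part_map, (2, 2), max_disp=1, lam=0.5))  # (1.0, (-1, 1))
    print(deformable_part_pool(part_map, (2, 2), max_disp=1, lam=1.5))  # (0.0, (0, 0))
```

With a small `lam` the part moves onto the strong response; with a larger `lam` the cost outweighs the gain and the part stays at its anchor. Here every part is aligned independently and the search is exhaustive; the joint CRF-based alignment and the bilinear localization of DP-FCN2.0 described above are not modeled in this sketch.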



Publisher: Springer US
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-018-1109-z
