Published in: International Journal of Computer Vision 3/2019

20.06.2018

Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection

Authors: Hongyang Li, Yu Liu, Wanli Ouyang, Xiaogang Wang


Abstract

In this paper, we propose a zoom-out-and-in network for generating object proposals. A key observation is that anchors of different sizes are difficult to classify with the same set of features. Instead, anchors should be placed at different depths within the network according to their size: smaller boxes on high-resolution layers with a smaller stride, and larger boxes on low-resolution layers with a larger stride. Inspired by the conv/deconv structure, we fully leverage the low-level local details and high-level regional semantics from two feature-map streams, which are complementary to each other, to identify objectness in an image. We further propose a map attention decision (MAD) unit that aggressively searches for neuron activations across the two streams and attends to those that contribute most to the feature learning of the final loss. The unit serves as a decision-maker that adaptively activates maps along certain channels, with the sole purpose of optimizing the overall training loss. One advantage of MAD is that the learned weight applied to each feature channel is predicted on-the-fly from the input context, which is more suitable than the fixed weighting of a convolutional kernel. Experimental results on three datasets demonstrate the effectiveness of our proposed algorithm over other state-of-the-art methods, in terms of average recall for region proposal and average precision for object detection.
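The channel-gating idea behind the MAD unit can be sketched very loosely as follows: pool the two feature streams into a context vector, predict a per-channel weight from that context, and rescale each channel by its weight. This is a minimal illustrative sketch only; the function name `mad_gate`, the single linear layer, and all shapes are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

def mad_gate(zoom_out_feat, zoom_in_feat, w, b):
    """Illustrative channel-attention gate: pool both streams into a
    context vector, predict one weight per channel from that context,
    and rescale the concatenated feature maps channel-wise."""
    # Concatenate the two streams along the channel axis: (C1 + C2, H, W)
    feat = np.concatenate([zoom_out_feat, zoom_in_feat], axis=0)
    # Global average pool each channel -> context vector of length C
    context = feat.mean(axis=(1, 2))
    # Predict per-channel weights on-the-fly from the input context
    logits = w @ context + b
    weights = 1.0 / (1.0 + np.exp(-logits))  # sigmoid gate in (0, 1)
    # Rescale each channel by its predicted weight
    return feat * weights[:, None, None]

rng = np.random.default_rng(0)
c1, c2, h, width = 4, 4, 8, 8
zoom_out = rng.standard_normal((c1, h, width))
zoom_in = rng.standard_normal((c2, h, width))
w = rng.standard_normal((c1 + c2, c1 + c2))
b = np.zeros(c1 + c2)
out = mad_gate(zoom_out, zoom_in, w, b)
print(out.shape)  # (8, 8, 8)
```

Because the weights are a function of the pooled input context rather than fixed parameters, two different images gate the same channel differently, which is the "predicted on-the-fly" property the abstract highlights.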


Footnotes
2
The first row is the inner product of two vectors, which yields a scalar gradient; the second is the usual multiplication of a vector by a scalar, which likewise yields a vector.
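The distinction in this footnote can be seen in a two-line example (the vectors below are arbitrary illustrative values):

```python
import numpy as np

g = np.array([1.0, 2.0, 3.0])   # a gradient vector
v = np.array([0.5, -1.0, 2.0])  # another vector

inner = g @ v       # inner product of two vectors -> a scalar
scaled = inner * v  # a scalar times a vector -> a vector

print(inner)   # 4.5
print(scaled)  # [ 2.25 -4.5   9.  ]
```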
 
3
Direction ‘matches’ means the included angle between the two vectors in multi-dimensional space is less than \(90{^{\circ }}\); ‘departs’ means the angle falls in \([90{^{\circ }}, 180{^{\circ }}]\).
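Since the included angle is below \(90{^{\circ }}\) exactly when the cosine (i.e., the normalized inner product) is positive, the match/depart test reduces to a sign check. A small sketch, with illustrative vectors:

```python
import numpy as np

def direction(u, v):
    """'matches' if the included angle is below 90 degrees (positive
    cosine); 'departs' if it falls in [90, 180] (non-positive cosine)."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return "matches" if cos > 0 else "departs"

print(direction(np.array([1.0, 0.0]), np.array([1.0, 1.0])))   # matches (45 degrees)
print(direction(np.array([1.0, 0.0]), np.array([-1.0, 1.0])))  # departs (135 degrees)
```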
 
Metadata
Title
Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection
Authors
Hongyang Li
Yu Liu
Wanli Ouyang
Xiaogang Wang
Publication date
20.06.2018
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 3/2019
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-018-1101-7
