Skip to main content
Top
Published in: International Journal of Machine Learning and Cybernetics 6/2021

03-01-2021 | Original Article

Attention-based context aggregation network for monocular depth estimation

Authors: Yuru Chen, Haitao Zhao, Zhengwei Hu, Jingchao Peng

Published in: International Journal of Machine Learning and Cybernetics | Issue 6/2021

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Depth estimation is a traditional computer vision task, which plays a crucial role in understanding 3D scene geometry. Recently, algorithms that combine the multi-scale features extracted by the dilated convolution based block (atrous spatial pyramid pooling, ASPP) have gained significant improvements in depth estimation. However, the discretized and predefined dilation kernels cannot capture the continuous context information that differs in diverse scenes and easily introduce the grid artifacts. This paper proposes a novel algorithm, called attention-based context aggregation network (ACAN) for depth estimation. A supervised self-attention model is designed and utilized to adaptively learn the task-specific similarities between different pixels to model the continuous context information. Moreover, a soft ordinal inference is proposed to transform the predicted probabilities to continuous depth values which reduce the discretization error (about 1% decrease in RMSE). ACAN achieves state-of-the-art performance on public monocular depth-estimation benchmark datasets. The source code of ACAN can be found in https://​github.​com/​miraiaroha/​ACAN.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Show more products
Literature
1.
go back to reference Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images 7576(1):746–760 Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images 7576(1):746–760
2.
go back to reference Simon M, Milz S, Amende K, Gross HM (2018) Complex-yolo: real-time 3d object detection on point clouds Simon M, Milz S, Amende K, Gross HM (2018) Complex-yolo: real-time 3d object detection on point clouds
3.
go back to reference Tateno K, Tombari F, Laina I, Navab N (2017) Cnn-slam: real-time dense monocular slam with learned depth prediction. p 6565–6574 Tateno K, Tombari F, Laina I, Navab N (2017) Cnn-slam: real-time dense monocular slam with learned depth prediction. p 6565–6574
4.
go back to reference Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. 3D Vision (3DV), 2016 fourth international conference on. p 239–248. IEEE Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. 3D Vision (3DV), 2016 fourth international conference on. p 239–248. IEEE
5.
go back to reference Ghosh S, Pal A, Jaiswal S, Santosh KC, Das N, Nasipuri M (2019) Segfast-v2: Semantic image segmentation with less parameters in deep learning for autonomous driving. Int J Mach Learn Cybern 10(11):3145–3154CrossRef Ghosh S, Pal A, Jaiswal S, Santosh KC, Das N, Nasipuri M (2019) Segfast-v2: Semantic image segmentation with less parameters in deep learning for autonomous driving. Int J Mach Learn Cybern 10(11):3145–3154CrossRef
6.
go back to reference Hirschmüller H (2005) Accurate and efficient stereo processing by semi-global matching and mutual information. IEEE computer society conference on computer vision and pattern recognition. p 807–814 Hirschmüller H (2005) Accurate and efficient stereo processing by semi-global matching and mutual information. IEEE computer society conference on computer vision and pattern recognition. p 807–814
7.
go back to reference Roberts R, Sinha SN, Szeliski R, Steedly D (2011) Structure from motion for scenes with large duplicate structures. IEEE conference on computer vision and pattern recognition. p 3137–3144 Roberts R, Sinha SN, Szeliski R, Steedly D (2011) Structure from motion for scenes with large duplicate structures. IEEE conference on computer vision and pattern recognition. p 3137–3144
8.
go back to reference Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Int Conf Neural Inf Process Syst. 1:2366–2374 Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Int Conf Neural Inf Process Syst. 1:2366–2374
9.
go back to reference Eigen D, Fergus R (2014) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. pp. 2650–2658 Eigen D, Fergus R (2014) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. pp. 2650–2658
10.
go back to reference Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. International conference on medical image computing and computer-assisted intervention. p 234–241 Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. International conference on medical image computing and computer-assisted intervention. p 234–241
11.
go back to reference Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. p 483–499 Newell A, Yang K, Deng J (2016) Stacked hourglass networks for human pose estimation. p 483–499
12.
go back to reference Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. p 4724–4732 Wei SE, Ramakrishna V, Kanade T, Sheikh Y (2016) Convolutional pose machines. p 4724–4732
13.
go back to reference Huang J, Lee AB, Mumford D (2000) Statistics of range images. Comput Vis Pattern Recogn. Proceedings IEEE conference on. vol.1. p 324–331 Huang J, Lee AB, Mumford D (2000) Statistics of range images. Comput Vis Pattern Recogn. Proceedings IEEE conference on. vol.1. p 324–331
15.
go back to reference LC Chen, G Papandreou, I Kokkinos, K Murphy, AL Yuille (2018) Deeplab Semantic image segmentation with deep convolutional nets atrous convolution and fully connected. IEEE Trans Pattern Anal Mach Intell 40(4): 834–848CrossRef LC Chen, G Papandreou, I Kokkinos, K Murphy, AL Yuille (2018) Deeplab Semantic image segmentation with deep convolutional nets atrous convolution and fully connected. IEEE Trans Pattern Anal Mach Intell 40(4): 834–848CrossRef
16.
go back to reference Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation
17.
go back to reference Wang P, Chen P, Yuan Y, Liu D, Huang Z, Hou X, Cottrell G (2018) Understanding convolution for semantic segmentation. IEEE winter conference on applications of computer vision. p 1451–1460 Wang P, Chen P, Yuan Y, Liu D, Huang Z, Hou X, Cottrell G (2018) Understanding convolution for semantic segmentation. IEEE winter conference on applications of computer vision. p 1451–1460
18.
go back to reference Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need
19.
go back to reference Wang X, Girshick R, Gupta A, He K (2017) Non-local neural networks Wang X, Girshick R, Gupta A, He K (2017) Non-local neural networks
21.
go back to reference Saxena A, Chung SH, Ng AY (2005) Learning depth from single monocular images. International conference on neural information processing systems. p 1161–1168 Saxena A, Chung SH, Ng AY (2005) Learning depth from single monocular images. International conference on neural information processing systems. p 1161–1168
22.
go back to reference Saxena A, Sun M, Ng AY (2007) Learning 3-d scene structure from a single still image. IEEE international conference on computer vision. p 1–8 Saxena A, Sun M, Ng AY (2007) Learning 3-d scene structure from a single still image. IEEE international conference on computer vision. p 1–8
23.
go back to reference Liu B, Gould S, Koller D (2010) Single image depth estimation from predicted semantic labels. Comput Vis Pattern Recogn. p 1253–1260 Liu B, Gould S, Koller D (2010) Single image depth estimation from predicted semantic labels. Comput Vis Pattern Recogn. p 1253–1260
24.
go back to reference Ladicky L, Shi J, Pollefeys M (2014) Pulling things out of perspective. IEEE Conf Comput Vis Pattern Recogn 9:89–96 Ladicky L, Shi J, Pollefeys M (2014) Pulling things out of perspective. IEEE Conf Comput Vis Pattern Recogn 9:89–96
25.
go back to reference Junjie H, Ozay M, Zhang Y, Okatani T (2018) Toward higher resolution maps with accurate object boundaries, revisiting single image depth estimation Junjie H, Ozay M, Zhang Y, Okatani T (2018) Toward higher resolution maps with accurate object boundaries, revisiting single image depth estimation
26.
go back to reference Han Yan, Shunli Zhang, Yu Zhang, and Li Zhang. Monocular depth estimation with guidance of surface normal map. Neurocomputing, 280:86–100, 2018CrossRef Han Yan, Shunli Zhang, Yu Zhang, and Li Zhang. Monocular depth estimation with guidance of surface normal map. Neurocomputing, 280:86–100, 2018CrossRef
27.
go back to reference Junning Zhang, Qunxing Su, Pengyuan Liu, Chao Xu, and Yanlong Chen. Unsupervised learning of monocular depth and ego-motion with spacešctemporal-centroid loss. International Journal of Machine Learning and Cybernetics, 11(3), 615–627, 2020CrossRef Junning Zhang, Qunxing Su, Pengyuan Liu, Chao Xu, and Yanlong Chen. Unsupervised learning of monocular depth and ego-motion with spacešctemporal-centroid loss. International Journal of Machine Learning and Cybernetics, 11(3), 615–627, 2020CrossRef
28.
go back to reference Roy A, Todorovic S (2016) Monocular depth estimation using neural regression forest. Comput Vis Pattern Recogn. p 5506–5514 Roy A, Todorovic S (2016) Monocular depth estimation using neural regression forest. Comput Vis Pattern Recogn. p 5506–5514
29.
go back to reference Zwald L, Lambertlacroix S (2012) The berhu penalty and the grouped effect. Statistics Zwald L, Lambertlacroix S (2012) The berhu penalty and the grouped effect. Statistics
30.
go back to reference Garg R, Vijay Kumar BG, Carneiro G, Reid I (2016) Unsupervised cnn for single view depth estimation: Geometry to the rescue. European conference on computer vision. p 740–756 Garg R, Vijay Kumar BG, Carneiro G, Reid I (2016) Unsupervised cnn for single view depth estimation: Geometry to the rescue. European conference on computer vision. p 740–756
31.
go back to reference Godard C, Aodha OM, Brostow GJ (2017) Unsupervised monocular depth estimation with left-right consistency. Comput Vis Pattern Recogn. 1:6602–6611 Godard C, Aodha OM, Brostow GJ (2017) Unsupervised monocular depth estimation with left-right consistency. Comput Vis Pattern Recogn. 1:6602–6611
32.
go back to reference Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612CrossRef Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612CrossRef
33.
go back to reference Heise P, Klose S, Jensen B, Knoll A (2014) Pm-huber: Patchmatch with huber regularization for stereo matching. IEEE international conference on computer vision. p 2360–2367 Heise P, Klose S, Jensen B, Knoll A (2014) Pm-huber: Patchmatch with huber regularization for stereo matching. IEEE international conference on computer vision. p 2360–2367
34.
go back to reference Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125(1–3), 3–18, 2015MathSciNet Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125(1–3), 3–18, 2015MathSciNet
35.
go back to reference Yu F, Koltun V, Funkhouser T (2017) Dilated residual networks. p 636–644 Yu F, Koltun V, Funkhouser T (2017) Dilated residual networks. p 636–644
36.
go back to reference Kim Y, Jung H, Min D, Sohn K (2018) Deep monocular depth estimation via integration of global and local predictions. IEEE Trans Image Process Publ IEEE Sig Process Soc 99:1–1 Kim Y, Jung H, Min D, Sohn K (2018) Deep monocular depth estimation via integration of global and local predictions. IEEE Trans Image Process Publ IEEE Sig Process Soc 99:1–1
37.
go back to reference Xu D, Ricci E, Ouyang W, Wang X, Sebe N (2017) Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. p 161–169 Xu D, Ricci E, Ouyang W, Wang X, Sebe N (2017) Multi-scale continuous crfs as sequential deep networks for monocular depth estimation. p 161–169
38.
go back to reference Liu F, Shen C, Lin G (2015) Deep convolutional neural fields for depth estimation from a single image. IEEE conference on computer vision and pattern recognition. p 5162–5170 Liu F, Shen C, Lin G (2015) Deep convolutional neural fields for depth estimation from a single image. IEEE conference on computer vision and pattern recognition. p 5162–5170
39.
go back to reference Li B, Shen C, Dai Y, Van Den Hengel A, He M (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. Comput Vis Pattern Recogn. p 1119–1127 Li B, Shen C, Dai Y, Van Den Hengel A, He M (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. Comput Vis Pattern Recogn. p 1119–1127
40.
go back to reference F. Liu, C. Shen, G. Lin, and I Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, 38(10), 2024–2039, 2015CrossRef F. Liu, C. Shen, G. Lin, and I Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, 38(10), 2024–2039, 2015CrossRef
41.
go back to reference Zhang Z, Xu C, Yang J, Gao J, Cui Z (2018) Progressive hard-mining network for monocular depth estimation. IEEE Trans Image Process. 99:1–1MathSciNetMATH Zhang Z, Xu C, Yang J, Gao J, Cui Z (2018) Progressive hard-mining network for monocular depth estimation. IEEE Trans Image Process. 99:1–1MathSciNetMATH
42.
go back to reference Li B, Dai Y, He M (2018) Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recogn Li B, Dai Y, He M (2018) Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference. Pattern Recogn
43.
go back to reference Moukari M, Picard S, Simon L, Jurie F (2018) Deep multi-scale architectures for monocular depth estimation. arXiv preprint arXiv:1806.03051 Moukari M, Picard S, Simon L, Jurie F (2018) Deep multi-scale architectures for monocular depth estimation. arXiv preprint arXiv:​1806.​03051
44.
go back to reference Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular depth estimation. Proceedings of the IEEE conference on computer vision and pattern recognition. p. 2002–2011 Fu H, Gong M, Wang C, Batmanghelich K, Tao D (2018) Deep ordinal regression network for monocular depth estimation. Proceedings of the IEEE conference on computer vision and pattern recognition. p. 2002–2011
45.
go back to reference Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z, Du D, Huang C, Torr PHS (2015) Conditional random fields as recurrent neural networks. p 1529–1537 Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z, Du D, Huang C, Torr PHS (2015) Conditional random fields as recurrent neural networks. p 1529–1537
46.
go back to reference Lin G, Shen C, Reid I, Van Dan Hengel A (2015) Efficient piecewise training of deep structured models for semantic segmentation. p 3194–3203 Lin G, Shen C, Reid I, Van Dan Hengel A (2015) Efficient piecewise training of deep structured models for semantic segmentation. p 3194–3203
47.
go back to reference Cao Y, Wu Zi, Shen C (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans Circ Syst Video Technol. 99:1–1 Cao Y, Wu Zi, Shen C (2017) Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans Circ Syst Video Technol. 99:1–1
48.
go back to reference He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. p. 770–778 He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition. p. 770–778
49.
go back to reference Zia T, Abbas A, Habib U, Khan MS (2020) Learning deep hierarchical and temporal recurrent neural networks with residual learning. Int J Mach Learn Cybern 11(4):873–882CrossRef Zia T, Abbas A, Habib U, Khan MS (2020) Learning deep hierarchical and temporal recurrent neural networks with residual learning. Int J Mach Learn Cybern 11(4):873–882CrossRef
50.
go back to reference Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Learning deep features for discriminative localization. p 2921–2929 Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2015) Learning deep features for discriminative localization. p 2921–2929
52.
go back to reference Li R, Xian K, Shen C, Cao Z, Lu H, Hang L (2018) Deep attention-based classification network for robust depth prediction Li R, Xian K, Shen C, Cao Z, Lu H, Hang L (2018) Deep attention-based classification network for robust depth prediction
53.
go back to reference Niu Z, Zhou M, Wang L, Gao X, Hua G (2016) Ordinal regression with multiple output cnn for age estimation. The IEEE conference on computer vision and pattern recognition (CVPR) Niu Z, Zhou M, Wang L, Gao X, Hua G (2016) Ordinal regression with multiple output cnn for age estimation. The IEEE conference on computer vision and pattern recognition (CVPR)
54.
go back to reference Geiger A (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. IEEE conference on computer vision and pattern recognition. p 3354–3361 Geiger A (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. IEEE conference on computer vision and pattern recognition. p 3354–3361
55.
go back to reference Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252, 2015MathSciNetCrossRef Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252, 2015MathSciNetCrossRef
Metadata
Title
Attention-based context aggregation network for monocular depth estimation
Authors
Yuru Chen
Haitao Zhao
Zhengwei Hu
Jingchao Peng
Publication date
03-01-2021
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 6/2021
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-020-01251-y

Other articles of this Issue 6/2021

International Journal of Machine Learning and Cybernetics 6/2021 Go to the issue