International Journal of Computer Vision

Publication date: 23-10-2023

Towards a Unified Network for Robust Monocular Depth Estimation: Network Architecture, Training Strategy and Dataset

Authors: Mochu Xiang, Yuchao Dai, Feiyu Zhang, Jiawei Shi, Xinyu Tian, Zhensong Zhang

Published in: International Journal of Computer Vision | Issue 4/2024

Abstract

Robust monocular depth estimation (MDE) aims to learn a unified model that works across diverse real-world scenes, and it is an important and active topic in computer vision. In this paper, we present Megatron_RVC, our winning solution for the monocular depth challenge in the Robust Vision Challenge (RVC) 2022, in which we tackled the problem from three perspectives: network architecture, training strategy and dataset. In particular, we made three contributions towards robust MDE: (1) we built a high-capacity neural network that enables flexible and accurate monocular depth prediction and contains dedicated components to provide content-aware embeddings and to enrich detail; (2) we proposed a novel mixing training strategy that handles real-world images with different aspect ratios and resolutions and applies tailored loss functions based on the properties of their depth maps; (3) to train a unified model that covers diverse real-world scenes, we used over 1 million images from different datasets. As of 3rd October 2022, our unified model consistently ranked first among all participants across three benchmarks (KITTI, MPI Sintel, and VIPER).
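The abstract does not say which loss functions the authors used, so the following is only a rough sketch of what "tailored loss functions based on the properties of their depth maps" could look like in a mixed-dataset setting. It assumes two hypothetical kinds of ground truth, labeled `"metric"` and `"relative"` here, and pairs them with two losses that are common in the depth estimation literature: a scale-invariant log error and an error computed after least-squares scale-and-shift alignment. The function names, the labels, and the pairing are all illustrative assumptions, not the Megatron_RVC implementation.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Scale-invariant log error over pixels with positive depth.

    With lam=1.0 a global scale factor on pred does not change the loss.
    """
    diffs = [math.log(p) - math.log(t)
             for p, t in zip(pred, target) if p > 0 and t > 0]
    n = len(diffs)
    if n == 0:
        return 0.0
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean


def scale_shift_aligned_loss(pred, target):
    """Mean absolute error after fitting target ~ s * pred + b by least squares."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    n = len(pairs)
    if n == 0:
        return 0.0
    sp = sum(p for p, _ in pairs)
    st = sum(t for _, t in pairs)
    spp = sum(p * p for p, _ in pairs)
    spt = sum(p * t for p, t in pairs)
    denom = n * spp - sp * sp
    if abs(denom) < 1e-12:
        # Constant prediction: only the shift can be fitted.
        s, b = 0.0, st / n
    else:
        s = (n * spt - sp * st) / denom
        b = (st - s * sp) / n
    return sum(abs(s * p + b - t) for p, t in pairs) / n


# Hypothetical mapping from a dataset's depth type to its loss (an assumption).
LOSS_BY_DEPTH_KIND = {
    "metric": scale_invariant_log_loss,
    "relative": scale_shift_aligned_loss,
}


def loss_for_sample(pred, target, depth_kind):
    """Pick the loss according to the (assumed) depth type of the source dataset."""
    try:
        loss_fn = LOSS_BY_DEPTH_KIND[depth_kind]
    except KeyError:
        raise ValueError(f"unknown depth kind: {depth_kind!r}") from None
    return loss_fn(pred, target)
```

In this sketch, depth maps are flat lists of per-pixel values and non-positive targets are treated as invalid pixels. A prediction that differs from the ground truth only by a global scale gives a near-zero `"metric"` loss, and one that differs only by scale and shift gives a near-zero `"relative"` loss.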

Title: Towards a Unified Network for Robust Monocular Depth Estimation: Network Architecture, Training Strategy and Dataset
Authors: Mochu Xiang, Yuchao Dai, Feiyu Zhang, Jiawei Shi, Xinyu Tian, Zhensong Zhang
Publication date: 23-10-2023
Publisher: Springer US
Published in: International Journal of Computer Vision / Issue 4/2024
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-023-01915-6
