Published in: International Journal of Computer Vision 9/2021

18-06-2021

Unsupervised Scale-Consistent Depth Learning from Video

Authors: Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid



Abstract

We propose SC-Depth, a monocular depth estimation method that requires only unlabelled videos for training and enables scale-consistent prediction at inference time. Our contributions are threefold: (i) we propose a geometry consistency loss, which penalizes inconsistency between the depths predicted in adjacent views; (ii) we propose a self-discovered mask that automatically localizes moving objects, which violate the underlying static-scene assumption and produce noisy training signals; (iii) we demonstrate the efficacy of each component in a detailed ablation study and show high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to scale-consistent prediction, our monocular-trained depth networks integrate readily into the ORB-SLAM2 system for more robust and accurate tracking. The resulting hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation. The source code is released on GitHub.
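The two quantities in (i) and (ii) can be sketched in a few lines. Below is a minimal NumPy illustration, not the authors' implementation: it assumes the depth map of one view has already been warped into the other view's pixel grid (the full method additionally requires differentiable warping using the predicted pose and camera intrinsics). The function name and array shapes are hypothetical.

```python
import numpy as np

def geometry_consistency(d_a2b, d_b):
    """Per-pixel depth inconsistency between the depth of view A
    projected into view B (d_a2b) and view B's predicted depth,
    sampled at the corresponding pixels (d_b).

    The normalized difference lies in [0, 1); 0 means perfectly
    consistent. Its mean serves as the geometry consistency loss,
    and (1 - difference) serves as the self-discovered mask that
    down-weights pixels on moving objects during training.
    """
    diff = np.abs(d_a2b - d_b) / (d_a2b + d_b)
    loss = diff.mean()        # geometry consistency loss (scalar)
    mask = 1.0 - diff         # self-discovered mask, same shape as inputs
    return loss, mask
```

Because the difference is normalized by the sum of the two depths, the loss is symmetric in the two views and insensitive to the absolute depth scale, which is what allows the network to converge to a single consistent scale across the whole video.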


go back to reference Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. (2004). Image Quality Assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4). Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. (2004). Image Quality Assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4).
go back to reference Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 311–320). Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 311–320).
go back to reference Yang, N., Stumberg, L. v., Wang, R., & Cremers, D. (2020). D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1281–1292). Yang, N., Stumberg, L. v., Wang, R., & Cremers, D. (2020). D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1281–1292).
go back to reference Yang, N., Wang, R., Stuckler, J., & Cremers, D. (2018a). Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In European Conference on Computer Vision (ECCV). Yang, N., Wang, R., Stuckler, J., & Cremers, D. (2018a). Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In European Conference on Computer Vision (ECCV).
go back to reference Yang, Z., Wang, P., Xu, W., Zhao, L., & Nevatia, R. (2018b). Unsupervised learning of geometry with edge-aware depth-normal consistency. In Association for the Advancement of Artificial Intelligence (AAAI). Yang, Z., Wang, P., Xu, W., Zhao, L., & Nevatia, R. (2018b). Unsupervised learning of geometry with edge-aware depth-normal consistency. In Association for the Advancement of Artificial Intelligence (AAAI).
go back to reference Yin, W., Liu, Y., Shen, C., & Yan, Y. (2019). Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV). Yin, W., Liu, Y., Shen, C., & Yan, Y. (2019). Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV).
go back to reference Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., et al. (2020). Learning to recover 3d scene shape from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., et al. (2020). Learning to recover 3d scene shape from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Yin, X., Wang, X., Du, X., & Chen, Q. (2017). Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields. In IEEE International Conference on Computer Vision (ICCV) (pp. 5870–5878). Yin, X., Wang, X., Du, X., & Chen, Q. (2017). Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields. In IEEE International Conference on Computer Vision (ICCV) (pp. 5870–5878).
go back to reference Yin, Z., & Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Yin, Z., & Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., & Reid, I. (2018). Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., & Reid, I. (2018). Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zhang, Z. (1998). Determining the epipolar geometry and its uncertainty: A review. International Journal on Computer Vision (IJCV), 27(2), 161–195.CrossRef Zhang, Z. (1998). Determining the epipolar geometry and its uncertainty: A review. International Journal on Computer Vision (IJCV), 27(2), 161–195.CrossRef
go back to reference Zhao, W., Liu, S., Shu, Y., & Liu, Y.-J. (2020). Towards better generalization: Joint depth-pose learning without posenet. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Zhao, W., Liu, S., Shu, Y., & Liu, Y.-J. (2020). Towards better generalization: Joint depth-pose learning without posenet. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zhou, J., Wang, Y., Qin, K., & Zeng, W. (2019). Moving indoor: Unsupervised video depth learning in challenging environments. In IEEE International Conference on Computer Vision (ICCV). Zhou, J., Wang, Y., Qin, K., & Zeng, W. (2019). Moving indoor: Unsupervised video depth learning in challenging environments. In IEEE International Conference on Computer Vision (ICCV).
go back to reference Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. arXiv:1801.09847. Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. arXiv:1801.09847.
go back to reference Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., & Chandraker, M. (2020). Learning monocular visual odometry via self-supervised long-term modeling. In European Conference on Computer Vision (ECCV). Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., & Chandraker, M. (2020). Learning monocular visual odometry via self-supervised long-term modeling. In European Conference on Computer Vision (ECCV).
go back to reference Zou, Y., Luo, Z., & Huang, J.-B. (2018). DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision (ECCV). Zou, Y., Luo, Z., & Huang, J.-B. (2018). DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision (ECCV).
Metadata
Title: Unsupervised Scale-Consistent Depth Learning from Video
Authors: Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid
Publication date: 18-06-2021
Publisher: Springer US
Published in: International Journal of Computer Vision, Issue 9/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-021-01484-6