Published in: International Journal of Computer Vision 9/2021

18-06-2021

Unsupervised Scale-Consistent Depth Learning from Video

Authors: Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid



Abstract

We propose SC-Depth, a monocular depth estimation method that requires only unlabelled videos for training and enables scale-consistent prediction at inference time. Our contributions are threefold: (i) we propose a geometry consistency loss, which penalizes inconsistency between the depths predicted in adjacent views; (ii) we propose a self-discovered mask that automatically localizes moving objects, which violate the underlying static-scene assumption and produce noisy training signals; (iii) we demonstrate the efficacy of each component in a detailed ablation study and show high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to scale-consistent prediction, our monocular-trained depth networks integrate readily into the ORB-SLAM2 system for more robust and accurate tracking. The resulting hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation. The source code is released on GitHub.
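The two quantities in (i) and (ii) can be sketched in a few lines. Below is a minimal NumPy illustration, not the authors' implementation: it assumes the depth map of one view has already been warped into the other view's pixel grid (the full method additionally requires differentiable warping using the predicted pose and camera intrinsics). The function name and array shapes are hypothetical.

```python
import numpy as np

def geometry_consistency(d_a2b, d_b):
    """Per-pixel depth inconsistency between the depth of view A
    projected into view B (d_a2b) and view B's predicted depth,
    sampled at the corresponding pixels (d_b).

    The normalized difference lies in [0, 1); 0 means perfectly
    consistent. Its mean serves as the geometry consistency loss,
    and (1 - difference) serves as the self-discovered mask that
    down-weights pixels on moving objects during training.
    """
    diff = np.abs(d_a2b - d_b) / (d_a2b + d_b)
    loss = diff.mean()        # geometry consistency loss (scalar)
    mask = 1.0 - diff         # self-discovered mask, same shape as inputs
    return loss, mask
```

Because the difference is normalized by the sum of the two depths, the loss is symmetric in the two views and insensitive to the absolute depth scale, which is what allows the network to converge to a single consistent scale across the whole video.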


go back to reference Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. (2004). Image Quality Assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4). Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. (2004). Image Quality Assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4).
go back to reference Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 311–320). Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 311–320).
go back to reference Yang, N., Stumberg, L. v., Wang, R., & Cremers, D. (2020). D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1281–1292). Yang, N., Stumberg, L. v., Wang, R., & Cremers, D. (2020). D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1281–1292).
go back to reference Yang, N., Wang, R., Stuckler, J., & Cremers, D. (2018a). Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In European Conference on Computer Vision (ECCV). Yang, N., Wang, R., Stuckler, J., & Cremers, D. (2018a). Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In European Conference on Computer Vision (ECCV).
go back to reference Yang, Z., Wang, P., Xu, W., Zhao, L., & Nevatia, R. (2018b). Unsupervised learning of geometry with edge-aware depth-normal consistency. In Association for the Advancement of Artificial Intelligence (AAAI). Yang, Z., Wang, P., Xu, W., Zhao, L., & Nevatia, R. (2018b). Unsupervised learning of geometry with edge-aware depth-normal consistency. In Association for the Advancement of Artificial Intelligence (AAAI).
go back to reference Yin, W., Liu, Y., Shen, C., & Yan, Y. (2019). Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV). Yin, W., Liu, Y., Shen, C., & Yan, Y. (2019). Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV).
go back to reference Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., et al. (2020). Learning to recover 3d scene shape from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., et al. (2020). Learning to recover 3d scene shape from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Yin, X., Wang, X., Du, X., & Chen, Q. (2017). Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields. In IEEE International Conference on Computer Vision (ICCV) (pp. 5870–5878). Yin, X., Wang, X., Du, X., & Chen, Q. (2017). Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields. In IEEE International Conference on Computer Vision (ICCV) (pp. 5870–5878).
go back to reference Yin, Z., & Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Yin, Z., & Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., & Reid, I. (2018). Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., & Reid, I. (2018). Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zhang, Z. (1998). Determining the epipolar geometry and its uncertainty: A review. International Journal on Computer Vision (IJCV), 27(2), 161–195.CrossRef Zhang, Z. (1998). Determining the epipolar geometry and its uncertainty: A review. International Journal on Computer Vision (IJCV), 27(2), 161–195.CrossRef
go back to reference Zhao, W., Liu, S., Shu, Y., & Liu, Y.-J. (2020). Towards better generalization: Joint depth-pose learning without posenet. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Zhao, W., Liu, S., Shu, Y., & Liu, Y.-J. (2020). Towards better generalization: Joint depth-pose learning without posenet. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zhou, J., Wang, Y., Qin, K., & Zeng, W. (2019). Moving indoor: Unsupervised video depth learning in challenging environments. In IEEE International Conference on Computer Vision (ICCV). Zhou, J., Wang, Y., Qin, K., & Zeng, W. (2019). Moving indoor: Unsupervised video depth learning in challenging environments. In IEEE International Conference on Computer Vision (ICCV).
go back to reference Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. arXiv:1801.09847. Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. arXiv:1801.09847.
go back to reference Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
go back to reference Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., & Chandraker, M. (2020). Learning monocular visual odometry via self-supervised long-term modeling. In European Conference on Computer Vision (ECCV). Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., & Chandraker, M. (2020). Learning monocular visual odometry via self-supervised long-term modeling. In European Conference on Computer Vision (ECCV).
go back to reference Zou, Y., Luo, Z., & Huang, J.-B. (2018). DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision (ECCV). Zou, Y., Luo, Z., & Huang, J.-B. (2018). DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision (ECCV).
Metadata
Title: Unsupervised Scale-Consistent Depth Learning from Video
Authors: Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid
Publication date: 18-06-2021
Publisher: Springer US
Published in: International Journal of Computer Vision, Issue 9/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI: https://doi.org/10.1007/s11263-021-01484-6