Published in: International Journal of Computer Vision, Issue 9/2021

18.06.2021

Unsupervised Scale-Consistent Depth Learning from Video

Authors: Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid

Abstract

We propose SC-Depth, a monocular depth estimation method that requires only unlabelled videos for training and produces scale-consistent predictions at inference time. Our contributions are threefold: (i) we propose a geometry consistency loss that penalizes inconsistencies between the depths predicted in adjacent views; (ii) we propose a self-discovered mask that automatically localizes moving objects, which violate the underlying static-scene assumption and produce noisy training signals; (iii) we demonstrate the efficacy of each component with a detailed ablation study and show high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to the scale-consistent predictions, our monocular-trained deep networks can be readily integrated into the ORB-SLAM2 system for more robust and accurate tracking. The resulting hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation. The source code is released on GitHub.
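
To make the geometry consistency loss (i) and the self-discovered mask (ii) concrete, the sketch below shows how the two quantities can be computed once the depth of one view has been warped into the adjacent view. This is a minimal PyTorch illustration of the idea described above, not the released implementation; it assumes the differentiable warping step has already produced the two aligned depth maps, and the tensor names (projected_depth, computed_depth) are illustrative.

    import torch

    def geometry_consistency(projected_depth: torch.Tensor,
                             computed_depth: torch.Tensor):
        # projected_depth: the reference view's predicted depth warped into
        # the adjacent view; computed_depth: the adjacent view's predicted
        # depth sampled at the corresponding pixels. Both are (B, 1, H, W).
        # Per-pixel depth inconsistency, normalized to lie in [0, 1).
        depth_diff = (projected_depth - computed_depth).abs() / (
            projected_depth + computed_depth)
        # Geometry consistency loss: mean inconsistency over all pixels.
        geometry_loss = depth_diff.mean()
        # Self-discovered mask: close to 0 where the two depths disagree
        # (typically moving objects or occlusions), close to 1 elsewhere.
        weight_mask = 1.0 - depth_diff
        return geometry_loss, weight_mask

During training, the returned mask would weight the per-pixel photometric error before averaging, so that regions violating the static-scene assumption contribute less to the gradient.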

References
Baker, S., & Matthews, I. (2004). Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision (IJCV), 56(3), 221–255.
Bian, J., Lin, W.-Y., Liu, Y., Zhang, L., Yeung, S.-K., Cheng, M.-M., et al. (2020a). GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. International Journal of Computer Vision (IJCV).
Bian, J.-W., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.-M., et al. (2019a). Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Neural Information Processing Systems (NeurIPS).
Bian, J.-W., Wu, Y.-H., Zhao, J., Liu, Y., Zhang, L., Cheng, M.-M., et al. (2019b). An evaluation of feature matchers for fundamental matrix estimation. In British Machine Vision Conference (BMVC).
Bian, J.-W., Zhan, H., Wang, N., Chin, T.-J., Shen, C., & Reid, I. (2020b). Unsupervised depth learning in challenging indoor video: Weak rectification to rescue. arXiv preprint arXiv:2006.02708.
Bloesch, M., Czarnowski, J., Clark, R., Leutenegger, S., & Davison, A. J. (2018). CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2560–2568).
Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV) (pp. 611–625).
Casser, V., Pirk, S., Mahjourian, R., & Angelova, A. (2019a). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Association for the Advancement of Artificial Intelligence (AAAI).
Casser, V., Pirk, S., Mahjourian, R., & Angelova, A. (2019b). Unsupervised monocular depth and ego-motion learning with structure and semantics. In CVPR Workshop on Visual Odometry and Computer Vision Applications Based on Location Cues (VOCVALC).
Chakrabarti, A., Shao, J., & Shakhnarovich, G. (2016). Depth from a single image by harmonizing overcomplete local network predictions. In Neural Information Processing Systems (NeurIPS).
Chen, W., Qian, S., & Deng, J. (2019a). Learning single-image depth from videos using quality assessment networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5604–5613).
Chen, Y., Schmid, C., & Sminchisescu, C. (2019b). Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In IEEE International Conference on Computer Vision (ICCV) (pp. 7063–7072).
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
Curless, B., & Levoy, M. (1996). A volumetric method for building complex models from range images. In ACM Transactions on Graphics (SIGGRAPH).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Eigen, D., & Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In IEEE International Conference on Computer Vision (ICCV).
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NeurIPS).
Engel, J., Koltun, V., & Cremers, D. (2017). Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(3), 611–625.
Forster, C., Pizzoli, M., & Scaramuzza, D. (2014). SVO: Fast semi-direct monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 15–22). IEEE.
Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2002–2011).
Garg, R., BG, V. K., Carneiro, G., & Reid, I. (2016). Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision (ECCV). Springer.
Garg, R., Wadhwa, N., Ansari, S., & Barron, J. T. (2019). Learning single camera depth estimation using dual-pixels. In IEEE International Conference on Computer Vision (ICCV) (pp. 7628–7637).
Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR).
Geiger, A., Ziegler, J., & Stiller, C. (2011). StereoScan: Dense 3D reconstruction in real-time. In Intelligent Vehicles Symposium (IV).
Godard, C., Mac Aodha, O., & Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In IEEE International Conference on Computer Vision (ICCV).
Gordon, A., Li, H., Jonschkowski, R., & Angelova, A. (2019). Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In IEEE International Conference on Computer Vision (ICCV).
Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., & Gaidon, A. (2020a). 3D packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Guizilini, V., Hou, R., Li, J., Ambrus, R., & Gaidon, A. (2020b). Semantically-guided representation learning for self-supervised monocular depth. In International Conference on Learning Representations (ICLR).
Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Hirschmuller, H. (2005). Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Volume 2 (pp. 807–814).
Huynh, L., Nguyen-Ha, P., Matas, J., Rahtu, E., & Heikkilä, J. (2020). Guiding monocular depth estimation using depth-attention volume. In European Conference on Computer Vision (ECCV) (pp. 581–597). Springer.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Neural Information Processing Systems (NeurIPS).
Jeong, J., Cho, Y., Shin, Y.-S., Roh, H., & Kim, A. (2019). Complex urban dataset with multi-level sensors from highly diverse urban environments. The International Journal of Robotics Research.
Klein, G., & Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 225–234). IEEE.
Klingner, M., Termöhlen, J.-A., Mikolajczyk, J., & Fingscheidt, T. (2020). Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In European Conference on Computer Vision (ECCV) (pp. 582–600). Springer.
Kuznietsov, Y., Stuckler, J., & Leibe, B. (2017). Semi-supervised deep learning for monocular depth map prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV).
Lee, S., Im, S., Lin, S., & Kweon, I. S. (2021). Learning monocular depth in dynamic scenes via instance-aware projection consistency. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
Li, H., Gordon, A., Zhao, H., Casser, V., & Angelova, A. (2020). Unsupervised monocular depth learning in dynamic scenes. In Conference on Robot Learning (CoRL).
Li, J., Klein, R., & Yao, A. (2017). A two-streamed network for estimating fine-scaled depth maps from single RGB images. In IEEE International Conference on Computer Vision (ICCV).
Li, R., Wang, S., Long, Z., & Gu, D. (2018). UnDeepVO: Monocular visual odometry through unsupervised deep learning. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 7286–7291). IEEE.
Li, Y., Ushiku, Y., & Harada, T. (2019a). Pose graph optimization for unsupervised monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 5439–5445). IEEE.
Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., et al. (2019b). Learning the depths of moving people by watching frozen people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4521–4530).
Li, Z., & Snavely, N. (2018). MegaDepth: Learning single-view depth prediction from internet photos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 2041–2050).
Liu, F., Shen, C., Lin, G., & Reid, I. (2016). Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(10).
Loo, S. Y., Amiri, A. J., Mashohor, S., Tang, S. H., & Zhang, H. (2019). CNN-SVO: Improving the mapping in semi-direct visual odometry using single-image depth prediction. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 5218–5223). IEEE.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2), 91–110.
Luo, C., Yang, Z., Wang, P., Wang, Y., Xu, W., Nevatia, R., et al. (2019). Every pixel counts++: Joint learning of geometry and motion with 3D holistic understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10), 2624–2641.
Luo, X., Huang, J.-B., Szeliski, R., Matzen, K., & Kopf, J. (2020). Consistent video depth estimation. ACM Transactions on Graphics (SIGGRAPH).
Mahjourian, R., Wicke, M., & Angelova, A. (2018). Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Menze, M., & Geiger, A. (2015). Object scene flow for autonomous vehicles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3061–3070).
Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. (2015). ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics (TRO), 31(5).
Mur-Artal, R., & Tardós, J. D. (2014). Fast relocalisation and loop closing in keyframe-based SLAM. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 846–853). IEEE.
Mur-Artal, R., & Tardós, J. D. (2017). ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics (TRO).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch. In NIPS-W.
Pillai, S., Ambruş, R., & Gaidon, A. (2019). SuperDepth: Self-supervised, super-resolved monocular depth estimation. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 9250–9256). IEEE.
Pilzer, A., Xu, D., Puscas, M., Ricci, E., & Sebe, N. (2018). Unsupervised adversarial depth estimation using cycled generative networks. In International Conference on 3D Vision (3DV) (pp. 587–595).
Poggi, M., Aleotti, F., Tosi, F., & Mattoccia, S. (2020). On the uncertainty of self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Prisacariu, V. A., Kähler, O., Golodetz, S., Sapienza, M., Cavallari, T., Torr, P. H., et al. (2017). InfiniTAM v3: A framework for large-scale 3D reconstruction with loop closure. arXiv preprint arXiv:1708.00783.
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
Ranjan, A., Jampani, V., Kim, K., Sun, D., Wulff, J., & Black, M. J. (2019). Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer.
Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. R. (2011). ORB: An efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision (ICCV).
Saxena, A., Chung, S. H., & Ng, A. Y. (2006). Learning depth from single monocular images. In Neural Information Processing Systems (NeurIPS).
Schonberger, J. L., & Frahm, J.-M. (2016). Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4104–4113).
Schönberger, J. L., Zheng, E., Pollefeys, M., & Frahm, J.-M. (2016). Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV).
Shen, T., Luo, Z., Zhou, L., Deng, H., Zhang, R., Fang, T., et al. (2019). Beyond photometric loss for self-supervised ego-motion estimation. In IEEE International Conference on Robotics and Automation (ICRA) (pp. 6359–6365). IEEE.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV).
Sturm, J., Engelhard, N., Endres, F., Burgard, W., & Cremers, D. (2012). A benchmark for the evaluation of RGB-D SLAM systems. In IEEE International Conference on Intelligent Robots and Systems (IROS).
Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6243–6252).
Tiwari, L., Ji, P., Tran, Q.-H., Zhuang, B., Anand, S., & Chandraker, M. (2020). Pseudo RGB-D for self-improving monocular SLAM and depth prediction. In European Conference on Computer Vision (ECCV) (pp. 437–455). Springer.
Vijayanarasimhan, S., Ricco, S., Schmid, C., Sukthankar, R., & Fragkiadaki, K. (2017). SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804.
Wang, C., Lucey, S., Perazzi, F., & Wang, O. (2019). Web stereo video supervision for depth prediction from dynamic scenes. In International Conference on 3D Vision (3DV) (pp. 348–357). IEEE.
Wang, C., Miguel Buenaposada, J., Zhu, R., & Lucey, S. (2018). Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., & Yuille, A. L. (2015). Towards unified depth and semantic prediction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4).
Xian, K., Shen, C., Cao, Z., Lu, H., Xiao, Y., Li, R., et al. (2018). Monocular relative depth perception with web stereo data supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 311–320).
Yang, N., Stumberg, L. v., Wang, R., & Cremers, D. (2020). D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1281–1292).
Yang, N., Wang, R., Stuckler, J., & Cremers, D. (2018a). Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry. In European Conference on Computer Vision (ECCV).
Yang, Z., Wang, P., Xu, W., Zhao, L., & Nevatia, R. (2018b). Unsupervised learning of geometry with edge-aware depth-normal consistency. In Association for the Advancement of Artificial Intelligence (AAAI).
Yin, W., Liu, Y., Shen, C., & Yan, Y. (2019). Enforcing geometric constraints of virtual normal for depth prediction. In IEEE International Conference on Computer Vision (ICCV).
Yin, W., Zhang, J., Wang, O., Niklaus, S., Mai, L., Chen, S., et al. (2020). Learning to recover 3D scene shape from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yin, X., Wang, X., Du, X., & Chen, Q. (2017). Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields. In IEEE International Conference on Computer Vision (ICCV) (pp. 5870–5878).
Yin, Z., & Shi, J. (2018). GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhan, H., Garg, R., Saroj Weerasekera, C., Li, K., Agarwal, H., & Reid, I. (2018). Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, Z. (1998). Determining the epipolar geometry and its uncertainty: A review. International Journal of Computer Vision (IJCV), 27(2), 161–195.
Zhao, W., Liu, S., Shu, Y., & Liu, Y.-J. (2020). Towards better generalization: Joint depth-pose learning without PoseNet. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhou, J., Wang, Y., Qin, K., & Zeng, W. (2019). Moving indoor: Unsupervised video depth learning in challenging environments. In IEEE International Conference on Computer Vision (ICCV).
Zhou, Q.-Y., Park, J., & Koltun, V. (2018). Open3D: A modern library for 3D data processing. arXiv preprint arXiv:1801.09847.
Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Unsupervised learning of depth and ego-motion from video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zou, Y., Ji, P., Tran, Q.-H., Huang, J.-B., & Chandraker, M. (2020). Learning monocular visual odometry via self-supervised long-term modeling. In European Conference on Computer Vision (ECCV).
Zou, Y., Luo, Z., & Huang, J.-B. (2018). DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision (ECCV).
Metadata
Title
Unsupervised Scale-Consistent Depth Learning from Video
Authors
Jia-Wang Bian
Huangying Zhan
Naiyan Wang
Zhichao Li
Le Zhang
Chunhua Shen
Ming-Ming Cheng
Ian Reid
Publication date
18.06.2021
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 9/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-021-01484-6
