Published in: International Journal of Computer Vision | Issue 8/2021

18.05.2021

Cross-Modal Pyramid Translation for RGB-D Scene Recognition

Authors: Dapeng Du, Limin Wang, Zhaoyang Li, Gangshan Wu


Abstract

Existing RGB-D scene recognition approaches typically employ two separate, modality-specific networks to learn effective RGB and depth representations. This independent training scheme fails to capture the correlation between the two modalities and thus may be suboptimal for RGB-D scene recognition. To address this issue, this paper proposes a general and flexible framework, coined TRecgNet, that enhances RGB-D representation learning with a customized cross-modal pyramid translation branch. The framework unifies cross-modal translation and modality-specific recognition with a shared feature encoder, and leverages the correspondence between the two modalities to regularize the representation learning of each. Specifically, we present a cross-modal pyramid translation strategy that performs multi-scale image generation under a carefully designed layer-wise perceptual supervision. To make cross-modal translation more complementary to modality-specific scene recognition, we devise a feature selection module that adaptively enhances discriminative information during the translation procedure. In addition, we train multiple auxiliary classifiers to further regularize the generated data so that its label predictions remain consistent with those of its paired real data. Meanwhile, the translation branch enables us to generate cross-modal data for training-time data augmentation, further improving single-modality scene recognition. Extensive experiments on the SUN RGB-D and NYU Depth V2 benchmarks demonstrate the superiority of the proposed method over state-of-the-art RGB-D scene recognition methods. We also generalize TRecgNet to the single-modality scene recognition benchmark MIT Indoor, automatically synthesizing a depth view to boost the final recognition accuracy.
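To make the objective described in the abstract concrete, the sketch below shows how a shared encoder can feed both a scene classifier and a pyramid translation decoder, with layer-wise perceptual supervision and per-scale auxiliary classifiers on the generated modality. This is a minimal PyTorch sketch, not the authors' released implementation: the layer sizes, the channel-gating form of the feature selection module, the fixed VGG feature extractor, and the loss weights are all illustrative assumptions.

```python
# Minimal sketch of a TRecgNet-style objective: shared encoder -> (i) classifier,
# (ii) pyramid translation decoder supervised perceptually at multiple scales,
# plus auxiliary classifiers that keep generated images label-consistent.
# All hyperparameters and module choices below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))


class TRecgNetSketch(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        # shared encoder: three stride-2 stages (1/2, 1/4, 1/8 resolution)
        self.enc1, self.enc2, self.enc3 = conv_block(3, 64), conv_block(64, 128), conv_block(128, 256)
        self.classifier = nn.Linear(256, num_classes)
        # feature-selection gate: channel re-weighting before decoding (assumed form)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(256, 256, 1), nn.Sigmoid())
        # pyramid decoder: translated images at 1/8, 1/4 and 1/2 of the input size
        self.up1 = nn.ConvTranspose2d(256, 128, 4, 2, 1)
        self.up2 = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.to_img = nn.ModuleList([nn.Conv2d(c, 3, 3, 1, 1) for c in (256, 128, 64)])
        # one small auxiliary classifier per pyramid scale, applied to generated images
        self.aux = nn.ModuleList([
            nn.Sequential(conv_block(3, 32), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(32, num_classes))
            for _ in range(3)
        ])

    def forward(self, x):
        f = self.enc3(self.enc2(self.enc1(x)))
        logits = self.classifier(f.mean(dim=(2, 3)))   # modality-specific recognition
        h = f * self.gate(f)                           # keep discriminative channels
        p1 = torch.tanh(self.to_img[0](h))
        h = F.relu(self.up1(h))
        p2 = torch.tanh(self.to_img[1](h))
        h = F.relu(self.up2(h))
        p3 = torch.tanh(self.to_img[2](h))
        return logits, [p1, p2, p3]


def trecg_loss(model, percept_net, x_src, x_tgt, label, w_trans=10.0, w_aux=1.0):
    """Recognition loss + multi-scale perceptual translation loss + auxiliary
    label-consistency loss on the generated modality (weights are assumptions)."""
    logits, pyramid = model(x_src)
    loss = F.cross_entropy(logits, label)
    for gen, aux_cls in zip(pyramid, model.aux):
        tgt = F.interpolate(x_tgt, size=gen.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + w_trans * F.l1_loss(percept_net(gen), percept_net(tgt))  # perceptual supervision
        loss = loss + w_aux * F.cross_entropy(aux_cls(gen), label)             # consistent prediction
    return loss


if __name__ == "__main__":
    model = TRecgNetSketch()
    # fixed perceptual network (pretrained weights would be loaded in practice)
    percept_net = torchvision.models.vgg16(weights=None).features[:9].eval()
    for p in percept_net.parameters():
        p.requires_grad_(False)
    rgb = torch.randn(2, 3, 224, 224)
    depth = torch.randn(2, 3, 224, 224)  # e.g. HHA-encoded depth as a 3-channel image
    trecg_loss(model, percept_net, rgb, depth, torch.tensor([0, 1])).backward()
```

The same sketch applies symmetrically to an RGB-to-depth and a depth-to-RGB branch; as the abstract notes, the translation branch can also be run at training time to synthesize the missing modality for data augmentation.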


Metadata
Title
Cross-Modal Pyramid Translation for RGB-D Scene Recognition
Authors
Dapeng Du
Limin Wang
Zhaoyang Li
Gangshan Wu
Publication date
18.05.2021
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 8/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-021-01475-7
