Published in: International Journal of Computer Vision | Issue 8/2021

18.05.2021

Cross-Modal Pyramid Translation for RGB-D Scene Recognition

Authors: Dapeng Du, Limin Wang, Zhaoyang Li, Gangshan Wu


Abstract

Existing RGB-D scene recognition approaches typically employ two separate, modality-specific networks to learn effective RGB and depth representations. This independent training scheme fails to capture the correlation between the two modalities and thus may be suboptimal for RGB-D scene recognition. To address this issue, this paper proposes a general and flexible framework, coined TRecgNet, that enhances RGB-D representation learning with a customized cross-modal pyramid translation branch. The framework unifies cross-modal translation and modality-specific recognition with a shared feature encoder, and leverages the correspondence between the two modalities to regularize the representation learning of each. Specifically, we present a cross-modal pyramid translation strategy that performs multi-scale image generation under a carefully designed layer-wise perceptual supervision. To make cross-modal translation more complementary to modality-specific scene recognition, we devise a feature selection module that adaptively enhances discriminative information during the translation procedure. In addition, we train multiple auxiliary classifiers to further regularize the generated data so that its label predictions remain consistent with those of its paired real data. Meanwhile, the translation branch enables us to generate cross-modal data for training-time data augmentation, further improving single-modality scene recognition. Extensive experiments on the SUN RGB-D and NYU Depth V2 benchmarks demonstrate the superiority of the proposed method over state-of-the-art RGB-D scene recognition methods. We also generalize TRecgNet to the single-modality scene recognition benchmark MIT Indoor, automatically synthesizing a depth view to boost the final recognition accuracy.
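To make the objective described in the abstract concrete, the sketch below shows how a shared encoder can feed both a scene classifier and a pyramid translation decoder, with layer-wise perceptual supervision and per-scale auxiliary classifiers on the generated modality. This is a minimal PyTorch sketch, not the authors' released implementation: the layer sizes, the channel-gating form of the feature selection module, the fixed VGG feature extractor, and the loss weights are all illustrative assumptions.

```python
# Minimal sketch of a TRecgNet-style objective: shared encoder -> (i) classifier,
# (ii) pyramid translation decoder supervised perceptually at multiple scales,
# plus auxiliary classifiers that keep generated images label-consistent.
# All hyperparameters and module choices below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))


class TRecgNetSketch(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        # shared encoder: three stride-2 stages (1/2, 1/4, 1/8 resolution)
        self.enc1, self.enc2, self.enc3 = conv_block(3, 64), conv_block(64, 128), conv_block(128, 256)
        self.classifier = nn.Linear(256, num_classes)
        # feature-selection gate: channel re-weighting before decoding (assumed form)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(256, 256, 1), nn.Sigmoid())
        # pyramid decoder: translated images at 1/8, 1/4 and 1/2 of the input size
        self.up1 = nn.ConvTranspose2d(256, 128, 4, 2, 1)
        self.up2 = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.to_img = nn.ModuleList([nn.Conv2d(c, 3, 3, 1, 1) for c in (256, 128, 64)])
        # one small auxiliary classifier per pyramid scale, applied to generated images
        self.aux = nn.ModuleList([
            nn.Sequential(conv_block(3, 32), nn.AdaptiveAvgPool2d(1),
                          nn.Flatten(), nn.Linear(32, num_classes))
            for _ in range(3)
        ])

    def forward(self, x):
        f = self.enc3(self.enc2(self.enc1(x)))
        logits = self.classifier(f.mean(dim=(2, 3)))   # modality-specific recognition
        h = f * self.gate(f)                           # keep discriminative channels
        p1 = torch.tanh(self.to_img[0](h))
        h = F.relu(self.up1(h))
        p2 = torch.tanh(self.to_img[1](h))
        h = F.relu(self.up2(h))
        p3 = torch.tanh(self.to_img[2](h))
        return logits, [p1, p2, p3]


def trecg_loss(model, percept_net, x_src, x_tgt, label, w_trans=10.0, w_aux=1.0):
    """Recognition loss + multi-scale perceptual translation loss + auxiliary
    label-consistency loss on the generated modality (weights are assumptions)."""
    logits, pyramid = model(x_src)
    loss = F.cross_entropy(logits, label)
    for gen, aux_cls in zip(pyramid, model.aux):
        tgt = F.interpolate(x_tgt, size=gen.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + w_trans * F.l1_loss(percept_net(gen), percept_net(tgt))  # perceptual supervision
        loss = loss + w_aux * F.cross_entropy(aux_cls(gen), label)             # consistent prediction
    return loss


if __name__ == "__main__":
    model = TRecgNetSketch()
    # fixed perceptual network (pretrained weights would be loaded in practice)
    percept_net = torchvision.models.vgg16(weights=None).features[:9].eval()
    for p in percept_net.parameters():
        p.requires_grad_(False)
    rgb = torch.randn(2, 3, 224, 224)
    depth = torch.randn(2, 3, 224, 224)  # e.g. HHA-encoded depth as a 3-channel image
    trecg_loss(model, percept_net, rgb, depth, torch.tensor([0, 1])).backward()
```

The same sketch applies symmetrically to an RGB-to-depth and a depth-to-RGB branch; as the abstract notes, the translation branch can also be run at training time to synthesize the missing modality for data augmentation.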


Metadata
Title
Cross-Modal Pyramid Translation for RGB-D Scene Recognition
Authors
Dapeng Du
Limin Wang
Zhaoyang Li
Gangshan Wu
Publication date
18.05.2021
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 8/2021
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-021-01475-7
