Published in: Soft Computing 7/2021

05-01-2021 | Methodologies and Application

Cross-modality co-attention networks for visual question answering

Authors: Dezhi Han, Shuli Zhou, Kuan Ching Li, Rodrigo Fernandes de Mello

Abstract

Visual question answering (VQA) is an emerging task that combines natural language processing and computer vision. Selecting effective multi-modality features is at the core of VQA. In multi-modal learning, attention networks provide an effective way to selectively utilize the given visual information. However, most previous VQA models focus on the relationship between visual and language features and ignore the internal relationships within each modality. To address this issue: (1) we propose a cross-modality co-attention networks (CMCN) framework that learns both intra-modality and cross-modality relationships; (2) the cross-modality co-attention (CMC) module, the core of the framework, is composed of self-attention blocks and guided-attention blocks, where the self-attention block learns intra-modality relations and the guided-attention block models cross-modal interactions between an image and a question, so that a cascade of multiple CMC modules not only improves the fusion of visual and language representations but also captures more representative image and text information; (3) we carried out a thorough experimental verification of the proposed model. Experimental evaluations on the VQA 2.0 dataset confirm that CMCN offers significant performance advantages over existing methods.
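To make the described architecture concrete, below is a minimal PyTorch sketch of one CMC module built from a self-attention block (intra-modality relations) and a guided-attention block (cross-modal interactions), with several modules cascaded. The class names, the hidden size of 512, the head count of 8, and the question-guided direction of the cross-attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Learns intra-modality relations (word-word or region-region)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)   # queries, keys, values from one modality
        return self.norm(x + out)     # residual connection + layer norm

class GuidedAttentionBlock(nn.Module):
    """Models cross-modal interactions: x attends to guidance features y."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, y):
        out, _ = self.attn(x, y, y)   # keys/values come from the other modality
        return self.norm(x + out)

class CMCModule(nn.Module):
    """One cross-modality co-attention (CMC) module."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.sa_txt = SelfAttentionBlock(dim, heads)
        self.sa_img = SelfAttentionBlock(dim, heads)
        self.ga_img = GuidedAttentionBlock(dim, heads)  # image guided by question

    def forward(self, img, txt):
        txt = self.sa_txt(txt)        # intra-modality: question words
        img = self.sa_img(img)        # intra-modality: image regions
        img = self.ga_img(img, txt)   # cross-modality: question-guided image
        return img, txt

# Cascade several CMC modules, as the abstract describes.
cmcn = nn.ModuleList([CMCModule() for _ in range(6)])
img = torch.randn(2, 36, 512)         # e.g., 36 region features per image
txt = torch.randn(2, 14, 512)         # e.g., 14 word features per question
for block in cmcn:
    img, txt = block(img, txt)
```

In this sketch the stacked modules progressively refine both streams, so later blocks attend over already co-attended features; the refined image and question features would then be fused and fed to an answer classifier.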


Metadata
Title
Cross-modality co-attention networks for visual question answering
Authors
Dezhi Han
Shuli Zhou
Kuan Ching Li
Rodrigo Fernandes de Mello
Publication date
05-01-2021
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 7/2021
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-020-05539-7
