Published in: Neural Processing Letters 2/2022

13-01-2022

Research on Visual Question Answering Based on GAT Relational Reasoning

Authors: Yalin Miao, Wenfang Cheng, Shuyun He, Hui Jiang


Abstract

The diversity of questions in visual question answering (VQA) poses new challenges for model construction. Existing VQA models focus on devising ever-new attention mechanisms, which makes them increasingly complex. In addition, most concentrate on object recognition and neglect spatial reasoning, semantic relations, and scene understanding. This paper therefore proposes a Graph Attention Network Relational Reasoning (GAT2R) model, which consists of two stages: scene graph generation and scene-graph answer prediction. The scene graph generation module extracts the regional and spatial features of objects with an object detection model and predicts the relations between object pairs with a relation decoder. The answer prediction module dynamically updates the node representations through a question-guided graph attention network, fuses them with the question features in a multi-modal fusion step, and finally outputs the answer. Experiments show that the proposed model reaches 54.45% accuracy on GQA, a natural-scene dataset centered on relational reasoning, and 68.04% on the widely used VQA2.0 dataset. Compared with the benchmark model, this is an improvement of 4.71% on GQA and 2.37% on VQA2.0, which demonstrates the effectiveness and generalization ability of the model.
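The question-guided graph attention step described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name, the projection matrices `W`/`Wq`, and the single attention vector `a` over the concatenation `[h_i, h_j, q]` are all assumptions; the paper's actual layer may differ in scoring function, normalization, and multi-head structure.

```python
import numpy as np

def softmax(x):
    """Numerically stable row-wise softmax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def question_guided_gat_layer(H, A, q, W, Wq, a):
    """One question-guided graph-attention update (hypothetical sketch).

    H  : (N, d)   node (object) features from the scene graph
    A  : (N, N)   adjacency mask; A[i, j] = 1 if an edge i -> j exists
    q  : (dq,)    question feature vector
    W  : (d, dp)  node projection;  Wq : (dq, dp) question projection
    a  : (3*dp,)  attention vector over the concatenation [h_i, h_j, q]
    """
    Z = H @ W                          # project node features
    qz = q @ Wq                        # project the question into the same space
    N = Z.shape[0]
    logits = np.full((N, N), -1e9)     # non-edges get a very negative score
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                # score edge (i, j) conditioned on the question
                logits[i, j] = np.concatenate([Z[i], Z[j], qz]) @ a
    alpha = softmax(logits)            # attention over each node's neighbours
    return np.tanh(alpha @ Z)          # question-aware updated node states
```

Because the question vector enters every edge score, the attention weights (and hence the updated node states) shift with the question, which is what lets the same scene graph support different reasoning paths for different questions.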

Metadata
Title: Research on Visual Question Answering Based on GAT Relational Reasoning
Authors: Yalin Miao, Wenfang Cheng, Shuyun He, Hui Jiang
Publication date: 13-01-2022
Publisher: Springer US
Published in: Neural Processing Letters / Issue 2/2022
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-021-10689-2
