Published in: Neural Processing Letters 2/2022

13-01-2022

Research on Visual Question Answering Based on GAT Relational Reasoning

Authors: Yalin Miao, Wenfang Cheng, Shuyun He, Hui Jiang


Abstract

The diversity of questions in visual question answering (VQA) poses new challenges for model construction. Existing VQA models focus on devising ever-new attention mechanisms, which makes them increasingly complex. In addition, most concentrate on object recognition and neglect spatial reasoning, semantic relations, and scene understanding. This paper therefore proposes a Graph Attention Network Relational Reasoning (GAT2R) model, which consists of two stages: scene graph generation and scene-graph answer prediction. The scene graph generation module extracts the regional and spatial features of objects with an object detection model and predicts the relations between object pairs with a relation decoder. The answer prediction module dynamically updates the node representations through a question-guided graph attention network, fuses them with the question features in a multi-modal fusion step, and finally outputs the answer. Experiments show that the proposed model reaches 54.45% accuracy on GQA, a natural-scene dataset centered on relational reasoning, and 68.04% on the widely used VQA2.0 dataset. Compared with the benchmark model, this is an improvement of 4.71% on GQA and 2.37% on VQA2.0, which demonstrates the effectiveness and generalization ability of the model.
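The question-guided graph attention step described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name, the projection matrices `W`/`Wq`, and the single attention vector `a` over the concatenation `[h_i, h_j, q]` are all assumptions; the paper's actual layer may differ in scoring function, normalization, and multi-head structure.

```python
import numpy as np

def softmax(x):
    """Numerically stable row-wise softmax."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def question_guided_gat_layer(H, A, q, W, Wq, a):
    """One question-guided graph-attention update (hypothetical sketch).

    H  : (N, d)   node (object) features from the scene graph
    A  : (N, N)   adjacency mask; A[i, j] = 1 if an edge i -> j exists
    q  : (dq,)    question feature vector
    W  : (d, dp)  node projection;  Wq : (dq, dp) question projection
    a  : (3*dp,)  attention vector over the concatenation [h_i, h_j, q]
    """
    Z = H @ W                          # project node features
    qz = q @ Wq                        # project the question into the same space
    N = Z.shape[0]
    logits = np.full((N, N), -1e9)     # non-edges get a very negative score
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                # score edge (i, j) conditioned on the question
                logits[i, j] = np.concatenate([Z[i], Z[j], qz]) @ a
    alpha = softmax(logits)            # attention over each node's neighbours
    return np.tanh(alpha @ Z)          # question-aware updated node states
```

Because the question vector enters every edge score, the attention weights (and hence the updated node states) shift with the question, which is what lets the same scene graph support different reasoning paths for different questions.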

Metadata
Title: Research on Visual Question Answering Based on GAT Relational Reasoning
Authors: Yalin Miao, Wenfang Cheng, Shuyun He, Hui Jiang
Publication date: 13-01-2022
Publisher: Springer US
Published in: Neural Processing Letters / Issue 2/2022
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI: https://doi.org/10.1007/s11063-021-10689-2
