
25.11.2021

Multi-level, multi-modal interactions for visual question answering over text in images

Authors: Jincai Chen, Sheng Zhang, Jiangfeng Zeng, Fuhao Zou, Yuan-Fang Li, Tao Liu, Ping Lu

Published in: World Wide Web | Issue 4/2022


Abstract

The TextVQA task requires a simultaneous understanding of images, questions, and the text within images in order to reason about answers. However, most existing cross-modal tasks involve only two modalities, so few methods exist for modeling interactions across three modalities. To bridge this gap, we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model achieves a 5.42% improvement in accuracy over the baseline. Extensive ablation studies provide a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
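
To make the core mechanism concrete, below is a minimal PyTorch sketch of a guided scaled dot-product attention module between two modalities. This is an illustrative sketch, not the paper's implementation (see the repository linked above): in particular, the way the guidance vector is fused into the attention, here by adding a projected guidance term to the queries, is an assumption made for this example, as are all names and dimensions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedCrossModalAttention(nn.Module):
    """Single-head scaled dot-product attention from modality x to
    modality y, conditioned on a guidance vector (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.g_proj = nn.Linear(dim, dim)  # hypothetical guidance projection

    def forward(self, x, y, guide):
        # x:     (B, N, D) target-modality features, e.g. question tokens
        # y:     (B, M, D) source-modality features, e.g. OCR tokens or objects
        # guide: (B, D)    guidance vector biasing the relationship distribution
        q = self.q_proj(x) + self.g_proj(guide).unsqueeze(1)  # guided queries
        k, v = self.k_proj(y), self.v_proj(y)
        # scaled dot-product attention: softmax(Q K^T / sqrt(D)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return F.softmax(scores, dim=-1) @ v  # (B, N, D)

# Usage: 14 question tokens attend over 50 OCR tokens.
x, y, guide = torch.randn(2, 14, 768), torch.randn(2, 50, 768), torch.randn(2, 768)
out = GuidedCrossModalAttention(768)(x, y, guide)  # shape (2, 14, 768)

An intra-modal interaction is the special case y = x; the MLCI model described above would stack several such cross- and intra-modal modules per block across the question, visual-object, and OCR-token modalities.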


Metadata
Title
Multi-level, multi-modal interactions for visual question answering over text in images
Authors
Jincai Chen
Sheng Zhang
Jiangfeng Zeng
Fuhao Zou
Yuan-Fang Li
Tao Liu
Ping Lu
Publication date
25.11.2021
Publisher
Springer US
Published in
World Wide Web / Issue 4/2022
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-021-00976-2
