
25.11.2021

Multi-level, multi-modal interactions for visual question answering over text in images

Authors: Jincai Chen, Sheng Zhang, Jiangfeng Zeng, Fuhao Zou, Yuan-Fang Li, Tao Liu, Ping Lu

Published in: World Wide Web | Issue 4/2022


Abstract

The TextVQA task requires a simultaneous understanding of images, questions, and the text within images in order to reason about answers. However, most existing cross-modal tasks involve only two modalities, so few methods exist for modeling interactions across three modalities. To bridge this gap, we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model achieves a 5.42% improvement in accuracy over the baseline. Extensive ablation studies provide a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
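
To make the core mechanism concrete, below is a minimal PyTorch sketch of a guided scaled dot-product attention module between two modalities. This is an illustrative sketch, not the paper's implementation (see the repository linked above): in particular, the way the guidance vector is fused into the attention, here by adding a projected guidance term to the queries, is an assumption made for this example, as are all names and dimensions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedCrossModalAttention(nn.Module):
    """Single-head scaled dot-product attention from modality x to
    modality y, conditioned on a guidance vector (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.g_proj = nn.Linear(dim, dim)  # hypothetical guidance projection

    def forward(self, x, y, guide):
        # x:     (B, N, D) target-modality features, e.g. question tokens
        # y:     (B, M, D) source-modality features, e.g. OCR tokens or objects
        # guide: (B, D)    guidance vector biasing the relationship distribution
        q = self.q_proj(x) + self.g_proj(guide).unsqueeze(1)  # guided queries
        k, v = self.k_proj(y), self.v_proj(y)
        # scaled dot-product attention: softmax(Q K^T / sqrt(D)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        return F.softmax(scores, dim=-1) @ v  # (B, N, D)

# Usage: 14 question tokens attend over 50 OCR tokens.
x, y, guide = torch.randn(2, 14, 768), torch.randn(2, 50, 768), torch.randn(2, 768)
out = GuidedCrossModalAttention(768)(x, y, guide)  # shape (2, 14, 768)

An intra-modal interaction is the special case y = x; the MLCI model described above would stack several such cross- and intra-modal modules per block across the question, visual-object, and OCR-token modalities.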


Metadata
Title
Multi-level, multi-modal interactions for visual question answering over text in images
Authors
Jincai Chen
Sheng Zhang
Jiangfeng Zeng
Fuhao Zou
Yuan-Fang Li
Tao Liu
Ping Lu
Publication date
25.11.2021
Publisher
Springer US
Published in
World Wide Web / Issue 4/2022
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-021-00976-2
