Published in: Artificial Intelligence Review 8/2020

08.04.2020

Visual question answering: a state-of-the-art review

Authors: Sruthy Manmadhan, Binsu C. Kovoor


Abstract

Visual question answering (VQA) is a task that has received immense attention from two major research communities: computer vision and natural language processing. It is now widely regarded as an AI-complete task and has been proposed as an alternative to the Visual Turing Test. In its most common form, it is a challenging multi-modal task in which a computer must produce the correct answer to a natural language question asked about an input image. The task has attracted many deep learning researchers following their remarkable achievements in text, speech and vision technologies. This review extensively and critically examines the current state of VQA research in terms of step-by-step solution methodologies, datasets and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA separately.
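The task definition above — image in, natural language question in, answer out — can be sketched as a three-stage pipeline: encode each modality, fuse the two representations, and classify over a fixed answer vocabulary. The sketch below is purely illustrative; every component (the toy image and question "encoders", the answer vocabulary, the argmax "classifier") is a hypothetical stand-in for the CNN/RNN encoders and trained classifiers that real VQA systems use.

```python
# Illustrative sketch of the VQA pipeline: encode image, encode question,
# fuse the modalities, classify over a fixed answer vocabulary.
# All components are toy stand-ins, not a trained model.

from collections import Counter

ANSWER_VOCAB = ["yes", "no", "red", "two"]  # tiny illustrative answer set

def encode_image(image_pixels):
    # Stand-in for a CNN feature extractor: crude global pixel statistics.
    n = len(image_pixels)
    return [sum(image_pixels) / n, max(image_pixels), min(image_pixels)]

def encode_question(question):
    # Stand-in for an RNN / word-embedding encoder: bag-of-words counts
    # projected onto a fixed 3-dimensional "embedding".
    counts = Counter(question.lower().split())
    return [counts["what"], counts["is"], len(counts)]

def fuse(img_vec, q_vec):
    # Element-wise (Hadamard) product, a common simple multimodal fusion.
    return [i * q for i, q in zip(img_vec, q_vec)]

def answer(image_pixels, question):
    # "Classifier": map the argmax of the fused vector to an answer index.
    fused = fuse(encode_image(image_pixels), encode_question(question))
    best = max(range(len(fused)), key=lambda i: fused[i])
    return ANSWER_VOCAB[best % len(ANSWER_VOCAB)]
```

Real systems differ mainly in the sophistication of each stage — pretrained CNN image features, learned word embeddings, attention-based fusion, and a softmax over thousands of candidate answers — but the overall structure matches this sketch.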


Zurück zum Zitat Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:​1409.​1556
Zurück zum Zitat Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp 3104–3112 Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. pp 3104–3112
Zurück zum Zitat Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1–9 Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 1–9
Zurück zum Zitat Teney D, Hengel AV (2018) Visual question answering as a meta learning task. In: Computer vision—ECCV 2018 lecture notes in computer science. 229–245 Teney D, Hengel AV (2018) Visual question answering as a meta learning task. In: Computer vision—ECCV 2018 lecture notes in computer science. 229–245
Zurück zum Zitat Tommasi T, Mallya A, Plummer B, Lazebnik S, Berg AC, Berg TL (2019) Combining multiple cues for visual madlibs question answering. Int J Comput Vis 127(1):38–60CrossRef Tommasi T, Mallya A, Plummer B, Lazebnik S, Berg AC, Berg TL (2019) Combining multiple cues for visual madlibs question answering. Int J Comput Vis 127(1):38–60CrossRef
Zurück zum Zitat Toor AS, Wechsler H, Nappi M (2019) Question action relevance and editing for visual question answering. Multimedia Tools Appl 78(3):2921–2935CrossRef Toor AS, Wechsler H, Nappi M (2019) Question action relevance and editing for visual question answering. Multimedia Tools Appl 78(3):2921–2935CrossRef
Zurück zum Zitat Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. In: Advances in neural in-formation processing systems. pp 5998–6008 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. In: Advances in neural in-formation processing systems. pp 5998–6008
Zurück zum Zitat Wang P, Wu Q, Shen C, Hengel AVD, Dick A (2015) Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570 Wang P, Wu Q, Shen C, Hengel AVD, Dick A (2015) Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:​1511.​02570
Zurück zum Zitat Wang P, Wu Q, Shen C, Dick A, van den Hengel A (2018) Fvqa: fact-based visual question answering. IEEE Trans Pattern Anal Mach Intell 40(10):2413–2427CrossRef Wang P, Wu Q, Shen C, Dick A, van den Hengel A (2018) Fvqa: fact-based visual question answering. IEEE Trans Pattern Anal Mach Intell 40(10):2413–2427CrossRef
Zurück zum Zitat Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 133–138 Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on association for computational linguistics. Association for Computational Linguistics, pp 133–138
Zurück zum Zitat Wu Q, Wang P, Shen C, Dick A, van den Hengel A (2016). Ask me any-thing: Free-form visual question answering based on knowledge from exter-nal sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp 4622-4630) Wu Q, Wang P, Shen C, Dick A, van den Hengel A (2016). Ask me any-thing: Free-form visual question answering based on knowledge from exter-nal sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp 4622-4630)
Zurück zum Zitat Wu Q, Shen C, Wang P, Dick A, van den Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367–1381CrossRef Wu Q, Shen C, Wang P, Dick A, van den Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367–1381CrossRef
Zurück zum Zitat Xu W, Rudnicky A (2000) Can artificial neural networks learn language models?. In: sixth international conference on spoken language processing Xu W, Rudnicky A (2000) Can artificial neural networks learn language models?. In: sixth international conference on spoken language processing
Zurück zum Zitat Xu H, Saenko K (2016) Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In: European conference on computer vision. Springer, Cham, pp 451–466 Xu H, Saenko K (2016) Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In: European conference on computer vision. Springer, Cham, pp 451–466
Zurück zum Zitat Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 21–29 Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 21–29
Zurück zum Zitat Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learn-ing based natural language processing. IEEE Comput Intell Mag 13(3):55–75CrossRef Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learn-ing based natural language processing. IEEE Comput Intell Mag 13(3):55–75CrossRef
Zurück zum Zitat Yu L, Park E, Berg AC, Berg TL (2015) Visual madlibs: fill in the blank description generation and question answering. In: Proceedings of the ieee international conference on computer vision. pp 2461–2469 Yu L, Park E, Berg AC, Berg TL (2015) Visual madlibs: fill in the blank description generation and question answering. In: Proceedings of the ieee international conference on computer vision. pp 2461–2469
Zurück zum Zitat Yu D, Fu J, Mei T, Rui Y (2017) Multi-level attention networks for visual question answering. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 4187–4195 Yu D, Fu J, Mei T, Rui Y (2017) Multi-level attention networks for visual question answering. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 4187–4195
Zurück zum Zitat Yu D, Gao X, Xiong H (2018a) Structured semantic representation for visual question answering. In: 2018 25th IEEE international conference on image processing (ICIP). IEEE, pp 2286–2290 Yu D, Gao X, Xiong H (2018a) Structured semantic representation for visual question answering. In: 2018 25th IEEE international conference on image processing (ICIP). IEEE, pp 2286–2290
Zurück zum Zitat Yu Z, Yu J, Xiang C, Fan J, Tao D (2018b) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959CrossRef Yu Z, Yu J, Xiang C, Fan J, Tao D (2018b) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959CrossRef
Zurück zum Zitat Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6281–6290 Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6281–6290
Zurück zum Zitat Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer, Cham, pp 818–833 Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer, Cham, pp 818–833
Zurück zum Zitat Zhang P, Goyal Y, Summers-Stay D, Batra D, Parikh D (2016) Yin and yang: balancing and answering binary visual questions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5014–5022 Zhang P, Goyal Y, Summers-Stay D, Batra D, Parikh D (2016) Yin and yang: balancing and answering binary visual questions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 5014–5022
Zurück zum Zitat Zhao W, Peng H, Eger S, Cambria E, Yang M (2019) Towards scalable and reliable capsule networks for challenging NLP applications. arXiv preprint arXiv:1906.02829 Zhao W, Peng H, Eger S, Cambria E, Yang M (2019) Towards scalable and reliable capsule networks for challenging NLP applications. arXiv preprint arXiv:​1906.​02829
Zurück zum Zitat Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167 Zhou B, Tian Y, Sukhbaatar S, Szlam A, Fergus R (2015) Simple baseline for visual question answering. arXiv preprint arXiv:​1512.​02167
Zurück zum Zitat Zhu Y, Zhang C, Ré C, Fei-Fei L (2015) Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:1507.05670 Zhu Y, Zhang C, Ré C, Fei-Fei L (2015) Building a large-scale multimodal knowledge base system for answering visual queries. arXiv preprint arXiv:​1507.​05670
Zurück zum Zitat Zhu Y, Groth O, Bernstein M, Fei-Fei L (2016) Visual7w: Grounded question answering in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 4995–5004 Zhu Y, Groth O, Bernstein M, Fei-Fei L (2016) Visual7w: Grounded question answering in images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 4995–5004
Metadata
Title: Visual question answering: a state-of-the-art review
Authors: Sruthy Manmadhan, Binsu C. Kovoor
Publication date: 08.04.2020
Publisher: Springer Netherlands
Published in: Artificial Intelligence Review, Issue 8/2020
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI: https://doi.org/10.1007/s10462-020-09832-7
