2021 | OriginalPaper | Chapter

Recent Advances in Video Question Answering: A Review of Datasets and Methods

Authors : Devshree Patel, Ratnam Parikh, Yesha Shastri

Published in: Pattern Recognition. ICPR International Workshops and Challenges

Publisher: Springer International Publishing


Abstract

Video Question Answering (VQA) is a recently emerging and challenging task in the field of Computer Vision. Several visual information retrieval techniques, such as Video Captioning/Description and Video-guided Machine Translation, preceded the task of VQA. VQA retrieves temporal and spatial information from video scenes and interprets it. In this survey, we review a number of methods and datasets for the task of VQA. To the best of our knowledge, no previous survey has been conducted for the VQA task.
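To make the task framing concrete, here is a toy sketch (not taken from the chapter) of how a VideoQA sample might be represented, paired with a majority-answer baseline of the kind often used as a sanity check on such benchmarks. The `VideoQASample` structure and the per-frame string annotations are illustrative assumptions standing in for real video features.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class VideoQASample:
    frames: list   # toy per-frame annotations standing in for visual features
    question: str
    answer: str    # ground-truth answer string

def majority_baseline(samples):
    """Answer every question with the most frequent training answer —
    a trivial reference point that stronger VideoQA models must beat."""
    counts = Counter(s.answer for s in samples)
    return counts.most_common(1)[0][0]

train = [
    VideoQASample(["man", "guitar"], "What is the man holding?", "guitar"),
    VideoQASample(["dog", "ball"], "What is the dog chasing?", "ball"),
    VideoQASample(["man", "guitar"], "What instrument is being played?", "guitar"),
]
print(majority_baseline(train))  # prints "guitar"
```

Answering a question here requires both spatial grounding (what appears in a frame) and temporal grounding (how it changes across frames), which is what distinguishes VideoQA from image-based question answering.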


Metadata
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-68790-8_27
