2021 | OriginalPaper | Chapter

Recent Advances in Video Question Answering: A Review of Datasets and Methods

Authors : Devshree Patel, Ratnam Parikh, Yesha Shastri

Published in: Pattern Recognition. ICPR International Workshops and Challenges

Publisher: Springer International Publishing


Abstract

Video Question Answering (VQA) is a recently emerging and challenging task in the field of Computer Vision. Several visual information retrieval techniques, such as Video Captioning/Description and Video-guided Machine Translation, preceded the task of VQA. VQA retrieves temporal and spatial information from video scenes and interprets it. In this survey, we review a number of methods and datasets for the task of VQA. To the best of our knowledge, no previous survey has been conducted for the VQA task.
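To make the task framing concrete, here is a toy sketch (not taken from the chapter) of how a VideoQA sample might be represented, paired with a majority-answer baseline of the kind often used as a sanity check on such benchmarks. The `VideoQASample` structure and the per-frame string annotations are illustrative assumptions standing in for real video features.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class VideoQASample:
    frames: list   # toy per-frame annotations standing in for visual features
    question: str
    answer: str    # ground-truth answer string

def majority_baseline(samples):
    """Answer every question with the most frequent training answer —
    a trivial reference point that stronger VideoQA models must beat."""
    counts = Counter(s.answer for s in samples)
    return counts.most_common(1)[0][0]

train = [
    VideoQASample(["man", "guitar"], "What is the man holding?", "guitar"),
    VideoQASample(["dog", "ball"], "What is the dog chasing?", "ball"),
    VideoQASample(["man", "guitar"], "What instrument is being played?", "guitar"),
]
print(majority_baseline(train))  # prints "guitar"
```

Answering a question here requires both spatial grounding (what appears in a frame) and temporal grounding (how it changes across frames), which is what distinguishes VideoQA from image-based question answering.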


Metadata
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-68790-8_27
