Published in: Neural Computing and Applications 14/2024

27.02.2024 | Original Article

Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism

Authors: Yuanmao Luo, Ruomei Wang, Fuwei Zhang, Fan Zhou, Mingyang Liu, Jiawei Feng

Abstract

Multi-modal attention learning in video question answering (VideoQA) is challenging, as it requires recognizing information within each modality and modeling the interaction and fusion of information across modalities. Existing methods employ a cross-attention mechanism to compute feature similarity between modalities and thereby aggregate relevant information in a shared space. However, heterogeneous features follow different distributions in that shared space, making direct semantic matching difficult and degrading the similarity computation. To address this issue, this paper proposes a novel enhanced cross-modal attention mechanism (ECAM) that pre-fuses the two modalities to generate an enhanced key carrying feature importance distributions, which effectively resolves the semantic mismatch. Compared with the existing cross-attention mechanism, ECAM matches semantics across modalities more accurately and attends more strongly to the relevant feature regions. In the multi-modal fusion phase, a two-stage fusion strategy is proposed that exploits the advantages of two fusion methods to deeply explore the complex and diverse dependencies among multi-modal features. Built on these two newly designed modules, we propose a VideoQA solution based on two-stage deep exploration of temporally-evolving features with an enhanced cross-modal attention mechanism, capable of tackling challenging semantic understanding and question-answering tasks. Extensive experiments on four VideoQA datasets show that the new approach attains superior results compared with state-of-the-art peer methods. Moreover, experiments on recent joint-task datasets demonstrate that ECAM is a general mechanism that can be readily adapted to other visual-linguistic tasks.
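
The abstract describes ECAM only at a high level. As an illustration, and not the authors' implementation, the minimal PyTorch sketch below shows one way an "enhanced key" could be formed: a pooled summary of the query modality is pre-fused with the context features to produce a feature importance distribution that gates the key before standard scaled dot-product attention. All names (EnhancedCrossModalAttention, fuse, importance) and the specific pooling and gating choices are assumptions made for this sketch, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnhancedCrossModalAttention(nn.Module):
    """Hypothetical sketch of enhanced cross-modal attention: the key of the
    attended modality is pre-fused with the query modality so similarity is
    computed against a key that already carries a feature importance
    distribution shared by both modalities (assumption-based, not the paper's code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Pre-fusion gate that mixes the two modalities before the key is formed.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, Lq, D) e.g. question token features
        # context_feats: (B, Lk, D) e.g. video frame/clip features
        q = self.q_proj(query_feats)                        # (B, Lq, D)
        v = self.v_proj(context_feats)                      # (B, Lk, D)

        # Pre-fuse: broadcast a pooled summary of the query modality into the
        # context features, then derive a feature importance distribution.
        q_summary = query_feats.mean(dim=1, keepdim=True)   # (B, 1, D)
        fused = torch.cat(
            [context_feats, q_summary.expand_as(context_feats)], dim=-1
        )                                                   # (B, Lk, 2D)
        importance = torch.sigmoid(self.fuse(fused))        # (B, Lk, D)

        # Enhanced key: context features gated by the importance distribution.
        k = self.k_proj(context_feats * importance)         # (B, Lk, D)

        # Standard scaled dot-product attention using the enhanced key.
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)                    # (B, Lq, Lk)
        return torch.matmul(attn, v)                        # (B, Lq, D)
```

The pre-fusion gate is what distinguishes this sketch from vanilla cross-attention: the key already reflects which context features matter for the given question, so the similarity computation operates on more closely aligned representations. The paper's two-stage fusion strategy is not shown here.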


Metadata
Title
Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism
Authors
Yuanmao Luo
Ruomei Wang
Fuwei Zhang
Fan Zhou
Mingyang Liu
Jiawei Feng
Publication date
27.02.2024
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 14/2024
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-024-09482-8
