Published in: Neural Computing and Applications 14/2024

27.02.2024 | Original Article

Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism

Authors: Yuanmao Luo, Ruomei Wang, Fuwei Zhang, Fan Zhou, Mingyang Liu, Jiawei Feng

Abstract

Multi-modal attention learning in video question answering (VideoQA) is challenging, as it requires recognizing information within each modality and modeling the interaction and fusion of information across modalities. Existing methods employ a cross-attention mechanism to compute feature similarity between modalities and thereby aggregate relevant information in a shared space. However, heterogeneous features follow different distributions in that shared space, making direct semantic matching difficult and degrading the similarity computation. To address this issue, this paper proposes a novel enhanced cross-modal attention mechanism (ECAM) that pre-fuses the two modalities to generate an enhanced key carrying feature importance distributions, which effectively resolves the semantic mismatch. Compared with the existing cross-attention mechanism, ECAM matches semantics across modalities more accurately and attends more strongly to the relevant feature regions. In the multi-modal fusion phase, a two-stage fusion strategy is proposed that exploits the advantages of two fusion methods to deeply explore the complex and diverse dependencies among multi-modal features. Built on these two newly designed modules, we propose a VideoQA solution based on two-stage deep exploration of temporally-evolving features with an enhanced cross-modal attention mechanism, capable of tackling challenging semantic understanding and question-answering tasks. Extensive experiments on four VideoQA datasets show that the new approach attains superior results compared with state-of-the-art peer methods. Moreover, experiments on recent joint-task datasets demonstrate that ECAM is a general mechanism that can be readily adapted to other visual-linguistic tasks.
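
The abstract describes ECAM only at a high level. As an illustration, and not the authors' implementation, the minimal PyTorch sketch below shows one way an "enhanced key" could be formed: a pooled summary of the query modality is pre-fused with the context features to produce a feature importance distribution that gates the key before standard scaled dot-product attention. All names (EnhancedCrossModalAttention, fuse, importance) and the specific pooling and gating choices are assumptions made for this sketch, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnhancedCrossModalAttention(nn.Module):
    """Hypothetical sketch of enhanced cross-modal attention: the key of the
    attended modality is pre-fused with the query modality so similarity is
    computed against a key that already carries a feature importance
    distribution shared by both modalities (assumption-based, not the paper's code)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Pre-fusion gate that mixes the two modalities before the key is formed.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, Lq, D) e.g. question token features
        # context_feats: (B, Lk, D) e.g. video frame/clip features
        q = self.q_proj(query_feats)                        # (B, Lq, D)
        v = self.v_proj(context_feats)                      # (B, Lk, D)

        # Pre-fuse: broadcast a pooled summary of the query modality into the
        # context features, then derive a feature importance distribution.
        q_summary = query_feats.mean(dim=1, keepdim=True)   # (B, 1, D)
        fused = torch.cat(
            [context_feats, q_summary.expand_as(context_feats)], dim=-1
        )                                                   # (B, Lk, 2D)
        importance = torch.sigmoid(self.fuse(fused))        # (B, Lk, D)

        # Enhanced key: context features gated by the importance distribution.
        k = self.k_proj(context_feats * importance)         # (B, Lk, D)

        # Standard scaled dot-product attention using the enhanced key.
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)                    # (B, Lq, Lk)
        return torch.matmul(attn, v)                        # (B, Lq, D)
```

The pre-fusion gate is what distinguishes this sketch from vanilla cross-attention: the key already reflects which context features matter for the given question, so the similarity computation operates on more closely aligned representations. The paper's two-stage fusion strategy is not shown here.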


Metadata
Title
Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism
Authors
Yuanmao Luo
Ruomei Wang
Fuwei Zhang
Fan Zhou
Mingyang Liu
Jiawei Feng
Publication date
27.02.2024
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 14/2024
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-024-09482-8
