Published in: Neural Processing Letters 8/2023

25-08-2023

Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion

Authors: Shuqin Chen, Li Yang, Yikang Hu

Abstract

Video captioning has become a research hotspot in recent years owing to its wide range of potential applications. However, recognition errors often appear in the generated descriptions because visual and textual features interact insufficiently during encoding, and conventional attention mechanisms struggle to explicitly model the coherence between vision and language. In this paper, we propose CAVF (Cascaded Attention-guided Visual Feature Fusion), a video captioning algorithm built on cascaded attention-guided fusion of visual features. In the encoding stage, a cascaded attention mechanism models the correlation of visual content across frames, and global semantic information guides the fusion of visual features, further strengthening the correlation between the encoder's visual features and the decoder. In the decoding stage, the fused global features and word vectors are passed through a multilayer long short-term memory (LSTM) network to generate the current word. Experiments on the public MSVD and MSR-VTT datasets validate the effectiveness of the proposed model; compared with the baseline method on MSR-VTT, it improves the BLEU-4, ROUGE, and CIDEr metrics by 5.6%, 1.3%, and 4.3%, respectively.
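To make the two-stage design concrete, the sketch below shows one way a cascaded attention-guided fusion block could be written in PyTorch. It is an illustrative reconstruction based only on the abstract, not the authors' released code: the module name `CascadedAttentionFusion`, the use of `nn.MultiheadAttention` for the inter-frame stage, and the mean-pooled global semantic vector are all assumptions.

```python
import torch
import torch.nn as nn


class CascadedAttentionFusion(nn.Module):
    """Illustrative cascaded attention block: stage 1 models inter-frame
    correlation via self-attention; stage 2 lets a global semantic vector
    re-weight and fuse the frame features (names and shapes are assumed)."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        # Stage 1: self-attention over per-frame CNN features
        # (feat_dim must be divisible by num_heads).
        self.frame_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Stage 2: a global semantic vector attends over the stage-1 outputs.
        self.query_proj = nn.Linear(feat_dim, hidden_dim)
        self.key_proj = nn.Linear(feat_dim, hidden_dim)
        self.scale = hidden_dim ** -0.5

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim), e.g. ResNet features per frame.
        attended, _ = self.frame_attn(frames, frames, frames)  # inter-frame correlation
        global_sem = attended.mean(dim=1, keepdim=True)        # crude global semantic cue
        q = self.query_proj(global_sem)                        # (batch, 1, hidden_dim)
        k = self.key_proj(attended)                            # (batch, T, hidden_dim)
        weights = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        fused = (weights @ attended).squeeze(1)                # (batch, feat_dim)
        return fused  # fed, together with word embeddings, to the LSTM decoder


# Minimal usage check with assumed feature sizes (2048-d frame features).
fusion = CascadedAttentionFusion(feat_dim=2048, hidden_dim=512)
video = torch.randn(2, 26, 2048)   # batch of 2 clips, 26 sampled frames each
print(fusion(video).shape)         # torch.Size([2, 2048])
```

In the paper's decoding stage, the fused feature and the previous word's embedding would drive a multilayer LSTM at each time step; the exact fusion and gating details should be taken from the paper itself.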


Metadata
Title
Video Captioning Based on Cascaded Attention-Guided Visual Feature Fusion
Authors
Shuqin Chen
Li Yang
Yikang Hu
Publication date
25-08-2023
Publisher
Springer US
Published in
Neural Processing Letters / Issue 8/2023
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-023-11386-y
