
15-07-2023 | Regular Paper

Multimodal-enhanced hierarchical attention network for video captioning

Authors: Maosheng Zhong, Youde Chen, Hao Zhang, Hao Xiong, Zhixiang Wang

Published in: Multimedia Systems | Issue 5/2023

Abstract

In video captioning, many pioneering approaches generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, the negative interaction between them gradually reduces the gain in caption generation. To address this problem, we propose a three-layer hierarchical attention network based on a bidirectional decoding transformer that enhances multimodal features. In the first layer, we apply a different encoder to each modality, chosen according to its characteristics, to enhance that modality's vector representation. In the second layer, we select keyframes from all sampled frames of a modality by computing the attention value between the generated words and each frame. In the third layer, we allocate weights across modalities to reduce the redundancy between them before generating the current word. Additionally, we use a bidirectional decoder to consider the context of the ground-truth caption when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed model: it achieves state-of-the-art performance on key metrics, and the generated sentences are better aligned with human language habits. Overall, our three-layer hierarchical attention network based on a bidirectional decoding transformer effectively enhances multimodal features and generates high-quality video captions. Code is available at https://github.com/nickchen121/MHAN.
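The authors' released implementation is available at the GitHub link above; the sketch below is only a minimal, hedged illustration of the frame-level and modality-level attention stages described in the abstract (the second and third layers). It assumes a PyTorch setting with additive attention; all class names, dimensions, and the exact scoring functions are illustrative assumptions, not the paper's specification.

import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Scores each sampled frame of one modality against the decoder query
    (the state of the word currently being generated)."""

    def __init__(self, dim):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, query, frames):
        # query:  (batch, dim); frames: (batch, n_frames, dim)
        energy = self.score(torch.tanh(
            self.proj_q(query).unsqueeze(1) + self.proj_k(frames)))  # (B, T, 1)
        weights = torch.softmax(energy, dim=1)                       # keyframe weights
        return (weights * frames).sum(dim=1)                         # (B, dim)


class ModalityAttention(nn.Module):
    """Re-weights the per-modality context vectors before fusing them,
    so redundant modalities contribute less to the current word."""

    def __init__(self, dim, n_modalities):
        super().__init__()
        self.temporal = nn.ModuleList(
            [TemporalAttention(dim) for _ in range(n_modalities)])
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, query, modalities):
        # modalities: list of (batch, n_frames, dim) tensors, one per modality
        contexts = torch.stack(
            [att(query, feats) for att, feats in zip(self.temporal, modalities)],
            dim=1)                                                    # (B, M, dim)
        q = query.unsqueeze(1).expand_as(contexts)
        weights = torch.softmax(self.score(torch.cat([q, contexts], dim=-1)), dim=1)
        return (weights * contexts).sum(dim=1)                       # fused context (B, dim)


if __name__ == "__main__":
    fusion = ModalityAttention(dim=512, n_modalities=3)
    query = torch.randn(2, 512)                          # decoder state for the current word
    feats = [torch.randn(2, 20, 512) for _ in range(3)]  # three encoded modalities
    print(fusion(query, feats).shape)                    # torch.Size([2, 512])

The fused context vector would then feed the decoder at each step; the bidirectional decoding and the modality-specific encoders of the first layer are omitted here for brevity.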


Metadata
Title
Multimodal-enhanced hierarchical attention network for video captioning
Authors
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
Publication date
15-07-2023
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01130-w
