Published in: Multimedia Systems 5/2023

15.07.2023 | Regular Paper

Multimodal-enhanced hierarchical attention network for video captioning

Authors: Maosheng Zhong, Youde Chen, Hao Zhang, Hao Xiong, Zhixiang Wang

Abstract

In video captioning, many pioneering approaches have been developed to generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, the negative interaction between them gradually reduces the gain in caption generation. To address this problem, we propose a three-layer hierarchical attention network based on a bidirectional decoding transformer that enhances multimodal features. In the first layer, we apply a different encoder to each modality, according to its characteristics, to enhance its vector representation. In the second layer, we select keyframes from all sampled frames of a modality by computing the attention value between the generated words and each frame of that modality. In the third layer, we allocate weights to the different modalities to reduce redundancy between them before generating the current word. Additionally, we use a bidirectional decoder that considers the context of the ground-truth caption when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed model: it achieves state-of-the-art performance on significant metrics, and the generated sentences are more consistent with natural human phrasing. Overall, our three-layer hierarchical attention network based on a bidirectional decoding transformer effectively enhances multimodal features and generates high-quality video captions. Code is available at https://github.com/nickchen121/MHAN.
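To make the three-layer idea in the abstract concrete, the sketch below shows one minimal way such a hierarchy could be wired up in PyTorch: per-modality encoders (layer 1), word-to-frame attention for keyframe selection within each modality (layer 2), and modality-level weighting before word generation (layer 3). This is an illustrative assumption of the structure described above, not the authors' released implementation; all class, method, and variable names are hypothetical.

```python
# Minimal sketch of the three-layer hierarchical attention described in the abstract.
# Not the authors' code; shapes and module choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttention(nn.Module):
    def __init__(self, d_model: int, num_modalities: int):
        super().__init__()
        # Layer 1: one encoder per modality to enhance its frame-level features.
        self.modality_encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=1,
            )
            for _ in range(num_modalities)
        )
        # Layer 2: attention between the word being generated and each frame.
        self.frame_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Layer 3: scores one weight per modality, conditioned on the decoder state.
        self.modality_score = nn.Linear(2 * d_model, 1)

    def forward(self, modality_feats, word_state):
        """
        modality_feats: list of (batch, frames_m, d_model) tensors, one per modality
        word_state:     (batch, d_model) state of the word currently being generated
        returns:        (batch, d_model) fused multimodal context vector
        """
        query = word_state.unsqueeze(1)                          # (batch, 1, d_model)
        contexts = []
        for enc, feats in zip(self.modality_encoders, modality_feats):
            enhanced = enc(feats)                                # layer 1: enhance modality
            ctx, _ = self.frame_attn(query, enhanced, enhanced)  # layer 2: keyframe attention
            contexts.append(ctx.squeeze(1))                      # (batch, d_model)
        stacked = torch.stack(contexts, dim=1)                   # (batch, M, d_model)
        # Layer 3: weight modalities to reduce redundancy before generating the word.
        scores = self.modality_score(
            torch.cat([stacked, query.expand_as(stacked)], dim=-1)
        )                                                        # (batch, M, 1)
        weights = F.softmax(scores, dim=1)
        return (weights * stacked).sum(dim=1)                    # (batch, d_model)


if __name__ == "__main__":
    # Toy usage: two modalities (e.g., appearance and motion) with different frame counts.
    fuser = HierarchicalAttention(d_model=512, num_modalities=2)
    feats = [torch.randn(2, 26, 512), torch.randn(2, 16, 512)]
    word = torch.randn(2, 512)
    print(fuser(feats, word).shape)  # torch.Size([2, 512])
```

In a full captioning model, the fused context would feed a transformer decoder at every step; the bidirectional decoding mentioned in the abstract would additionally run a backward decoder over the ground-truth caption during training, which is omitted here for brevity.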


Metadata
Title
Multimodal-enhanced hierarchical attention network for video captioning
Authors
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
Publication date
15.07.2023
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01130-w
