Published in: Multimedia Systems 6/2019

10-11-2018 | Special Issue Paper

Multi-guiding long short-term memory for video captioning

Authors: Ning Xu, An-An Liu, Weizhi Nie, Yuting Su



Abstract

Recently, research interest has turned to using recurrent neural networks (RNNs) as decoders in the video captioning task. However, the generated sentence tends to “lose track” of the video content because decoding follows fixed language rules. Although existing methods try to “guide” the decoder and keep it “on track”, they mainly rely on a single-modal feature, which does not fit the multi-modal (visual and semantic) and complementary (local and global) nature of video captioning. To this end, we propose the multi-guiding long short-term memory (mg-LSTM), an extension of the LSTM network for video captioning. We feed global information (i.e., detected attributes) and local information (i.e., appearance features) extracted from the video as extra input to each LSTM cell, with the aim of collaboratively guiding the model towards solutions that are more tightly coupled to the video content. In particular, the appearance and attribute features are first used to produce local and global guiders, respectively. We then propose a novel cell-wise ensemble, in which the weight matrix of each LSTM cell is extended to a set of attribute-dependent and attention-dependent weight matrices through which the guiders steer the optimization of each cell over time. Extensive experiments on three benchmark datasets (MSVD, MSR-VTT, and MPII-MD) show that our method achieves competitive results against the state of the art. Additional ablation studies are conducted on variants of the proposed mg-LSTM.
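To make the guiding idea concrete, the following is a minimal, illustrative Python/NumPy sketch of an LSTM cell whose gate pre-activations are additionally conditioned on a global guider (an attribute vector) and a local guider (an attended appearance feature). The class name, dimensions, and weight layout are assumptions for illustration only; the sketch does not reproduce the authors' exact attribute-dependent weight-matrix ensemble.

    # Illustrative sketch (not the authors' exact formulation): an LSTM cell whose
    # gates see, besides the word input x_t and previous hidden state h_{t-1},
    # a global guider g (detected attributes) and a local guider l (attended
    # appearance features). All names and sizes are hypothetical.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    class MultiGuidedLSTMCell:
        def __init__(self, x_dim, h_dim, g_dim, l_dim, seed=0):
            rng = np.random.default_rng(seed)
            s = 0.1
            # One weight set per gate: input (i), forget (f), output (o), candidate (c).
            self.W = {k: rng.normal(0, s, (h_dim, x_dim)) for k in "ifoc"}  # word input
            self.U = {k: rng.normal(0, s, (h_dim, h_dim)) for k in "ifoc"}  # recurrence
            self.G = {k: rng.normal(0, s, (h_dim, g_dim)) for k in "ifoc"}  # global guider
            self.L = {k: rng.normal(0, s, (h_dim, l_dim)) for k in "ifoc"}  # local guider
            self.b = {k: np.zeros(h_dim) for k in "ifoc"}

        def step(self, x_t, h_prev, c_prev, g, l):
            def pre(k):  # guided pre-activation for gate k
                return (self.W[k] @ x_t + self.U[k] @ h_prev
                        + self.G[k] @ g + self.L[k] @ l + self.b[k])
            i = sigmoid(pre("i"))           # input gate
            f = sigmoid(pre("f"))           # forget gate
            o = sigmoid(pre("o"))           # output gate
            c_tilde = np.tanh(pre("c"))     # candidate memory
            c = f * c_prev + i * c_tilde    # new memory cell
            h = o * np.tanh(c)              # new hidden state
            return h, c

    # Usage: one decoding step with toy vectors.
    cell = MultiGuidedLSTMCell(x_dim=8, h_dim=16, g_dim=10, l_dim=12)
    h, c = np.zeros(16), np.zeros(16)
    x_t = np.random.randn(8)    # current word embedding
    g = np.random.randn(10)     # global guider: attribute probabilities
    l = np.random.randn(12)     # local guider: attended appearance feature
    h, c = cell.step(x_t, h, c, g, l)

In the paper's cell-wise ensemble the guiders modulate the weight matrices themselves rather than entering as additive terms; the additive form above is only the simplest way to show both guiders reaching every gate at every time step.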


Metadata
Title
Multi-guiding long short-term memory for video captioning
Authors
Ning Xu
An-An Liu
Weizhi Nie
Yuting Su
Publication date
10-11-2018
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 6/2019
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-018-0598-5
