
02-03-2018

Exploiting long-term temporal dynamics for video captioning

Authors: Yuyu Guo, Jingqiu Zhang, Lianli Gao

Published in: World Wide Web | Issue 2/2019


Abstract

Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress on this problem has been achieved in two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in a video. Temporal attention-based models have made considerable progress by weighing the importance of each video frame. However, for a long video, especially one that consists of a set of sub-events, we should discover and leverage the importance of each sub-shot rather than each frame. In this paper, we propose a novel approach, namely the temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information and extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate the sequence of words that describes the video. Experimental results on two public video captioning benchmarks indicate that our TS-LSTM outperforms state-of-the-art methods.


Metadata

Title: Exploiting long-term temporal dynamics for video captioning
Authors: Yuyu Guo, Jingqiu Zhang, Lianli Gao
Publication date: 02-03-2018
Publisher: Springer US
Published in: World Wide Web | Issue 2/2019
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI: https://doi.org/10.1007/s11280-018-0530-0
