
02-03-2018

Exploiting long-term temporal dynamics for video captioning

Authors: Yuyu Guo, Jingqiu Zhang, Lianli Gao

Published in: World Wide Web | Issue 2/2019


Abstract

Automatically describing videos with natural language is a fundamental challenge for computer vision and natural language processing. Recently, progress on this problem has been achieved in two steps: 1) employing 2-D and/or 3-D Convolutional Neural Networks (CNNs) (e.g., VGG, ResNet, or C3D) to extract spatial and/or temporal features that encode video content; and 2) applying Recurrent Neural Networks (RNNs) to generate sentences describing the events in a video. Temporal attention-based models have made considerable progress by weighing the importance of each video frame. However, for a long video, especially one that consists of a set of sub-events, we should discover and leverage the importance of each sub-shot rather than each frame. In this paper, we propose a novel approach, namely the temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to incorporate both spatial and temporal information and extract long-term temporal dynamics within video sub-shots, and a stacked LSTM is introduced to generate the sequence of words that describes the video. Experimental results on two public video captioning benchmarks indicate that our TS-LSTM outperforms state-of-the-art methods.


Metadata

Title: Exploiting long-term temporal dynamics for video captioning
Authors: Yuyu Guo, Jingqiu Zhang, Lianli Gao
Publication date: 02-03-2018
Publisher: Springer US
Published in: World Wide Web | Issue 2/2019
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI: https://doi.org/10.1007/s11280-018-0530-0
