Published in: Neural Processing Letters 1/2017

25.01.2017

Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

By: Dashan Guo, Wei Li, Xiangzhong Fang

Abstract

To generate natural language descriptions for videos, there has been tremendous interest in developing deep neural networks that integrate temporal structures of different categories. Considering the spatial and temporal domains inherent in video frames, we contend that both video dynamics and spatio-temporal contexts are important for captioning, and that they correspond to two different temporal structures. However, while video dynamics are well investigated, spatio-temporal contexts have not received sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. First, for a comprehensive and detailed representation, we propose to aggregate local and global spatio-temporal contexts in the recurrent convolution networks. Second, to capture subtler temporal dynamics, a channel attention mechanism is introduced; it helps to gauge the involvement of the frame feature maps in the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on the YouTube2Text dataset show that the proposed method achieves performance competitive with other state-of-the-art methods.
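The abstract's channel attention mechanism weights the channels of a frame's CNN feature maps by their relevance to the decoding state. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of a generic channel-attention step under assumed shapes; the names `frame_features`, `hidden`, and `W_attn` are illustrative, not the authors' notation.

```python
import numpy as np

def channel_attention(frame_features, hidden, W_attn):
    """Reweight CNN feature-map channels by their relevance to the
    decoder state (a generic sketch, not the paper's exact model).

    frame_features : (C, H, W) feature maps of one video frame
    hidden         : (D,) recurrent decoder hidden state
    W_attn         : (C, D) learned projection mapping the hidden
                     state to one relevance score per channel
    """
    scores = W_attn @ hidden                         # (C,) per-channel scores
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over channels
    attended = weights[:, None, None] * frame_features
    return attended, weights
```

In a full captioning model, `attended` would replace the raw feature maps fed to the recurrent decoder at each time step, so that different channels dominate depending on the word being generated.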


Metadata
Title
Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism
Authors
Dashan Guo
Wei Li
Xiangzhong Fang
Publication date
25.01.2017
Publisher
Springer US
Published in
Neural Processing Letters / Issue 1/2017
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-017-9591-9
