Published in: Neural Processing Letters 1/2017

25.01.2017

Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism

By: Dashan Guo, Wei Li, Xiangzhong Fang

Abstract

To generate natural language descriptions for videos, there has been tremendous interest in developing deep neural networks that integrate temporal structures of different categories. Considering the spatial and temporal domains inherent in video frames, we contend that both video dynamics and spatio-temporal contexts are important for captioning, and that they correspond to two different temporal structures. However, while video dynamics are well investigated, spatio-temporal contexts have not received sufficient attention. In this paper, we take both structures into account and propose a novel recurrent convolution model for captioning. First, for a comprehensive and detailed representation, we propose to aggregate local and global spatio-temporal contexts in the recurrent convolution networks. Second, to capture subtler temporal dynamics, a channel attention mechanism is introduced; it helps to gauge the involvement of the frame feature maps in the captioning process. Finally, a qualitative comparison with several variants of our model demonstrates the effectiveness of incorporating these two structures. Moreover, experiments on the YouTube2Text dataset show that the proposed method achieves performance competitive with other state-of-the-art methods.
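The abstract's channel attention mechanism weights the channels of a frame's CNN feature maps by their relevance to the decoding state. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of a generic channel-attention step under assumed shapes; the names `frame_features`, `hidden`, and `W_attn` are illustrative, not the authors' notation.

```python
import numpy as np

def channel_attention(frame_features, hidden, W_attn):
    """Reweight CNN feature-map channels by their relevance to the
    decoder state (a generic sketch, not the paper's exact model).

    frame_features : (C, H, W) feature maps of one video frame
    hidden         : (D,) recurrent decoder hidden state
    W_attn         : (C, D) learned projection mapping the hidden
                     state to one relevance score per channel
    """
    scores = W_attn @ hidden                         # (C,) per-channel scores
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over channels
    attended = weights[:, None, None] * frame_features
    return attended, weights
```

In a full captioning model, `attended` would replace the raw feature maps fed to the recurrent decoder at each time step, so that different channels dominate depending on the word being generated.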


Metadata
Title
Capturing Temporal Structures for Video Captioning by Spatio-temporal Contexts and Channel Attention Mechanism
Authors
Dashan Guo
Wei Li
Xiangzhong Fang
Publication date
25.01.2017
Publisher
Springer US
Published in
Neural Processing Letters / Issue 1/2017
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-017-9591-9
