2018 | Original Paper | Book Chapter

Less Is More: Picking Informative Frames for Video Captioning

Authors: Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing


Abstract

In the video captioning task, the best results have been achieved by attention-based models that associate salient visual components with sentences describing the video. However, existing studies follow a common procedure of frame-level appearance and motion modeling over frames sampled at equal intervals, which may introduce redundant visual information, sensitivity to content noise, and unnecessary computational cost. We propose a plug-and-play PickNet that performs informative frame picking for video captioning. Based on a standard encoder-decoder framework, we develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame-picking action is designed to maximize visual diversity and minimize the discrepancy between the generated caption and the ground truth. The rewarded candidate is selected, and the corresponding latent representation of the encoder-decoder is updated for future trials. This procedure continues until the end of the video sequence. Consequently, a compact subset of frames can be selected to represent the visual information and to perform video captioning without performance degradation. Experimental results show that our model achieves competitive performance across popular benchmarks while using only 6–8 frames.
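To make the sequential pick-and-reward procedure concrete, here is a minimal sketch in Python/NumPy. It is not the authors' PickNet implementation: the coin-flip policy, the 2048-dimensional placeholder features, the pairwise-distance diversity term, the dummy language_reward, and the diversity_weight parameter are all illustrative assumptions standing in for the learned picking network, the CNN features, and the caption-quality metric used in the paper.

```python
# Minimal sketch of the frame-picking loop described in the abstract.
# NOT the authors' code: policy, features, and reward terms are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def visual_diversity(picked_feats):
    """Reward term encouraging picked frames to differ from each other,
    approximated here by the mean pairwise Euclidean distance."""
    if len(picked_feats) < 2:
        return 0.0
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(picked_feats)
             for b in picked_feats[i + 1:]]
    return float(np.mean(dists))

def language_reward(picked_feats, ground_truth):
    """Placeholder for the caption-quality term. In the paper this would be a
    sentence-level metric between the caption generated from the picked frames
    and the ground truth; here it is a dummy score."""
    return rng.random()  # stand-in only

def pick_frames(frame_feats, ground_truth, max_picks=8, diversity_weight=0.5):
    """Decide frame by frame whether to keep it. A real PickNet would make this
    decision with a learned policy trained by policy gradients; here the
    'policy' is a fair coin flip so the loop runs without trained weights."""
    picked, rewards = [], []
    for feat in frame_feats:
        pick_prob = 0.5                          # placeholder policy output
        if rng.random() < pick_prob and len(picked) < max_picks:
            candidate = picked + [feat]
            r = (diversity_weight * visual_diversity(candidate)
                 + (1 - diversity_weight) * language_reward(candidate, ground_truth))
            picked = candidate
            rewards.append(r)                    # would drive the policy update
    return picked, rewards

if __name__ == "__main__":
    # 30 frames of 2048-d appearance features (e.g. from a CNN), as an example.
    frames = rng.normal(size=(30, 2048))
    subset, rs = pick_frames(frames, ground_truth="a man is playing guitar")
    print(f"picked {len(subset)} of {len(frames)} frames;",
          f"final reward = {rs[-1]:.3f}" if rs else "no frames picked")
```

In an actual training loop, the recorded rewards would drive a policy-gradient (REINFORCE-style) update of the picking network, so that over time it keeps only the 6–8 frames most useful for captioning.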

Metadata
Title
Less Is More: Picking Informative Frames for Video Captioning
Authors
Yangyu Chen
Shuhui Wang
Weigang Zhang
Qingming Huang
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01261-8_22
