Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

Published: 16 October 2019

Abstract

Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, assistance for the visually impaired, and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, creating the need for a comprehensive survey to focus research efforts in this flourishing new direction. This article fills the gap by surveying state-of-the-art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository size; and identifying the pros and cons of various evaluation metrics, such as SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video description approaches combined subject, object, and verb detection with template-based language models to generate sentences. However, the release of large datasets revealed that these methods cannot cope with the diversity of unconstrained, open-domain videos. Classical approaches were followed by a very short era of statistical methods that were soon replaced by deep learning, the current state of the art in video description. Our survey shows that, despite the fast-paced developments, video description research is still in its infancy for the following reasons. Analysis of video description models is challenging, because it is difficult to ascertain how much the visual features and the adopted language model each contribute to the accuracy of, or the errors in, the final description. Existing datasets contain neither adequate visual diversity nor sufficiently complex linguistic structures. Finally, current evaluation metrics fall short of measuring the agreement between machine-generated descriptions and those written by humans. We conclude our survey by listing promising directions for future research.
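
Deep video description models are typically encoder-decoder architectures: per-frame CNN features are aggregated into a video representation that conditions a recurrent language decoder, which emits the description word by word. The PyTorch snippet below is a deliberately minimal sketch of that pattern (mean pooling over frames plus a single-layer LSTM decoder); the feature dimensions, vocabulary size, and random inputs are illustrative assumptions, not settings taken from any particular surveyed model.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Toy encoder-decoder captioner: mean-pool per-frame CNN features,
    then decode words with an LSTM conditioned on the pooled video vector."""

    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, embed=300):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)      # project pooled video features
        self.embed = nn.Embedding(vocab_size, embed)   # word embeddings
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)       # per-step word logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) word ids
        video = torch.tanh(self.encode(frame_feats.mean(dim=1)))  # (batch, hidden)
        h0 = video.unsqueeze(0)                                   # seed decoder with video
        c0 = torch.zeros_like(h0)
        words = self.embed(captions)                               # (batch, seq_len, embed)
        states, _ = self.decoder(words, (h0, c0))
        return self.out(states)                                    # (batch, seq_len, vocab)

# Toy forward pass with random frame features and word ids (illustrative only).
model = VideoCaptioner()
feats = torch.randn(2, 30, 2048)          # 2 clips, 30 frames of 2048-d CNN features
caps = torch.randint(0, 10000, (2, 12))   # 2 captions of 12 word ids each
print(model(feats, caps).shape)           # torch.Size([2, 12, 10000])
```

To make the shortcomings of n-gram-based metrics concrete, the second sketch computes a simplified, single-reference BLEU score from scratch (clipped n-gram precisions combined by a geometric mean with a brevity penalty). It is an illustration rather than the reference implementation used in the surveyed evaluations, and the example captions are invented; it shows how a semantically correct paraphrase can score near zero simply because it shares few n-grams with the reference, which is the metric weakness noted above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference (clipped counts)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n clipped precisions, scaled by a brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # any empty precision collapses the geometric mean
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_mean)

reference  = "a man is slicing an onion in the kitchen".split()
exact      = "a man is slicing an onion in the kitchen".split()
paraphrase = "someone chops an onion while cooking".split()

print(bleu(exact, reference))       # 1.0 -- identical wording
print(bleu(paraphrase, reference))  # 0.0 -- same meaning, little n-gram overlap
```

Metrics such as METEOR (which adds stemming and synonym matching) and WMD (which compares word embeddings) partially address this mismatch; the survey's metric comparison weighs these trade-offs.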

Published in

ACM Computing Surveys, Volume 52, Issue 6 (November 2020), 806 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3368196
Editor: Sartaj Sahni

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 March 2019
• Accepted: 1 August 2019
• Published: 16 October 2019
