Abstract
Video description is the automatic generation of natural language sentences that describe the contents of a given video. It has applications in human-robot interaction, assisting the visually impaired, and video subtitling. The past few years have seen a surge of research in this area due to the unprecedented success of deep learning in computer vision and natural language processing. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, creating the need for a comprehensive survey to focus research efforts in this flourishing new direction. This article fills that gap by surveying state-of-the-art approaches with a focus on deep learning models; comparing benchmark datasets in terms of their domains, number of classes, and repository sizes; and identifying the pros and cons of various evaluation metrics, such as SPICE, CIDEr, ROUGE, BLEU, METEOR, and WMD. Classical video description approaches combined subject, object, and verb detection with template-based language models to generate sentences. However, the release of large datasets revealed that these methods cannot cope with the diversity of unconstrained open-domain videos. Classical approaches were followed by a brief era of statistical methods, which were soon replaced by deep learning, the current state of the art in video description. Our survey shows that, despite fast-paced developments, video description research is still in its infancy for the following reasons. Analyzing video description models is challenging, because it is difficult to attribute the accuracy of, or errors in, the final description to the visual features versus the adopted language model. Existing datasets contain neither adequate visual diversity nor sufficient complexity of linguistic structures. Finally, current evaluation metrics fall short of measuring the agreement between machine-generated descriptions and those written by humans. We conclude our survey by listing promising directions for future research.
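To make the discussion of evaluation metrics concrete, the sketch below illustrates clipped n-gram precision, the overlap statistic at the core of BLEU (and, in modified forms, of several other overlap-based metrics). It is a minimal, self-contained Python illustration under simplifying assumptions, not the official BLEU implementation (which additionally applies a brevity penalty and a geometric mean over n-gram orders), and the candidate and reference captions are invented purely for the example.

```python
from collections import Counter

def ngram_precision(candidate, references, n):
    """Clipped n-gram precision: the fraction of candidate n-grams that also
    occur in a reference, with each n-gram's credit clipped to its maximum
    count in any single reference (the overlap idea underlying BLEU)."""
    cand_counts = Counter(zip(*[candidate[i:] for i in range(n)]))
    if not cand_counts:
        return 0.0
    # For each n-gram, the allowed (clipped) count is its highest count in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(zip(*[ref[i:] for i in range(n)])).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Invented example captions for a single video clip (not taken from any dataset).
candidate = "a man is slicing an onion in the kitchen".split()
references = [
    "a man is cutting an onion".split(),
    "someone chops vegetables on a kitchen counter".split(),
]

for n in (1, 2):
    print(f"{n}-gram precision: {ngram_precision(candidate, references, n):.2f}")
```

In this toy example the candidate scores reasonably on unigram overlap but much lower on bigram overlap simply because it phrases the same event differently from the references, which hints at why purely surface-overlap metrics can disagree with human judgments of description quality.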