Published in: Artificial Intelligence Review 5/2022

16-01-2022

A comprehensive review of the video-to-text problem

Authors: Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello Said

Abstract

Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information comes from videos, we enter Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be made mainly in two ways: by retrieving the most relevant descriptions from a corpus, or by generating a new description for the given video. These two approaches correspond to essential tasks for the Computer Vision and Natural Language Processing communities, called the text-retrieval-from-video task and the video captioning/description task. Both tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity in both the visual content and the structure of the associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, discussing their strengths and drawbacks with respect to the problem requirements. We also show the progress that researchers have made on each dataset, cover the open challenges in the field, and discuss future research directions.
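
To make the retrieval formulation concrete, the following is a minimal sketch (not taken from any of the reviewed methods) of how candidate descriptions can be ranked for a query video once both modalities have been mapped into a common embedding space: each corpus description is scored by cosine similarity to the video embedding, and the best-scoring descriptions are returned first. The 512-dimensional random vectors are stand-in placeholders; producing good embeddings is exactly what the surveyed approaches differ on.

```python
# Toy sketch of the retrieval side of the video-to-text problem.
# Assumes video and description embeddings already live in a shared
# 512-dimensional space; the random vectors below are placeholders.
import numpy as np

rng = np.random.default_rng(0)
video_embedding = rng.normal(size=512)                  # query video
description_embeddings = rng.normal(size=(1000, 512))   # corpus of candidate descriptions

def rank_descriptions(query, corpus):
    """Return corpus indices sorted from most to least similar to the query."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query                              # cosine similarities
    return np.argsort(-scores)

ranking = rank_descriptions(video_embedding, description_embeddings)
print("Top-5 retrieved descriptions:", ranking[:5])
```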

Footnotes
1
Dense-Captioning Events in Videos task of ActivityNet 2019 challenge website: http://activity-net.org/challenges/2019/tasks/anet_captioning.html.
 
3
Joint embeddings are usually built by mapping semantically associated inputs from two or more domains (e.g., images and text) into a common vector space.
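
As an illustration only (the footnote does not prescribe a particular model), the sketch below shows one common way such a joint embedding can be realized: two linear projections map pre-extracted video and text features into a shared space, and a margin-based ranking loss pulls matching video-description pairs together while pushing mismatched pairs apart. The feature dimensions, margin value, and single-layer projections are assumptions made for the example.

```python
# Hypothetical sketch of a two-branch joint embedding trained with a
# margin-based (triplet-style) ranking loss. Dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch

    def forward(self, video_feat, text_feat):
        # L2-normalize so that a dot product equals cosine similarity
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    # Pairwise similarities for a batch; the diagonal holds matching pairs.
    sim = v @ t.t()
    pos = sim.diag().unsqueeze(1)
    # Hinge terms for mismatched captions (rows) and mismatched videos (columns).
    cost_text = (margin + sim - pos).clamp(min=0)
    cost_video = (margin + sim - pos.t()).clamp(min=0)
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_text = cost_text.masked_fill(eye, 0)
    cost_video = cost_video.masked_fill(eye, 0)
    return cost_text.mean() + cost_video.mean()

# Example usage with random stand-in features for a batch of 8 clips.
model = JointEmbedding()
v, t = model(torch.randn(8, 2048), torch.randn(8, 300))
loss = ranking_loss(v, t)
loss.backward()
```

In practice the linear projections are often replaced by deeper video and sentence encoders, and retrieval then reduces to nearest-neighbor search in the shared space.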
 
8
 
9
Dense-Captioning Events in Videos task of ActivityNet 2019 challenge website: http://activity-net.org/challenges/2019/tasks/anet_captioning.html.
 
10
Captioning tab in ActivityNet 2018 evaluation website: http://activity-net.org/challenges/2018/evaluation.html.
 
11
Captioning tab in ActivityNet 2019 evaluation website: http://activity-net.org/challenges/2019/evaluation.html.
 
14
Global Vectors for Word Representation (GloVe) website: https://nlp.stanford.edu/projects/glove/
 
16
MTurk is an online framework that allows researchers to post annotation tasks, called HITs (Human Intelligence Tasks).
 
17
MSR-VTT-10K dataset website: http://ms-multimedia-challenge.com/2016/dataset
 
18
MSR-VTT 2017 dataset website: http://ms-multimedia-challenge.com/2017/dataset
 
22
Descriptive Video Service (DVS) is a major service specialized in audio description, which aims to describe the visual content in the form of narration. These narrations are commonly placed during natural pauses in the original audio of the video, and sometimes during dialogues.
 
23
MPII-MD dataset website: www.mpi-inf.mpg.de
 
24
MPII-MD Co-ref+Gender dataset website: www.mpi-inf.mpg.de
 
25
MPII Cooking Activities dataset website: www.mpi-inf.mpg.de
 
26
MPII Cooking Composite Activities dataset website: www.mpi-inf.mpg.de
 
28
TACoS-Multilevel dataset website: www.mpi-inf.mpg.de
 
31
20BN-something-something dataset website: https://20bn.com/datasets/something-something
 
32
WikiHow is an online resource that contains 120,000 "How to ..." articles covering a variety of domains, ranging from cooking to human relationships, structured in a hierarchy: https://www.wikihow.com/
 
33
They achieved the results on the Charades Caption dataset, which was obtained by pre-processing the raw Charades dataset.
 
go back to reference Shen Z, Li J, Su Z, Li M, Chen Y, Jiang YG, Xue X (2017) Weakly supervised dense video captioning. In: IEEE CVPR, pp 1916–1924 Shen Z, Li J, Su Z, Li M, Chen Y, Jiang YG, Xue X (2017) Weakly supervised dense video captioning. In: IEEE CVPR, pp 1916–1924
go back to reference Shetty R, Laaksonen J (2016) Frame- and segment-level features and candidate pool evaluation for video caption generation. ACM MM. ACM, New York, NY, USA, pp 1073–1076 Shetty R, Laaksonen J (2016) Frame- and segment-level features and candidate pool evaluation for video caption generation. ACM MM. ACM, New York, NY, USA, pp 1073–1076
go back to reference Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe B, Matas J, Sebe N, Welling M (eds) ECCV. Springer International Publishing, Amsterdam, The Netherlands, pp 510–526 Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe B, Matas J, Sebe N, Welling M (eds) ECCV. Springer International Publishing, Amsterdam, The Netherlands, pp 510–526
go back to reference Sigurdsson GA, Gupta A, Schmid C, Farhadi A, Alahari K (2018) Actor and observer: joint modeling of first and third-person videos. In: IEEE CVPR, pp 7396–7404 Sigurdsson GA, Gupta A, Schmid C, Farhadi A, Alahari K (2018) Actor and observer: joint modeling of first and third-person videos. In: IEEE CVPR, pp 7396–7404
go back to reference Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR, San Diego, CA, USA Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR, San Diego, CA, USA
go back to reference Singh A, Singh TD, Bandyopadhyay S (2020) NITS-VC system for VATEX video captioning challenge 2020 Singh A, Singh TD, Bandyopadhyay S (2020) NITS-VC system for VATEX video captioning challenge 2020
go back to reference Snoek CGM, Dong J, Li X, Wang X, Wei Q, Lan W, Gavves E, Hussein N, Koelma DC, M Smeulders AW (2016) University of Amsterdam and Renmin University at TRECVID 2016: searching video, detecting events and describing video. In: TRECVID, p 5 Snoek CGM, Dong J, Li X, Wang X, Wei Q, Lan W, Gavves E, Hussein N, Koelma DC, M Smeulders AW (2016) University of Amsterdam and Renmin University at TRECVID 2016: searching video, detecting events and describing video. In: TRECVID, p 5
go back to reference Snoek CGM, Li X, Xu C, Koelma DC (2017a) Searching video, detecting events and describing video. In: TRECVID Snoek CGM, Li X, Xu C, Koelma DC (2017a) Searching video, detecting events and describing video. In: TRECVID
go back to reference Snoek CGM, Li X, Xu C, Koelma DC (2017b) University of Amsterdam and Renmin University at TRECVID 2017: searching video, detecting events and describing video. In: TRECVID Snoek CGM, Li X, Xu C, Koelma DC (2017b) University of Amsterdam and Renmin University at TRECVID 2017: searching video, detecting events and describing video. In: TRECVID
go back to reference Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2019a) From deterministic to generative: multimodal stochastic RNNS for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058CrossRef Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2019a) From deterministic to generative: multimodal stochastic RNNS for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058CrossRef
go back to reference Song Y, Zhao Y, Chen S, Jin Q (2019b) RUC\_AIM3 at TRECVID 2019: Video to text. In: TRECVID Song Y, Zhao Y, Chen S, Jin Q (2019b) RUC\_AIM3 at TRECVID 2019: Video to text. In: TRECVID
go back to reference Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using lstms. In: ICML, JMLR.org, Lille, France, ICML ’15, vol 37, p 843-852 Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using lstms. In: ICML, JMLR.org, Lille, France, ICML ’15, vol 37, p 843-852
go back to reference Srivastava Y, Murali V, Dubey SR, Mukherjee S (2019) visual question answering using deep learning: a survey and performance analysis. CoRR abs/1909.0 Srivastava Y, Murali V, Dubey SR, Mukherjee S (2019) visual question answering using deep learning: a survey and performance analysis. CoRR abs/1909.0
go back to reference Sun C, Myers A, Vondrick C, Murphy K, Schmid C, Research G (2019a) VideoBERT: a joint model for video and language representation learning. In: IEEE ICCV, pp 7464–7473 Sun C, Myers A, Vondrick C, Murphy K, Schmid C, Research G (2019a) VideoBERT: a joint model for video and language representation learning. In: IEEE ICCV, pp 7464–7473
go back to reference Sun L, Li B, Yuan C, Zha Z, Hu W (2019b) Multimodal semantic attention network for video captioning. In: IEEE ICME, IEEE, pp 1300–1305 Sun L, Li B, Yuan C, Zha Z, Hu W (2019b) Multimodal semantic attention network for video captioning. In: IEEE ICME, IEEE, pp 1300–1305
go back to reference Szegedy C, Wei Liu, Yangqing Jia, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: IEEE CVPR, IEEE, vol 07–12-June, pp 1–9 Szegedy C, Wei Liu, Yangqing Jia, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: IEEE CVPR, IEEE, vol 07–12-June, pp 1–9
go back to reference Tang P, Wang H, Li Q (2019) Rich visual and language representation with complementary semantics for video captioning. ACM Trans Multimedia Compu Commun Appl 15(2):1–23CrossRef Tang P, Wang H, Li Q (2019) Rich visual and language representation with complementary semantics for video captioning. ACM Trans Multimedia Compu Commun Appl 15(2):1–23CrossRef
go back to reference Tapaswi M, Zhu Y, Stiefelhagen R, Torralba A, Urtasun R, Fidler S (2016) MovieQA: understanding stories in movies through question-answering. In: IEEE CVPR, IEEE, pp 4631–4640 Tapaswi M, Zhu Y, Stiefelhagen R, Torralba A, Urtasun R, Fidler S (2016) MovieQA: understanding stories in movies through question-answering. In: IEEE CVPR, IEEE, pp 4631–4640
go back to reference Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R (2014) Integrating language and vision to generate natural language descriptions of videos in the wild. COLING. Dublin, Ireland, pp 1218–1227 Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R (2014) Integrating language and vision to generate natural language descriptions of videos in the wild. COLING. Dublin, Ireland, pp 1218–1227
go back to reference Torabi A, Pal C, Larochelle H, Courville A (2015) Using descriptive video services to create a large data source for video annotation research. CoRR abs/1503.0 Torabi A, Pal C, Larochelle H, Courville A (2015) Using descriptive video services to create a large data source for video annotation research. CoRR abs/1503.0
go back to reference Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: IEEE/CVF CVPR, IEEE, pp 6450–6459 Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: IEEE/CVF CVPR, IEEE, pp 6450–6459
go back to reference Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517CrossRef Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517CrossRef
go back to reference Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhim I (2017) Attention is all you need. NIPS. Curran Associates Inc., Long Beach, California, USA, pp 6000–6010 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhim I (2017) Attention is all you need. NIPS. Curran Associates Inc., Long Beach, California, USA, pp 6000–6010
go back to reference Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: Consensus-based image description evaluation. In: IEEE CVPR, IEEE, pp 4566–4575 Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: Consensus-based image description evaluation. In: IEEE CVPR, IEEE, pp 4566–4575
go back to reference Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015a) Sequence to sequence – video to text. In: IEEE ICCV, IEEE, vol 2015 Inter, pp 4534–4542 Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015a) Sequence to sequence – video to text. In: IEEE ICCV, IEEE, vol 2015 Inter, pp 4534–4542
go back to reference Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K (2015b) Translating videos to natural language using deep recurrent neural networks. In: Conference of the North American chapter of the acl: human language technologies. ACL, Stroudsburg, PA, USA, June, pp 1494–1504 Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K (2015b) Translating videos to natural language using deep recurrent neural networks. In: Conference of the North American chapter of the acl: human language technologies. ACL, Stroudsburg, PA, USA, June, pp 1494–1504
go back to reference Venugopalan S, Hendricks LA, Mooney R, Saenko K (2016) Improving LSTM-based video description with linguistic knowledge mined from text. EMNLP. ACL, Stroudsburg, PA, USA, pp 1961–1966 Venugopalan S, Hendricks LA, Mooney R, Saenko K (2016) Improving LSTM-based video description with linguistic knowledge mined from text. EMNLP. ACL, Stroudsburg, PA, USA, pp 1961–1966
go back to reference Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE CVPR, IEEE, vol 07–12-June, pp 3156–3164 Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE CVPR, IEEE, vol 07–12-June, pp 3156–3164
go back to reference Wang B, Ma L, Zhang W, Liu W (2018a) Reconstruction network for video captioning. In: IEEE CVPR, pp 7622–7631 Wang B, Ma L, Zhang W, Liu W (2018a) Reconstruction network for video captioning. In: IEEE CVPR, pp 7622–7631
go back to reference Wang B, Ma L, Zhang W, Jiang W, Wang J, Liu W (2019a) Controllable video captioning with POS sequence guidance based on gated fusion network. In: IEEE ICCV Wang B, Ma L, Zhang W, Jiang W, Wang J, Liu W (2019a) Controllable video captioning with POS sequence guidance based on gated fusion network. In: IEEE ICCV
go back to reference Wang H, Schmid C (2013) Action recognition with improved trajectories. In: IEEE ICCV, IEEE, pp 3551–3558 Wang H, Schmid C (2013) Action recognition with improved trajectories. In: IEEE ICCV, IEEE, pp 3551–3558
go back to reference Wang H, Divakaran A, Vetro A, Chang SF, Sun H (2003) Survey of compressed-domain features used in audio-visual indexing and analysis. J Vis Commun Image Represent 14:150–183CrossRef Wang H, Divakaran A, Vetro A, Chang SF, Sun H (2003) Survey of compressed-domain features used in audio-visual indexing and analysis. J Vis Commun Image Represent 14:150–183CrossRef
go back to reference Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC, British Machine Vision Association, BMVA Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC, British Machine Vision Association, BMVA
go back to reference Wang H, Klaser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: IEEE CVPR, IEEE, pp 3169–3176 Wang H, Klaser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: IEEE CVPR, IEEE, pp 3169–3176
go back to reference Wang J, Jiang W, Ma L, Liu W, Xu Y (2018b) Bidirectional attentive fusion with context gating for dense video captioning. In: IEEE/CVF CVPR, IEEE, pp 7190–7198 Wang J, Jiang W, Ma L, Liu W, Xu Y (2018b) Bidirectional attentive fusion with context gating for dense video captioning. In: IEEE/CVF CVPR, IEEE, pp 7190–7198
go back to reference Wang J, Wang W, Huang Y, Wang L, Tan T (2018c) M3: Multimodal memory modelling for video captioning. In: IEEE/CVF CVPR, IEEE, pp 7512–7520 Wang J, Wang W, Huang Y, Wang L, Tan T (2018c) M3: Multimodal memory modelling for video captioning. In: IEEE/CVF CVPR, IEEE, pp 7512–7520
go back to reference Wang X, Chen W, Wu J, Wang YF, Wang WY (2018d) Video captioning via hierarchical reinforcement learning. In: IEEE/CVF CVPR, IEEE, pp 4213–4222 Wang X, Chen W, Wu J, Wang YF, Wang WY (2018d) Video captioning via hierarchical reinforcement learning. In: IEEE/CVF CVPR, IEEE, pp 4213–4222
go back to reference Wang X, Wang YF, Wang WY (2018e) Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. In: Conference of the North American chapter of the acl: human language technologies, ACL, Stroudsburg, PA, USA 2:795–801 Wang X, Wang YF, Wang WY (2018e) Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. In: Conference of the North American chapter of the acl: human language technologies, ACL, Stroudsburg, PA, USA 2:795–801
go back to reference Wang X, Jabri A, Efros AA (2019b) Learning Correspondence from the Cycle-consistency of Time. In: IEEE/CVF CVPR, pp. 2566–2576 Wang X, Jabri A, Efros AA (2019b) Learning Correspondence from the Cycle-consistency of Time. In: IEEE/CVF CVPR, pp. 2566–2576
go back to reference Wang X, Wu J, Chen J, Li L, Wang YF, Wang WY (2019c) VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In: IEEE ICCV, pp 4581–4591 Wang X, Wu J, Chen J, Li L, Wang YF, Wang WY (2019c) VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In: IEEE ICCV, pp 4581–4591
go back to reference Wei R, Mi L, Hu Y, Chen Z (2020) Exploiting the local temporal information for video captioning. J Vis Commun Image Represent 67:102751CrossRef Wei R, Mi L, Hu Y, Chen Z (2020) Exploiting the local temporal information for video captioning. J Vis Commun Image Represent 67:102751CrossRef
go back to reference Weinberger KQ, Blitzer J, Lawrence K S (2005) Distance metric learning for large margin nearest neighbor classification. In: NIPS, pp 1473–1480 Weinberger KQ, Blitzer J, Lawrence K S (2005) Distance metric learning for large margin nearest neighbor classification. In: NIPS, pp 1473–1480
go back to reference Wray M, Csurka G, Larlus D, Damen D (2019) Fine-grained action retrieval through multiple parts-of-speech embeddings. In: IEEE ICCV, IEEE, pp 450–459 Wray M, Csurka G, Larlus D, Damen D (2019) Fine-grained action retrieval through multiple parts-of-speech embeddings. In: IEEE ICCV, IEEE, pp 450–459
go back to reference Wu X, Li G, Cao Q, Ji Q, Lin L (2018) Interpretable video captioning via trajectory structured localization. In: IEEE/CVF CVPR, IEEE, pp 6829–6837 Wu X, Li G, Cao Q, Ji Q, Lin L (2018) Interpretable video captioning via trajectory structured localization. In: IEEE/CVF CVPR, IEEE, pp 6829–6837
go back to reference Xiao H, Shi J (2019) A Novel attribute selection mechanism for video captioning. In: IEEE ICIP, IEEE, pp 619–623 Xiao H, Shi J (2019) A Novel attribute selection mechanism for video captioning. In: IEEE ICIP, IEEE, pp 619–623
go back to reference Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: IEEE CVPR, IEEE, vol 2017-Janua, pp 5987–5995 Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: IEEE CVPR, IEEE, vol 2017-Janua, pp 5987–5995
go back to reference Xie S, Sun C, Huang J, Tu Z, Murphy K (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: ECCV, pp 305–321 Xie S, Sun C, Huang J, Tu Z, Murphy K (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: ECCV, pp 305–321
go back to reference Xu H, Venugopalan S, Ramanishka V, Rohrbach M, Saenko K (2015a) A multi-scale multiple instance video description network. arXiv:1505.05914 Xu H, Venugopalan S, Ramanishka V, Rohrbach M, Saenko K (2015a) A multi-scale multiple instance video description network. arXiv:​1505.​05914
go back to reference Xu H, Li B, Ramanishka V, Sigal L, Saenko K (2019a) Joint event detection and description in continuous video streams. In: IEEE WACVW, IEEE, pp 25–26 Xu H, Li B, Ramanishka V, Sigal L, Saenko K (2019a) Joint event detection and description in continuous video streams. In: IEEE WACVW, IEEE, pp 25–26
go back to reference Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: a large video description dataset for bridging video and language. 2016 IEEE CVPR pp 5288–5296 Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: a large video description dataset for bridging video and language. 2016 IEEE CVPR pp 5288–5296
go back to reference Xu R, Xiong C, Chen W, Corso JJ (2015b) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: AAAI, pp 2346–2352 Xu R, Xiong C, Chen W, Corso JJ (2015b) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: AAAI, pp 2346–2352
go back to reference Xu Y, Yang J, Mao K (2019b) Semantic-filtered soft-split-aware video captioning with audio-augmented feature. Neurocomputing 357:24–35CrossRef Xu Y, Yang J, Mao K (2019b) Semantic-filtered soft-split-aware video captioning with audio-augmented feature. Neurocomputing 357:24–35CrossRef
go back to reference Yan C, Tu Y, Wang X, Zhang Y, Hao X, Zhang Y, Dai Q (2020) STAT: spatial-temporal attention mechanism for video captioning. IEEE Trans Multimedia 22(1):229–241CrossRef Yan C, Tu Y, Wang X, Zhang Y, Hao X, Zhang Y, Dai Q (2020) STAT: spatial-temporal attention mechanism for video captioning. IEEE Trans Multimedia 22(1):229–241CrossRef
go back to reference Yang X, Zhang T, Xu C (2016) Semantic feature mining for video event understanding. ACM Trans Multimedia Comput Commun Appl 12(4):1–22CrossRef Yang X, Zhang T, Xu C (2016) Semantic feature mining for video event understanding. ACM Trans Multimedia Comput Commun Appl 12(4):1–22CrossRef
go back to reference Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y (2018) Video captioning by adversarial LSTM. IEEE Trans Image Process 27(11):5600–5611MathSciNetCrossRef Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y (2018) Video captioning by adversarial LSTM. IEEE Trans Image Process 27(11):5600–5611MathSciNetCrossRef
go back to reference Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: IEEE ICCV, IEEE, pp 4507–4515 Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: IEEE ICCV, IEEE, pp 4507–4515
go back to reference Yao T, Li Y, Qiu Z, Long F, Pan Y, Li D, Mei T (2017) MSR Asia MSM at activitynet challenge 2017: trimmed action recognition. In: Temporal action proposals and dense-captioning events in videos. Tech rep, Microsoft Yao T, Li Y, Qiu Z, Long F, Pan Y, Li D, Mei T (2017) MSR Asia MSM at activitynet challenge 2017: trimmed action recognition. In: Temporal action proposals and dense-captioning events in videos. Tech rep, Microsoft
go back to reference Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: NIPS, MIT Press, pp 3320–3328 Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: NIPS, MIT Press, pp 3320–3328
go back to reference Yu E, Gao M, Li Y, Dong X, Sun J (2017a) Shandong Normal University in the VTT Tasks at TRECVID 2017. In: TRECVID Yu E, Gao M, Li Y, Dong X, Sun J (2017a) Shandong Normal University in the VTT Tasks at TRECVID 2017. In: TRECVID
go back to reference Yu H, Siskind JM, Lafayette W (2015a) Learning to describe video with weak supervision by exploiting negative sentential information. AAAI. AAAI Press, Austin, Texas, pp 3855–3863 Yu H, Siskind JM, Lafayette W (2015a) Learning to describe video with weak supervision by exploiting negative sentential information. AAAI. AAAI Press, Austin, Texas, pp 3855–3863
go back to reference Yu H, Wang J, Huang Z, Yang Y, Xu W (2016) Video Paragraph captioning using hierarchical recurrent neural networks. In: IEEE CVPR, IEEE, pp 4584–4593 Yu H, Wang J, Huang Z, Yang Y, Xu W (2016) Video Paragraph captioning using hierarchical recurrent neural networks. In: IEEE CVPR, IEEE, pp 4584–4593
go back to reference Yu L, Park E, Berg AC, Berg TL (2015b) Visual Madlibs: Fill in the blank description generation and question answering. In: IEEE ICCV, IEEE, vol 2015 Inter, pp 2461–2469 Yu L, Park E, Berg AC, Berg TL (2015b) Visual Madlibs: Fill in the blank description generation and question answering. In: IEEE ICCV, IEEE, vol 2015 Inter, pp 2461–2469
go back to reference Yu Y, Choi J, Kim Y, Yoo K, Lee SH, Kim G (2017b) Supervising neural attention models for video captioning by human gaze data. In: IEEE CVPR, IEEE, pp 6119–6127 Yu Y, Choi J, Kim Y, Yoo K, Lee SH, Kim G (2017b) Supervising neural attention models for video captioning by human gaze data. In: IEEE CVPR, IEEE, pp 6119–6127
go back to reference Yu Y, Ko H, Choi J, Kim G (2017c) End-to-end concept word detection for video captioning, retrieval, and question answering. In: IEEE CVPR, IEEE, pp 3261–3269 Yu Y, Ko H, Choi J, Kim G (2017c) End-to-end concept word detection for video captioning, retrieval, and question answering. In: IEEE CVPR, IEEE, pp 3261–3269
go back to reference Yuan J, Tian C, Zhang X, Ding Y, Wei W (2018) Video captioning with semantic guiding. In: IEEE BigMM, IEEE, pp 1–5 Yuan J, Tian C, Zhang X, Ding Y, Wei W (2018) Video captioning with semantic guiding. In: IEEE BigMM, IEEE, pp 1–5
go back to reference Zeng KH, Chen TH, Niebles JC, Sun M (2016) Title generation for user generated videos. In: ECCV, Springer International Publishing, pp 609–625 Zeng KH, Chen TH, Niebles JC, Sun M (2016) Title generation for user generated videos. In: ECCV, Springer International Publishing, pp 609–625
go back to reference Zhang B, Hu H, Sha F (2018) Cross-modal and hierarchical modeling of video and text. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) ECCV. Springer International Publishing, Cham, pp 385–401 Zhang B, Hu H, Sha F (2018) Cross-modal and hierarchical modeling of video and text. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) ECCV. Springer International Publishing, Cham, pp 385–401
go back to reference Zhang H, Pang L, Lu YJ, Ngo CW (2016) VIREO @ TRECVID 2016: multimedia event detection, ad-hoc video search, video-to-text description. In: TRECVID Zhang H, Pang L, Lu YJ, Ngo CW (2016) VIREO @ TRECVID 2016: multimedia event detection, ad-hoc video search, video-to-text description. In: TRECVID
go back to reference Zhang W, Wang B, Ma L, Liu W (2019a) Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Trans Pattern Anal Mach Intell Zhang W, Wang B, Ma L, Liu W (2019a) Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Trans Pattern Anal Mach Intell
go back to reference Zhang X, Zhang Y, Zhang D, Li J, Qi Tian A (2017) Task-driven dynamic fusion: reducing ambiguity in video description. In: IEEE CVPR, IEEE, pp 6250–6258 Zhang X, Zhang Y, Zhang D, Li J, Qi Tian A (2017) Task-driven dynamic fusion: reducing ambiguity in video description. In: IEEE CVPR, IEEE, pp 6250–6258
go back to reference Zhang Z, Xu D, Ouyang W, Tan C (2019b) Show, tell and summarize: dense video captioning using visual cue aided sentence summarization. IEEE Trans Circuits Syst Video Technol Zhang Z, Xu D, Ouyang W, Tan C (2019b) Show, tell and summarize: dense video captioning using visual cue aided sentence summarization. IEEE Trans Circuits Syst Video Technol
go back to reference Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z (2020) Object relational graph with teacher-recommended learning for video captioning. In: IEEE/CVF CVPR, pp 13278–13288 Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z (2020) Object relational graph with teacher-recommended learning for video captioning. In: IEEE/CVF CVPR, pp 13278–13288
go back to reference Zhao Y, Song Y, Chen S, Jin Q (2020) RUC\_AIM3 at TRECVID 2020: Ad-hoc Video Search and Video to Text Description. In: TRECVID Zhao Y, Song Y, Chen S, Jin Q (2020) RUC\_AIM3 at TRECVID 2020: Ad-hoc Video Search and Video to Text Description. In: TRECVID
go back to reference Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A (2018a) Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1452–1464CrossRef Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A (2018a) Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1452–1464CrossRef
go back to reference Zhou L, Xu C, Corso JJ (2018b) Towards automatic learning of procedures from web instructional videos. In: AAAI, Association for the Advancement of Artificial Intelligence, pp 7590–7598 Zhou L, Xu C, Corso JJ (2018b) Towards automatic learning of procedures from web instructional videos. In: AAAI, Association for the Advancement of Artificial Intelligence, pp 7590–7598
go back to reference Zhou L, Zhou Y, Corso JJ, Socher R, Xiong C (2018c) End-to-end dense video captioning with masked transformer. In: IEEE/CVF CVPR, IEEE, pp 8739–8748 Zhou L, Zhou Y, Corso JJ, Socher R, Xiong C (2018c) End-to-end dense video captioning with masked transformer. In: IEEE/CVF CVPR, IEEE, pp 8739–8748
go back to reference Zhou L, Kalantidis Y, Chen X, Corso JJ, Rohrbach M (2019) Grounded video description. In: IEEE/CVF CVPR, IEEE, pp 6571–6580 Zhou L, Kalantidis Y, Chen X, Corso JJ, Rohrbach M (2019) Grounded video description. In: IEEE/CVF CVPR, IEEE, pp 6571–6580
go back to reference Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE ICCV, IEEE, vol 2017-October, pp 2242–2251 Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE ICCV, IEEE, vol 2017-October, pp 2242–2251
go back to reference Zolfaghari M, Singh K, Brox T (2018) ECO: efficient convolutional network for online video understanding. In: ECCV, Springer International Publishing, pp 713–730 Zolfaghari M, Singh K, Brox T (2018) ECO: efficient convolutional network for online video understanding. In: ECCV, Springer International Publishing, pp 713–730
Metadata
Title: A comprehensive review of the video-to-text problem
Authors: Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello Said
Publication date: 16-01-2022
Publisher: Springer Netherlands
Published in: Artificial Intelligence Review, Issue 5/2022
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI: https://doi.org/10.1007/s10462-021-10104-1
