Top

Artificial Intelligence Review

Published in:

16-01-2022

A comprehensive review of the video-to-text problem

Authors: Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello Said

Published in: Artificial Intelligence Review | Issue 5/2022

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information is related to videos, this takes us into Video-Text Research, which includes several challenging tasks such as video question answering, video summarization with natural language, and video-to-text and text-to-video conversion. This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description. This association can be mainly made by retrieving the most relevant descriptions from a corpus or generating a new one given a context video. These two ways represent essential tasks for Computer Vision and Natural Language Processing communities, called text retrieval from video task and video captioning/description task. These two tasks are substantially more complex than predicting or retrieving a single sentence from an image. The spatiotemporal information present in videos introduces diversity and complexity regarding the visual content and the structure of associated language descriptions. This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. We analyze twenty-six benchmark datasets, showing their drawbacks and strengths for the problem requirements. We also show the progress that researchers have made on each dataset, we cover the challenges in the field, and we discuss future research directions.

previous article Hybrid group decision-making technique under spherical fuzzy N-soft expert sets

next article Information-utilization strengthened equilibrium optimizer

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Dense-Captioning Events in Videos task of ActivityNet 2019 challenge website: http://activity-net.org/challenges/2019/tasks/anet_captioning.html.

https://www.nltk.org/book/ch05.html.

Joint embeddings are usually done by mapping semantically associated inputs from two or more domains (e.g., images and text) into a common vector space

labelnote:lsmdcLSMDC 2017 challenge website: https://sites.google.com/site/describingmovies/lsmdc-2017

LSMDC 2019 challenge website: https://sites.google.com/site/describingmovies/lsmdc-2019

ICCV Workshop for LSMDC 2015 challenge website: https://sites.google.com/site/describingmovies/workshop-at-iccv-15

MS COCO Caption Evaluation: https://github.com/tylin/coco-caption

Microsoft Multimedia Challenge website: http://ms-multimedia-challenge.com/2017/challenge

labelnote:actnettaskDense-Captioning Events in Videos task of ActivityNet 2019 challenge website: http://activity-net.org/challenges/2019/tasks/anet_captioning.html.

labelnote:actnet18Captioning tab in ActivityNet 2018 evaluation website: http://activity-net.org/challenges/2018/evaluation.html.

labelnote:actnet19Captioning tab in ActivityNet 2019 evaluation website: http://activity-net.org/challenges/2019/evaluation.html.

VATEX Captioning Challenge website: https://eric-xw.github.io/vatex-website/index.html.

Categorizing and tagging words: http://www.nltk.org/book/ch05.html.

Global Vectors for Word Representation (GloVe) website: https://nlp.stanford.edu/projects/glove/

Datasets summary available at: https://github.com/jssprz/video_captioning_datasets

MTurk is an online framework that allows researchers to post annotation tasks, called HITs (Human Intelligence Task)

labelnote:msrvtt16MSR-VTT-10K dataset website: http://ms-multimedia-challenge.com/2016/dataset

labelnote:msrvtt17MSR-VTT 2017 dataset website: http://ms-multimedia-challenge.com/2017/dataset

Tumblr website: https://www.tumblr.com

labelnote:vatexVaTeX dataset website: https://eric-xw.github.io/vatex-website/index.html.

Kinetics dataset website: https://deepmind.com/research/open-source/kinetics

Descriptive Video Service (DVS) is a major specialized in audio description, which aims to describe the visual content in form of narration. These narrations are commonly placed during natural pauses in the original audio of the video, and sometimes during dialogues

MPII-MD dataset website: www.mpi-inf.mpg.de

MPII-MD Co-ref+Gender dataset website: www.mpi-inf.mpg.de

MP-II Cooking Activities dataset website: www.mpi-inf.mpg.de

MPII Cooking Composite Activities dataset website: www.mpi-inf.mpg.de

YouCook dataset website: http://web.eecs.umich.edu/~jjcorso/r/youcook/

TACoS-Multilevel dataset website: www.mpi-inf.mpg.de

Charades dataset website: https://allenai.org/plato/charades/

ActivityNet Captions dataset website: https://cs.stanford.edu/people/ranjaykrishna/densevid/

20BN-something-something dataset website: https://20bn.com/datasets/something-something

WikiHow is an online resource that contains 120,000 articles on How to ... for a variety of domains ranging from cooking to human relationships structured in a hierarchy: https://www.wikihow.com/

labelnote:hrl16They achieved the results on the Charades Caption dataset, which was obtained by pre-processing the raw Charades dataset.

Aafaq N, Akhtar N, Syed WL, Gilani Z, Mian A (2019a) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: IEEE CVPR, pp 12487–12496

Aafaq N, Mian A, Liu W, Zulqarnain Gilani S, Mian A, Liu W, Gilani SZ, Shah M (2019b) Video description: a survey of methods, datasets, and evaluation metrics. ACM Comput Surv 52(6)

Abbas Q, Ibrahim ME, Jaffar MA (2019) A comprehensive review of recent advances on deep vision systems. Artif Intell Rev 52(1):39–76CrossRef

Anderson P, Fernando B, Johnson M, Gould S (2016) SPICE: semantic propositional image caption evaluation. ECCV. Springer, Springer Nature, pp 382–398

Awad G, Fiscus J, Joy D, Michel M, Smeaton AF, Kraaij W, Eskevich M, Aly R, Ordelman R, Jones GJF, Huet B, Larson M (2016) TRECVID 2016: Evaluating vdeo search, video event detection, localization, and hyperlinking. In: TRECVID, Gaithersburg, Ma, US

Awad G, Butt A, Fiscus J, Joy D, Delgado A, Michel M, Smeaton A, Graham Y, Kraaij W, Quénot G, Eskevich M, Ordelman R, Jones GJ, Huet B (2017) Trecvid 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking. In: TRECVID, Gaithersburg, Ma, US

Awad G, Butt AA, Curtis K, Yooyoung L, Fiscus J, Godil A, Joy D, Delgado A, Smeaton AF, Graham Y, Kraaij W, Quénot G, Magalhaes J, Semedo D, Blasi S (2018) TRECVID 2018: benchmarking video activity detection, video captioning and matching, video storytelling linking and video search. In: TRECVID, NIST, Gaithersburg, Ma, US

Awad G, Butt AA, Curtis K, Lee Y, Fiscus J, Godil A, Delgado A, Zhang J, Godard E, Diduch L, Smeaton AF, Graham Y, Kraaij W, Quénot G (2019) TRECVID 2019: An evaluation campaign to benchmark video activity detection, video captioning and matching, and video search and retrieval. In: TRECVID, Gaithersburg, Ma, US

Awad G, Butt AA, Curtis K, Lee Y, Fiscus J, Godil A, Delgado A, Zhang J, Godard E, Diduch L, Liu J, Smeaton AF, Graham Y, Jones GJF, Kraaij W, Quénot G (2020) TRECVID 2020: comprehensive campaign for evaluating video retrieval tasks across multiple application domains. In: TRECVID, NIST, US

Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Bengio Y, LeCun Y (eds) ICLR

Banerjee S, Lavie A (2005) METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or Summarization, pp 65–72

Baraldi L, Grana C, Cucchiara R (2017) Hierarchical boundary-aware neural encoder for video captioning. In: IEEE CVPR, IEEE, pp 3185–3194

Barbu A, Bridge A, Burchill Z, Coroian D, Dickinson S, Fidler S, Michaux A, Mussman S, Narayanaswamy S, Salvi D, Schmidt L, Shangguan J, Siskind JM, Waggoner J, Wang S, Wei J, Yin Y, Zhang Z (2012) Video in sentences out. arXiv:1204.2742

Bin Y, Yang Y, Shen F, Xie N, Shen HT, Li X (2019) Describing video with attention-based bidirectional LSTM. IEEE Trans Cybernet 49(7):2631–2641CrossRef

Bojanowski P, Lajugie R, Grave E, Bach F, Laptev I, Ponce J, Schmid C (2015) Weakly-supervised alignment of video with text. In: IEEE ICCV, IEEE, pp 4462–4470

Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) SST: single-stream temporal action proposals. In: IEEE CVPR, IEEE, vol 2017-January, pp 6373–6382

Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: IEEE CVPR, IEEE, pp 4724–4733

Caruana R (1998) Multitask learning. In: Thrun S, Pratt L (eds) Learning to learn. Springer, Boston, pp 95–133CrossRef

Celikyilmaz A, Clark E, Gao J (2020) Evaluation of text generation: a survey

Chen H, Li J, Hu X (2020a) Delving deeper into the decoder for video captioning. CoRR

Chen H, Lin K, Maye A, Li J, Hu X (2020b) A semantics-assisted video captioning model trained with scheduled sampling. Front Robot AI, 7

Chen J, Liang J, Liu J, Chen S, Gao C, Jin Q, Hauptmann A (2017a) Informedia @ TRECVID 2017. In: TRECVID

Chen J, Chen S, Jin Q, Hauptmann A (2018a) Informedia@TRECVID 2018. In: TRECVID

Chen J, Pan Y, Li Y, Yao T, Chao H, Mei T (2019a) Temporal deformable convolutional encoder-decoder networks for video captioning. AAAI 33:8167–8174CrossRef

Chen S, Chen J, Jin Q, Hauptmann A (2017b) Video captioning with guidance of multimodal latent topics. ACM MM. ACM Press, New York, pp 1838–1846

Chen S, Song Y, Zhao Y, Qiu J, Jin Q, Hauptmann A (2018b) RUC+CMU: system report for dense captioning events in videos. CoRR abs/1806.0

Chen S, Jin Q, Chen J, Hauptmann A (2019) Generating video descriptions with latent topic guidance. IEEE Trans Multimedia 21:2407–2418CrossRef

Chen S, Song Y, Zhao Y, Jin Q, Zeng Z, Liu B, Fu J, Hauptmann A (2019c) Activitynet 2019 task 3: exploring contexts for dense captioning events in videos. arXiv:1907.05092

Chen S, Zhao Y, Jin Q, Wu Q (2020c) Fine-grained video-text retrieval with hierarchical graph reasoning. In: IEEE/CVF CVPR

Chen X, Zitnick CL (2015) Mind’s eye: a recurrent visual representation for image caption generation. In: IEEE CVPR, IEEE, vol 07–12, pp 2422–2431, June 2015

Chen X, Fang H, Lin TY, Vedantam R, Gupta S, Dollar P, Zitnick CL (2015) Microsoft COCO captions: data collection and evaluation server. CoRR abs/1504.0

Chen X, Rohrbach M, Parikh D (2019d) Cycle-consistency for robust visual question answering. In: IEEE/CVF CVPR, pp 6649–6658

Chen Y, Wang S, Zhang W, Huang Q (2018c) Less is more: picking informative frames for video captioning. In: ECCV, Springer International Publishing, pp 367–384

Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP. ACL, Stroudsburg, pp 1724–1734

Craswell N (2009) Mean reciprocal rank. In: LIU L (ed) Encyclopedia of database systems, Springer US, Boston, pp 1703–1703

Dai J, Li Y, He K, Sun J (2016) R-FCN: object detection via region-based fully convolutional networks. In: NIPS, Barcelona, Spain, NIPS’16

Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. IEEE Comput Soc CVPR I:886–893

Das P, Xu C, Doell RF, Corso JJ (2013) A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching. IEEE Comput Soc CVPR. IEEE, Portland, OR, USA, pp 2634–2641

Davis S, Mermelstein P (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 28(4):357–366CrossRef

Deshpande A, Aneja J, Wang L, Schwing AG, Forsyth D (2019) Fast, diverse and accurate image captioning guided by part-of-speech. In: IEEE/CVF CVPR, IEEE, pp 10687–10696

Dollar P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, IEEE, pp 65–72

Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In: ICML, JMLR.org, Beijing, China

Donahue J, Hendricks LA, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39(4):677–691CrossRef

Dong J, Li X, Snoek CGM (2016) Word2VisualVec: image and video to sentence matching by visual feature prediction. CoRR abs/1604.0

Dong J, Li X, Snoek CGM (2018) Predicting visual features from text for image and video caption retrieval. IEEE Trans Multimedia 20(12):3377–3388CrossRef

Dong J, Li X, Xu C, Ji S, He Y, Yang G, Wang X (2019) Dual encoding for zero-example video retrieval. In: IEEE/CVF CVPR, IEEE, pp 9338–9347

Dwibedi D, Aytar Y, Tompson J, Sermanet P, Zisserman A (2019) Temporal cycle-consistency learning. In: IEEE/CVF CVPR, IEEE Comput Soc , vol 2019-June, pp 1801–1810

Eisenstein J (2019) Introduction to natural language processing. MIT Press, Cambridge

Elhamifar E, Sapiro G, Sastry SS (2016) Dissimilarity-based sparse subset selection. IEEE Trans Pattern Anal Mach Intell 38(11):2182–2197CrossRef

Faghri F, Fleet DJ, Kiros JR, Fidler S (2018) VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC

Fang H, Gupta S, Iandola F, Srivastava RK, Deng L, Dollar P, Gao J, He X, Mitchell M, Platt JC, Zitnick CL, Zweig G (2015) From captions to visual concepts and back. In: IEEE CVPR, IEEE, vol 07–12-June, pp 1473–1482

Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition. In: IEEE CVPR, IEEE, pp 7445–7454

Gan C, Gan Z, He X, Gao J, Deng L (2017a) StyleNet: generating attractive visual captions with styles. In: IEEE CVPR, IEEE, pp 955–964

Gan Z, Gan C, He X, Pu Y, Tran K, Gao J, Carin L, Deng L (2017b) Semantic compositional networks for visual captioning. In: IEEE CVPR, IEEE, vol 2017, pp 1141–1150

Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Trans Multimedia 19(9)

Gao L, Li X, Song J, Shen HT (2019) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans Pattern Anal Mach Intell, pp 1–19

Gatt A, Krahmer E (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. J Artif Intell Res 61:65–170MathSciNetMATHCrossRef

Ging S, Zolfaghari M, Pirsiavash H, Brox T (2020) COOT: cooperative hierarchical transformer for video-text representation learning. In: NIPS

Girshick R (2015) Fast R-CNN. In: IEEE ICCV, IEEE, pp 1440–1448

Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE CVPR, IEEE, pp 580–587

Goodfellow I, Bengio Y, Courville A (2016) Deep learning. The MIT Press, CambridgwMATH

Graham Y, Awad G, Smeaton A (2018) Evaluation of automatic video captioning using direct assessment. PLOS ONE 13(9):e0202789CrossRef

Graves A, Mohamed Ar, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: IEEE ICASSP, IEEE, pp 6645–6649

Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney R, Darrell T, Saenko K (2013) YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. IEEE ICCV 1:2712–2719

Guo Y, Yao B, Liu Y (2020) Sequence to sequence model for video captioning. Pattern Recognition Letters, pp 327–334

Han L, Kashyap AL, Finin T, Mayfield J, Weese J (2013) UMBC\_EBIQUITY-CORE: semantic textual similarity systems. In: Second joint conference on lexical and computational semantics

Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imageNet? In: IEEE CVPR

He K, Zhang X, Ren S, Sun J (2015) Delving Deep into rectifiers: surpassing human-level performance on imagenet classification. In: IEEE ICCV, IEEE, pp 1026–1034

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE CVPR, IEEE, vol 2016-Decem, pp 770–778

He X, Shi B, Bai X, Xia GS, Zhang Z, Dong W (2019) Image caption generation with part of speech guidance. Pattern Recogn Lett 119:229–237CrossRef

Heilbron FC, Escorcia V, Ghanem B, Niebles JC (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In: IEEE CVPR, IEEE, pp 961–970

Hemalatha M, Chandra Sekhar C (2020) Domain-specific semantics guided approach to video captioning. In: IEEE WACV, pp 1587–1596

Hendricks LA, Wang O, Shechtman E, Sivic J, Darrell T, Russell B (2017) Localizing moments in video with natural language. In: 2017 IEEE international conference on computer vision (ICCV), IEEE, pp 5804–5813

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780CrossRef

Hodosh M, Young P, Hockenmaier J (2015) Framing image description as a ranking task data, models and evaluation metrics extended abstract. In: IJCAI, pp 4188–4192

Hou J, Wu X, Zhao W, Luo J, Jia Y (2019) Joint syntax representation learning and visual cue translation for video captioning. In: IEEE ICCV

Hu Y, Chen Z, Zha ZJ, Wu F (2019) Hierarchical global-local temporal modeling for video captioning. ACM MM. ACM, New York, NY, USA, pp 774–783

Ilg E, Mayer N, Saikia T, Keuper M, Dosovitskiy A, Brox T (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In: IEEE CVPR, pp 2462–2470

Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231CrossRef

Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: IEEE CVPR, IEEE, pp 3128–3137

Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. IEEE CVPR. IEEE, Columbus, OH, US, pp 1725–1732

Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: ICLR, Neptune, Toulon, France

Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539

Kojima A, Tamura T, Fukunaga K (2002) Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions. International Journal of Computer Vision 50(2):171–184MATHCrossRef

Kong Y, Fu Y (2018) Human action recognition and prediction: a survey. arXiv:1806.11230

Krishna R, Hata K, Ren F, Fei-Fei L, Niebles JC (2017a) Dense-captioning events in videos. In: IEEE ICCV, IEEE, vol 2017-Octob, pp 706–715

Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA, Bernstein MS, Fei-Fei L (2017b) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73MathSciNetCrossRef

Krishnamoorthy N, Malkarnenkar G, Mooney R, Saenko K, Guadarrama S (2013) Generating natural-language video descriptions using text-mined knowledge. NAACL HLT workshop on vision and language pp 10–19

Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, Curran Associates Inc., Lake Tahoe, Nevada, vol 1, pp 1097–1105

Kuznetsova P, Ordonez V, Berg T, Choi Y (2014) TREETALK: composition and compression of trees for image descriptions. Trans ACL 2(1):351–362

Chen L, D, B Dolan W, (2011) Collecting highly parallel data for paraphrase evaluation. In: Annual meeting of the ACL: human language technologies, ACL 1:190–200

Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123CrossRef

Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE CVPR, IEEE, pp 1–8

Le DD, Phan S, Nguyen VT, Renoust B, Nguyen TA, Hoang VN, Duc Ngo T, Tran MT, Watanabe Y, Klinkigt M, Hiroike A, Duong DA, Miyao Y, Ichi Satoh S (2016) NII-HITACHI-UIT at TRECVID 2016. In: TRECVID, p 25

Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: ICML, JMLR.org, Beijing, China, vol 32, pp 1188–1196

Lee J, Lee Y, Seong S, Kim K, Kim S, Kim J (2019) Capturing long-range dependencies in video captioning. In: IEEE ICIP, IEEE, pp 1880–1884

Lei J, Wang L, Shen Y, Yu D, Berg TL, Bansal M (2020) MART: memory-augmented recurrent transformer for coherent video paragraph captioning. Ann Meet ACL, pp 2603–2614

Li G, Pan P, Yang Y (2018a) UTS\_CETC\_D2DCRC Submission at the TRECVID 2018 video to text description task. In: TRECVID

Li H, Song D, Liao L, Peng C (2019) REVnet: bring reviewing into video captioning for a better description. In: IEEE ICME, IEEE, pp 1312–1317

Li L, Gong B (2019) End-to-end video captioning with multitask reinforcement learning. In: IEEE WACV, IEEE, pp 339–348

Li X, Liao S, Lan W, Du X, Yang G (2015) Zero-shot image tagging by Hierarchical semantic embedding. In: SIGIR, ACM, pp 879–882

Li X, Dong J, Xu C, Cao J, Wang X, Yang G (2018b) Renmin University of China and Zhejiang Gongshang University at TRECVID 2018: deep cross-modal embeddings for video-text retrieval. In: TRECVID

Li Y, Song Y, Cao L, Tetreault J, Goldberg L, Jaimes A, Luo J (2016) TGIF: a new dataset and benchmark on animated GIF description. In: IEEE CVPR, IEEE, vol 2016-Decem, pp 4641–4650

Li Y, Min MR, Shen D, Carlson D, Carin L (2017) Video generation from text. arXiv:1710.00421

Lin Cy (2004) Rouge: a package for automatic evaluation of summaries. In: ACL post-conference workshop, Barcelona, Spain, pp 25–26

Lin K, Gan Z, Wang L (2020) Multi-modal feature fusion with feature attention for VATEX captioning challenge 2020

Liu S, Ren Z, Yuan J (2018) SibNet: Sibling convolutional encoder for video captioning. ACM MM. ACM, New York, NY, USA, pp 1425–1434

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: ECCV, pp 21–37

Long X, Gan C, De Melo G (2018) Video captioning with multi-faceted attention. In: Transactions of the ACL, pp 173–184

Lu J, Goswami V, Rohrbach M, Parikh D, Lee S (2020) 12-in-1: Multi-task vision and language representation learning. In: IEEE/CVF CVPR

Mahdisoltani F, Berger G, Gharbieh W, Fleet D, Memisevic R (2018a) Fine-grained video classification and captioning. CoRR abs/1804.0

Mahdisoltani F, Berger G, Gharbieh W, Fleet D, Memisevic R (2018b) On the effectiveness of task granularity for transfer learning. CoRR abs/1804.0

Manmadhan S, Kovoor BC (2020) Visual question answering: a state-of-the-art review. Artif Intell Rev 53(8):5705–5745CrossRef

Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A (2014) Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR abs/1412.6

Markatopoulou F, Moumtzidou A, Galanopoulos D, Mironidis T, Kaltsa V, Ioannidou A, Symeonidis S, Avgerinakis K, Andreadis S, Gialampoukidis I, Vrochidis S, Briassouli A, Mezaris V, Kompatsiaris I, Patras I (2016) ITI-CERTH participation in TRECVID 2016. In: TRECVID

Marsden M, Mohedano E, Mcguinness K, Calafell A, Giró-I-Nieto X, O’connor NE, Zhou J, Azevedo L, Daudert T, Davis B, Hürlimann M, Afli H, Du J, Ganguly D, Li W, Way A, Smeaton AF (2016) Dublin City University and partners’ participation in the INS and VTT tracks at TRECVid 2016. In: TRECVID

Meister S, Hur J, Roth S (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In: AAAI

Miech A, Laptev I, Sivic J (2018) Learning a text-video embedding from incomplete and heterogeneous data. CoRR abs/1804.0

Miech A, Zhukov D, Alayrac JB, Tapaswi M, Laptev I, Sivic J (2019) HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In: IEEE/CVF ICCV, IEEE, pp 2630–2640

Miech A, Alayrac JB, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: IEEE/CVF CVPR

Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS, Curran Associates Inc., vol 2, pp 3111–3119

Mithun NC, Li JB, Metze F, Roy-Chowdhury AK, Das S, Bosch R (2017) CMU-UCR-BOSCH @ TRECVID 2017: VIDEO TO TEXT RETRIEVAL. In: TRECVID

Mithun NC, Li J, Metze F, Roy-Chowdhury AK (2018) Learning joint embedding with multimodal cues for cross-modal video-text retrieval. ACM ICMR. ACM, New York, NY, USA, pp 19–27

Mithun NC, Li J, Metze F, Roy-Chowdhury AK (2019) Joint embeddings with multimodal cues for video-text retrieval. Int J Multimedia Inf Retriev 8(1):3–18CrossRef

Mun J, Yang L, Ren Z, Xu N, Han B (2019) Streamlined dense video captioning. In: IEEE CVPR

Nguyen PA, Li Q, Cheng ZQ, Lu YJ, Zhang H, Wu X, Ngo CW (2017a) VIREO @ TRECVID 2017: Video-to-text, ad-hoc video search and video hyperlinking. In: TRECVID

Nguyen T, Sah S, Ptucha R (2017b) Multistream hierarchical boundary network for video captioning. In: IEEE WNYISPW, IEEE, pp 1–5

Nina O, Garcia W, Clouse S, Yilmaz A (2018) MTLE: A multitask learning encoder of visual feature representations for video and movie description. CoRR abs/1809.0

Otani M, Nakashima Y, Rahtu E, Heikkilä J, Yokoya N (2016) Learning joint representations of videos and sentences with web image search. In: ECCV, Springer International Publishing, pp 651–667

Pan B, Cai H, Huang DA, Lee KH, Gaidon A, Adeli E, Niebles JC (2020) Spatio-temporal graph for video captioning with knowledge distillation. In: IEEE/CVF CVPR, pp 10870–10879

Pan P, Xu Z, Yang Y, Wu F, Zhuang Y (2016a) Hierarchical recurrent neural encoder for video representation with application to captioning. In: IEEE CVPR, pp 1029–1038

Pan Y, Mei T, Yao T, Li H, Rui Y (2016b) Jointly modeling embedding and translation to bridge video and language. In: IEEE CVPR, IEEE, pp 4594–4602

Pan Y, Yao T, Li H, Mei T (2017) Video captioning with transferred semantic attributes. In: IEEE CVPR, IEEE, vol 2017-Janua, pp 984–992

Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL, ACL, Morristown, NJ, USA, no. July in ACL ’02, p 311

Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. In: BMVC, British Machine Vision Association, pp 1–41

Pasunuru R, Bansal M (2017) Reinforced video captioning with entailment rewards. EMNLP. ACL, Stroudsburg, PA, USA, pp 979–985

Perez-Martin J, Bustos B, Pérez J (2020a) Attentive visual semantic specialized network for video captioning. In: ICPR

Perez-Martin J, Bustos B, Pérez J, Barrios JM (2020b) IMFD-IMPRESEE at TRECVID 2020: description generation by visual-syntactic embedding. In: TRECVID

Perez-Martin J, Bustos B, Pérez J (2021) Improving video captioning with temporal composition of a visual-syntactic embedding. In: IEEE/CVF WACV

Phan S, Henter GE, Miyao Y, Satoh S (2017a) Consensus-based sequence training for video captioning. CoRR abs/1712.0

Phan S, Klinkigt M, Nguyen VT, Mai TD, Xalabarder AG, Hinami R, Renoust B, Duc Ngo T, Tran MT, Watanabe Y, Hiroike A, Duong DA, Le DD, Miyao Y, Ichi Satoh S (2017b) NII-Hitachi-UIT at TRECVID 2017. In: TRECVID, p 18

Plummer BA, Brown M, Lazebnik S (2017) Enhancing video summarization via vision-language embedding. In: IEEE CVPR, IEEE, pp 1052–1060

Ranzato M, Chopra S, Auli M, Zaremba W (2016) Sequence level training with recurrent neural networks. In: ICLR

Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010) Collecting image annotations using amazons mechanical turk. In: NAACL HLT 2010 workshop on creating speech and language data with amazons mechanical turk. ACL, Los Angeles, California, pp 139–147

Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: IEEE CVPR, IEEE, pp 779–788

Regneri M, Rohrbach M, Wetzel D, Thater S, Schiele B, Pinkal M (2013) Grounding action descriptions in videos. Trans ACL 1:25–36

Reiter E (2018) A structured review of the validity of BLEU. Comput Linguist 44(3):393–401CrossRef

Reiter E, Dale R (2000) Building natural language generation systems. Cambridge University Press, CambridgeCrossRef

Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149CrossRef

Rijsbergen CJV (1979) Information Retrieval. Butterworth-Heinemann313 Washington Street Newton, MAUnited States

Rohrbach A, Rohrbach M, Qiu W, Friedrich A, Pinkal M, Schiele B (2014) Coherent multi-sentence video description with variable level of detail. In: Pattern Recognition, Springer International Publishing, pp 184–195

Rohrbach A, Rohrbach M, Schiele B (2015a) The Long-short story of movie description. In: Pattern Recognition, Springer International Publishing, pp 209–221

Rohrbach A, Rohrbach M, Tandon N, Schiele B (2015b) A dataset for Movie Description. In: IEEE CVPR, IEEE, vol 07-12-June, pp 3202–3212

Rohrbach A, Rohrbach M, Tang S, Oh SJ, Schiele B (2017) Generating descriptions with grounded and co-referenced people. In: IEEE CVPR

Rohrbach M, Amin S, Andriluka M, Schiele B (2012a) A database for fine grained activity detection of cooking activities. In: IEEE CVPR, IEEE, pp 1194–1201

Rohrbach M, Regneri M, Andriluka M, Amin S, Pinkal M, Schiele B (2012b) Script data for attribute-based recognition of composite activities. ECCV. Springer, Berlin, pp 144–157

Rohrbach M, Qiu W, Titov I, Thater S, Pinkal M, Schiele B (2013) Translating video content to natural language descriptions. In: IEEE ICCV, IEEE, December, pp 433–440

Rohrbach M, Rohrbach A, Regneri M, Amin S, Andriluka M, Pinkal M, Schiele B (2016) Recognizing fine-grained and composite activities using hand-centric features and script data. Int J Comput Vis 119(3):346–373MathSciNetCrossRef

Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536MATHCrossRef

Sah S, Nguyen T, Ptucha R (2019) Understanding temporal structure for video captioning. Pattern Anal Appl

Sah S, Nguyen T, Ptucha R (2020) Understanding temporal structure for video captioning. Pattern Anal Appl 23(1):147–159CrossRef

Saha TK, Joty S, Al Hasan M (2017) Con-S2V: a generic framework for incorporating extra-sentential context into Sen2Vec. In: Mach Learn Knowl Dis Databases, Springer International Publishing, pp 753–769

Schluter N (2017) The limits of automatic summarisation according to ROUGE. Conf Eur Chap ACL 2:41–45

Schroff F, Kalenichenko D, Philbin J (2015) FaceNet: a unified embedding for face recognition and clustering. In: IEEE CVPR, IEEE, vol 07–12-June, pp 815–823

Sermanet P, Eigen D, Zhang X, Mathieu M, Fergus R, LeCun Y (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. https://arxiv.org/abs/1312.6229

Shao J, Kang K, Loy CC, Wang X (2015) Deeply learned attributes for crowded scene understanding. In: IEEE CVPR, IEEE, vol 07-12-June, pp 4657–4666

Sharif N, White L, Bennamoun M, Shah SAA (2018) Learning-based composite metrics for improved caption evaluation. ACL. Student Research Workshop, ACL, pp 14–20

Shen Z, Li J, Su Z, Li M, Chen Y, Jiang YG, Xue X (2017) Weakly supervised dense video captioning. In: IEEE CVPR, pp 1916–1924

Shetty R, Laaksonen J (2016) Frame- and segment-level features and candidate pool evaluation for video caption generation. ACM MM. ACM, New York, NY, USA, pp 1073–1076

Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe B, Matas J, Sebe N, Welling M (eds) ECCV. Springer International Publishing, Amsterdam, The Netherlands, pp 510–526

Sigurdsson GA, Gupta A, Schmid C, Farhadi A, Alahari K (2018) Actor and observer: joint modeling of first and third-person videos. In: IEEE CVPR, pp 7396–7404

Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR, San Diego, CA, USA

Singh A, Singh TD, Bandyopadhyay S (2020) NITS-VC system for VATEX video captioning challenge 2020

Snoek CGM, Dong J, Li X, Wang X, Wei Q, Lan W, Gavves E, Hussein N, Koelma DC, M Smeulders AW (2016) University of Amsterdam and Renmin University at TRECVID 2016: searching video, detecting events and describing video. In: TRECVID, p 5

Snoek CGM, Li X, Xu C, Koelma DC (2017a) Searching video, detecting events and describing video. In: TRECVID

Snoek CGM, Li X, Xu C, Koelma DC (2017b) University of Amsterdam and Renmin University at TRECVID 2017: searching video, detecting events and describing video. In: TRECVID

Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2019a) From deterministic to generative: multimodal stochastic RNNS for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058CrossRef

Song Y, Zhao Y, Chen S, Jin Q (2019b) RUC\_AIM3 at TRECVID 2019: Video to text. In: TRECVID

Srivastava N, Mansimov E, Salakhutdinov R (2015) Unsupervised learning of video representations using lstms. In: ICML, JMLR.org, Lille, France, ICML ’15, vol 37, p 843-852

Srivastava Y, Murali V, Dubey SR, Mukherjee S (2019) visual question answering using deep learning: a survey and performance analysis. CoRR abs/1909.0

Sun C, Myers A, Vondrick C, Murphy K, Schmid C, Research G (2019a) VideoBERT: a joint model for video and language representation learning. In: IEEE ICCV, pp 7464–7473

Sun L, Li B, Yuan C, Zha Z, Hu W (2019b) Multimodal semantic attention network for video captioning. In: IEEE ICME, IEEE, pp 1300–1305

Szegedy C, Wei Liu, Yangqing Jia, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: IEEE CVPR, IEEE, vol 07–12-June, pp 1–9

Tang P, Wang H, Li Q (2019) Rich visual and language representation with complementary semantics for video captioning. ACM Trans Multimedia Compu Commun Appl 15(2):1–23CrossRef

Tapaswi M, Zhu Y, Stiefelhagen R, Torralba A, Urtasun R, Fidler S (2016) MovieQA: understanding stories in movies through question-answering. In: IEEE CVPR, IEEE, pp 4631–4640

Thomason J, Venugopalan S, Guadarrama S, Saenko K, Mooney R (2014) Integrating language and vision to generate natural language descriptions of videos in the wild. COLING. Dublin, Ireland, pp 1218–1227

Torabi A, Pal C, Larochelle H, Courville A (2015) Using descriptive video services to create a large data source for video annotation research. CoRR abs/1503.0

Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: IEEE/CVF CVPR, IEEE, pp 6450–6459

Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517CrossRef

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhim I (2017) Attention is all you need. NIPS. Curran Associates Inc., Long Beach, California, USA, pp 6000–6010

Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: Consensus-based image description evaluation. In: IEEE CVPR, IEEE, pp 4566–4575

Venugopalan S, Rohrbach M, Donahue J, Mooney R, Darrell T, Saenko K (2015a) Sequence to sequence – video to text. In: IEEE ICCV, IEEE, vol 2015 Inter, pp 4534–4542

Venugopalan S, Xu H, Donahue J, Rohrbach M, Mooney R, Saenko K (2015b) Translating videos to natural language using deep recurrent neural networks. In: Conference of the North American chapter of the acl: human language technologies. ACL, Stroudsburg, PA, USA, June, pp 1494–1504

Venugopalan S, Hendricks LA, Mooney R, Saenko K (2016) Improving LSTM-based video description with linguistic knowledge mined from text. EMNLP. ACL, Stroudsburg, PA, USA, pp 1961–1966

Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE CVPR, IEEE, vol 07–12-June, pp 3156–3164

Wang B, Ma L, Zhang W, Liu W (2018a) Reconstruction network for video captioning. In: IEEE CVPR, pp 7622–7631

Wang B, Ma L, Zhang W, Jiang W, Wang J, Liu W (2019a) Controllable video captioning with POS sequence guidance based on gated fusion network. In: IEEE ICCV

Wang H, Schmid C (2013) Action recognition with improved trajectories. In: IEEE ICCV, IEEE, pp 3551–3558

Wang H, Divakaran A, Vetro A, Chang SF, Sun H (2003) Survey of compressed-domain features used in audio-visual indexing and analysis. J Vis Commun Image Represent 14:150–183CrossRef

Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. In: BMVC, British Machine Vision Association, BMVA

Wang H, Klaser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: IEEE CVPR, IEEE, pp 3169–3176

Wang J, Jiang W, Ma L, Liu W, Xu Y (2018b) Bidirectional attentive fusion with context gating for dense video captioning. In: IEEE/CVF CVPR, IEEE, pp 7190–7198

Wang J, Wang W, Huang Y, Wang L, Tan T (2018c) M3: Multimodal memory modelling for video captioning. In: IEEE/CVF CVPR, IEEE, pp 7512–7520

Wang X, Chen W, Wu J, Wang YF, Wang WY (2018d) Video captioning via hierarchical reinforcement learning. In: IEEE/CVF CVPR, IEEE, pp 4213–4222

Wang X, Wang YF, Wang WY (2018e) Watch, listen, and describe: globally and locally aligned cross-modal attentions for video captioning. In: Conference of the North American chapter of the acl: human language technologies, ACL, Stroudsburg, PA, USA 2:795–801

Wang X, Jabri A, Efros AA (2019b) Learning Correspondence from the Cycle-consistency of Time. In: IEEE/CVF CVPR, pp. 2566–2576

Wang X, Wu J, Chen J, Li L, Wang YF, Wang WY (2019c) VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In: IEEE ICCV, pp 4581–4591

Wei R, Mi L, Hu Y, Chen Z (2020) Exploiting the local temporal information for video captioning. J Vis Commun Image Represent 67:102751CrossRef

Weinberger KQ, Blitzer J, Lawrence K S (2005) Distance metric learning for large margin nearest neighbor classification. In: NIPS, pp 1473–1480

Wray M, Csurka G, Larlus D, Damen D (2019) Fine-grained action retrieval through multiple parts-of-speech embeddings. In: IEEE ICCV, IEEE, pp 450–459

Wu X, Li G, Cao Q, Ji Q, Lin L (2018) Interpretable video captioning via trajectory structured localization. In: IEEE/CVF CVPR, IEEE, pp 6829–6837

Xiao H, Shi J (2019) A Novel attribute selection mechanism for video captioning. In: IEEE ICIP, IEEE, pp 619–623

Xie S, Girshick R, Dollar P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: IEEE CVPR, IEEE, vol 2017-Janua, pp 5987–5995

Xie S, Sun C, Huang J, Tu Z, Murphy K (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: ECCV, pp 305–321

Xu H, Venugopalan S, Ramanishka V, Rohrbach M, Saenko K (2015a) A multi-scale multiple instance video description network. arXiv:1505.05914

Xu H, Li B, Ramanishka V, Sigal L, Saenko K (2019a) Joint event detection and description in continuous video streams. In: IEEE WACVW, IEEE, pp 25–26

Xu J, Mei T, Yao T, Rui Y (2016) MSR-VTT: a large video description dataset for bridging video and language. 2016 IEEE CVPR pp 5288–5296

Xu R, Xiong C, Chen W, Corso JJ (2015b) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In: AAAI, pp 2346–2352

Xu Y, Yang J, Mao K (2019b) Semantic-filtered soft-split-aware video captioning with audio-augmented feature. Neurocomputing 357:24–35CrossRef

Yan C, Tu Y, Wang X, Zhang Y, Hao X, Zhang Y, Dai Q (2020) STAT: spatial-temporal attention mechanism for video captioning. IEEE Trans Multimedia 22(1):229–241CrossRef

Yang X, Zhang T, Xu C (2016) Semantic feature mining for video event understanding. ACM Trans Multimedia Comput Commun Appl 12(4):1–22CrossRef

Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y (2018) Video captioning by adversarial LSTM. IEEE Trans Image Process 27(11):5600–5611MathSciNetCrossRef

Yao L, Torabi A, Cho K, Ballas N, Pal C, Larochelle H, Courville A (2015) Describing videos by exploiting temporal structure. In: IEEE ICCV, IEEE, pp 4507–4515

Yao T, Li Y, Qiu Z, Long F, Pan Y, Li D, Mei T (2017) MSR Asia MSM at activitynet challenge 2017: trimmed action recognition. In: Temporal action proposals and dense-captioning events in videos. Tech rep, Microsoft

Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: NIPS, MIT Press, pp 3320–3328

Yu E, Gao M, Li Y, Dong X, Sun J (2017a) Shandong Normal University in the VTT Tasks at TRECVID 2017. In: TRECVID

Yu H, Siskind JM, Lafayette W (2015a) Learning to describe video with weak supervision by exploiting negative sentential information. AAAI. AAAI Press, Austin, Texas, pp 3855–3863

Yu H, Wang J, Huang Z, Yang Y, Xu W (2016) Video Paragraph captioning using hierarchical recurrent neural networks. In: IEEE CVPR, IEEE, pp 4584–4593

Yu L, Park E, Berg AC, Berg TL (2015b) Visual Madlibs: Fill in the blank description generation and question answering. In: IEEE ICCV, IEEE, vol 2015 Inter, pp 2461–2469

Yu Y, Choi J, Kim Y, Yoo K, Lee SH, Kim G (2017b) Supervising neural attention models for video captioning by human gaze data. In: IEEE CVPR, IEEE, pp 6119–6127

Yu Y, Ko H, Choi J, Kim G (2017c) End-to-end concept word detection for video captioning, retrieval, and question answering. In: IEEE CVPR, IEEE, pp 3261–3269

Yuan J, Tian C, Zhang X, Ding Y, Wei W (2018) Video captioning with semantic guiding. In: IEEE BigMM, IEEE, pp 1–5

Zeng KH, Chen TH, Niebles JC, Sun M (2016) Title generation for user generated videos. In: ECCV, Springer International Publishing, pp 609–625

Zhang B, Hu H, Sha F (2018) Cross-modal and hierarchical modeling of video and text. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) ECCV. Springer International Publishing, Cham, pp 385–401

Zhang H, Pang L, Lu YJ, Ngo CW (2016) VIREO @ TRECVID 2016: multimedia event detection, ad-hoc video search, video-to-text description. In: TRECVID

Zhang W, Wang B, Ma L, Liu W (2019a) Reconstruct and represent video contents for captioning via reinforcement learning. IEEE Trans Pattern Anal Mach Intell

Zhang X, Zhang Y, Zhang D, Li J, Qi Tian A (2017) Task-driven dynamic fusion: reducing ambiguity in video description. In: IEEE CVPR, IEEE, pp 6250–6258

Zhang Z, Xu D, Ouyang W, Tan C (2019b) Show, tell and summarize: dense video captioning using visual cue aided sentence summarization. IEEE Trans Circuits Syst Video Technol

Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z (2020) Object relational graph with teacher-recommended learning for video captioning. In: IEEE/CVF CVPR, pp 13278–13288

Zhao B, Li X, Lu X (2019) CAM-RNN: Co-Attention model based RNN for video captioning. IEEE Trans Image Process 28:5552–5565MathSciNetMATHCrossRef

Zhao Y, Song Y, Chen S, Jin Q (2020) RUC\_AIM3 at TRECVID 2020: Ad-hoc Video Search and Video to Text Description. In: TRECVID

Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A (2018a) Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1452–1464CrossRef

Zhou L, Xu C, Corso JJ (2018b) Towards automatic learning of procedures from web instructional videos. In: AAAI, Association for the Advancement of Artificial Intelligence, pp 7590–7598

Zhou L, Zhou Y, Corso JJ, Socher R, Xiong C (2018c) End-to-end dense video captioning with masked transformer. In: IEEE/CVF CVPR, IEEE, pp 8739–8748

Zhou L, Kalantidis Y, Chen X, Corso JJ, Rohrbach M (2019) Grounded video description. In: IEEE/CVF CVPR, IEEE, pp 6571–6580

Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE ICCV, IEEE, vol 2017-October, pp 2242–2251

Zolfaghari M, Singh K, Brox T (2018) ECO: efficient convolutional network for online video understanding. In: ECCV, Springer International Publishing, pp 713–730

Title: A comprehensive review of the video-to-text problem
Authors: Jesus Perez-Martin
Benjamin Bustos
Silvio Jamil F. Guimarães
Ivan Sipiran
Jorge Pérez
Grethel Coello Said
Publication date: 16-01-2022
Publisher: Springer Netherlands
Published in: Artificial Intelligence Review / Issue 5/2022
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI: https://doi.org/10.1007/s10462-021-10104-1

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 5/2022

Intuitionistic principal value Z-linguistic hybrid geometric operator and their applications for multi-attribute group decision-making

A survey on artificial intelligence techniques for chronic diseases: open issues and challenges

Fine-grained attention-based phrase-aware network for aspect-level sentiment analysis

A comparative review of graph convolutional networks for human skeleton-based action recognition

New measures of uncertainty based on the granularity distribution of approximation sets

A survey, taxonomy and progress evaluation of three decades of swarm optimisation

Premium Partner