1 Introduction
Abbreviation | Full Form | Abbreviation | Full Form |
---|---|---|---|
ACT | Adaptive Computation Time | MPII-MD | Max Planck Institute for Informatics - Movie Description |
ADs | Audio Descriptions | MSR-VTT | Microsoft Research - Video to Text |
AMT | Amazon Mechanical Turk | MSVD | Microsoft Video Description |
BAST | Bag of Aggregated Semantic Tuples | M-VAD | Montreal Video Annotation Dataset |
BERT | Bidirectional Encoder Representations from Transformers | NAVC | Non-Autoregressive Video Captioning |
BFVD | Buyer-generated Fashion Video Dataset | NLP | Natural Language Processing |
BLEU | Bilingual Evaluation Understudy | NMT | Neural Machine Translation |
BP | Brevity Penalty | NN | Neural Network |
C3D | Convolutional 3D (3D-CNN) | RNN | Recurrent Neural Network |
CGAN | Conditional Generative Adversarial Network | ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
CIDEr | Consensus-based Image Description Evaluation | ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation: Longest Common Subsequence |
CNN | Convolutional Neural Network | ROUGE-N | Recall-Oriented Understudy for Gisting Evaluation: n-gram Co-occurrence |
CV | Computer Vision | ROUGE-S | Recall-Oriented Understudy for Gisting Evaluation: Skip-bigram Co-occurrence |
DCE | Diverse Captioning Evaluation metric | ROUGE-W | Recall-Oriented Understudy for Gisting Evaluation: Weighted Longest Common Subsequence |
DRL | Deep Reinforcement Learning | SDN | Semantic Detection Network |
DVS | Descriptive Video Service | SPICE | Semantic Propositional Image Caption Evaluation |
ECN | Efficient Convolution Network | SSIM | Structural Similarity Index Measure |
ED | Encoder-Decoder | ST | Standard Transformer |
EMD | Earth Mover’s Distance | SVO | Subject Verb Object |
FFNN | Feed-Forward Neural Network | TACoS | Textually Annotated Cooking Scenes |
FFVD | Fan-generated Fashion Video Dataset | TRECVID | Text REtrieval Conference Video Retrieval Evaluation |
GAN | Generative Adversarial Network | TvT | Two Viewed Transformer |
GPU | Graphics Processing Unit | UGVs | User-Generated Videos |
GRU | Gated Recurrent Unit | UT | Universal Transformer |
HRL | Hierarchical Reinforcement Learning | VATEX | Video And Text |
LSH | Locality Sensitive Hashing | VCR | Visual Commonsense Reasoning |
LSMDC | Large Scale Movie Description Challenge | VQA | Visual Question Answering |
LSTM | Long Short-term Memory | VTW | Video Titles in the Wild |
METEOR | Metric for Evaluation of Translation with Explicit Ordering | WMD | Word Mover’s Distance |
MLE | Maximum Likelihood Estimation | WN | WordNet |
MPII | Max Planck Institute for Informatics | XE | Cross Entropy |
1.1 Classical approach
1.2 Video captioning
1.3 Dense video captioning/video description
2 Literature review
References | Deep Learning-based Techniques | |||
---|---|---|---|---|
Std ED (Standard Encoder–Decoder) | AM (Attention Mechanism) | RL (Reinforcement Learning) | TM (Transformer Mechanism) |
This Research | ✓ | ✓ | ✓ | ✓ |
Aafaq et al. (2019b) | ✓ | ✗ | ✗ | ✗ |
Wang et al. (2020) | ✗ | ✓ | ✗ | ✗ |
Li et al. (2019a) | ✓ | ▼ | ▼ | ✗ |
Chen et al. (2019b) | ✓ | ▼ | ▼ | ✗ |
Aafaq et al. (2019c) | ✓ | ▼ | ▼ | ✗ |
Amaresh and Chitrakala (2019) | ▼ | ▼ | ✗ | ✗ |
Su (2018) | ✓ | ▼ | ✗ | ✗ |
Park et al. (2018) | ▼ | ✗ | ✗ | ✗ |
Wu (2017) | ✓ | ▼ | ✗ | ✗ |
3 Techniques/approaches
3.1 Standard encoder–decoder approaches
S/N | References | Year | Approach | Model (Visual) | Model (Language) |
---|---|---|---|---|---|
A. CNN-RNN | |||||
1 | Gao et al. (2022) | 2022 | vc-HRNAT (Video Captioning - Hierarchical Representation Network with Auxiliary Tasks) | CNN | LSTM |
Contributions: An end-to-end framework utilizing hierarchical representation learning and auxiliary tasks in a self-supervised manner. The framework is capable of learning multi-level semantic representations of video concepts. | | | | |
Shortcomings: Visualization of absent or ambiguous concepts of objects and actions in videos. | | | | |
2 | Seo et al. (2022) | 2022 | MV-GPT (Multimodal Video Generative Pretraining) | ViViT (Arnab et al. 2021) based visual encoder, BERT-based text encoder | Modified GPT-2 based decoder |
Contributions: A jointly trainable Encoder–Decoder model in which manually annotated captions are no longer required; instead, utterances at different time steps of the same video can be utilized. The encoder is trained directly from the pixels and words. | | | | |
Shortcomings: Owing to its pre-training, the system suffers performance degradation for inputs from a different domain. | | | | |
3 | Aafaq et al. (2022) | 2022 | VSJM-Net (Visual-Semantic Joint Embedding Network) | 2D CNN | Vanilla Transformer |
Contributions: In the proposed Visual-Semantic Joint Embedding Network, the Visual-Semantic Embedding (ViSE) jointly learns the visual and semantic space while detecting proposals. A Video Level Sequence Encoder (VLSE) detects event boundaries across frames in a given video. The ViSE embeddings are transformed into descriptor vectors with a Hierarchical Descriptor Transformer (HDT). The transformed features are used in the proposal generation network along with linguistic information. | | | | |
Shortcomings: NA | |||||
4 | Madake (2022) | 2022 | Dense Video Captioning | EfficientNetB7 | bi-LSTM + LSTM |
Contributions: Event detection using information from future and past contexts in the video. The EfficientNetB7 neural network is used for visual feature extraction; a Bi-LSTM and an LSTM are employed for caption generation. | | | | |
Shortcomings: As video length increases from a few seconds to several minutes, the BLEU and METEOR scores decrease, since it is hard for an LSTM to learn long-term dependencies. | | | | |
5 | Hammoudeh et al. (2022) | 2022 | Soccer Captioning | ConvNet (CNN-img, CNN-flow, and CNN-vae) | Transformer |
Contributions: A dataset of 22k video-caption pairs, along with features extracted from 500 hours of SoccerNet videos, and a model employing semantic-related losses while captioning soccer actions. | | | | |
Shortcomings: NA | |||||
6 | Zhao et al. (2022) | 2022 | Transformer-LSTM-RL | ResNet-152, ResNeXt-101, ViT | LSTM |
Contributions: Used an encoder composed of Transformer Encoder blocks to encode video features in a global view, thereby reducing the loss of intermediate hidden-layer information. Further, introduced the Policy Gradient reinforcement learning method to improve the accuracy of the model. | | | | |
Shortcomings: In the video captioning task, the collection and labeling of training data often consume considerable manpower and material resources. | | | | |
7 | Perez-Martin et al. (2021a) | 2021 | Attentive Visual Semantics Specialized Network (AVSSN) | 2D/3D CNN | LSTM |
Contributions: Proposed specialized visual and semantic LSTM layers, along with an Adaptive Attention Gate for integrating different temporal representations into the decoder. The model is capable of accurately capturing visual and semantic context representations. | | | | |
Shortcomings: The model is unable to capture multiple events. | | | | |
8 | Zheng et al. (2020) | 2020 | Syntax-Aware Action Targeting (SAAT) | Extractor encoder (Cxe) | Extractor decoder (Cxd) LSTM |
Contributions: An action-guided captioner models relationships among video objects and dynamically fuses information from the predicate and previously generated words. | | | | |
Shortcomings: Global temporal information captured by 3D CNN is not enough to learn finer actions. | |||||
9 | Chen et al. (2020) | 2020 | VNS-GRU (Decoder with variational dropout and layer normalization, professional learning strategy) | CNN (ResNeXt-101) | Semantic GRU with variational dropout and layer normalization |
Contributions: Variational dropout and layer normalization are combined in the decoder to prevent overfitting and sustain convergence speed. A professional learning training strategy is adopted for efficient model training. | | | | |
Shortcomings: The professional learning training strategy, used to train the model by optimizing losses, needs further experimentation. | | | | |
10 | Hou et al. (2019) | 2019 | Joint Syntax Representation Learning and Visual Cue Translation (JSRL-VCT) | CNN (C3D, ResNet, Inception) | POS Tag Generator |
Contributions: An end-to-end trainable network capable of capturing the syntactic structure of sentences via video POS tagging and perceiving intrinsic semantic primitives. The word bias problem caused by imbalanced classes is also addressed. | | | | |
Shortcomings: Evaluation on B@4 is not remarkable, owing to B@4's lexical rather than syntactic basis for matching. | | | | |
11 | Chen et al. (2019a) | 2019 | Semantic detection network (SDN), Semantic Compositional Network (SCN) & SDN trained with scheduled sampling | 2D-CNN for static features, 3D-CNN for spatio-temporal features | Semantic Compositional Network (SCN), a variant of LSTM |
Contributions: A semantic-assisted captioning model with scheduled sampling to bridge the gap between training and testing in the Teacher Forcing algorithm. Sentence-length-modulated loss function is also proposed to keep the model in a balance between language redundancy and conciseness. The proposed SDN (Semantic Detection Network) extracts high-quality semantic features for video. | |||||
Shortcomings: NA | |||||
12 | Aafaq et al. (2019a) | 2019 | GRU–Enriched Visual Encoding (EVE) with hierarchical Fourier transform | 2D/3D CNN | 2-layered GRU |
Contributions: A visual encoding technique that effectively encapsulates the spatio-temporal dynamics of videos and embeds relevant high-level semantic attributes in the visual codes for video captioning. Uses a hierarchical Fourier Transform to capture the temporal dynamics of videos. | | | | |
Shortcomings: NA | |||||
13 | Olivastri (2019) | 2019 | End-to-end network-Inception ResNet V2 (EtENet-IRv2) | CNN (Inception ResNet V2, GoogLeNet) | LSTM with soft attention (SA-LSTM) |
Contributions: An end-to-end trainable framework designed to learn task-specific features based on a two-stage training strategy. At the first stage, the weights of the pre-trained encoder are frozen while training the decoder, resulting in a low memory requirement and fast execution. At the second stage, the whole network is trained end-to-end while freezing the batch normalisation layer. | | | | |
Shortcomings: Evaluation on BLEU is not remarkable, as the BLEU score lacks explicit word matching between translation and reference. Training a deep neural network end-to-end requires significant computational resources. | | | | |
14 | Zhang et al. (2019a) | 2019 | Object-aware aggregation with bidirectional temporal graph (OA-BTG) | Convolutional gated recurrent unit (C-GRU) | GRU with attention |
Contributions: A video captioning approach based on object-aware aggregation with a bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for the salient objects in a video via a bidirectional temporal graph, and learns discriminative spatio-temporal video representations by performing object-aware local feature aggregation on object regions. | | | | |
Shortcomings: Modeling salient objects with their trajectories, along with the interactions and relationships among objects, is required for accurate description generation of actions. | | | | |
15 | Liu et al. (2020) | 2018 | SibNet (sibling convolutional encoder for video captioning) | CNN (content and semantic branches) | RNN |
Contributions: A dual-branch architecture composed of a content (visual) branch and a semantic branch: the content branch encodes salient visual content information, while the semantic branch encodes high-level semantic information under the guidance of ground-truth captions brought by visual-semantic joint embedding. The content branch, the semantic branch, and the decoder are trained jointly by minimizing the proposed loss function. A TCB (Temporal Convolution Block) is proposed, providing more efficient video temporal encoding than an RNN with fewer parameters. | | | | |
Shortcomings: NA | |||||
16 | Lee and Kim (2018) | 2018 | SeFLA (Semantic feature learning and attention-based caption generation) | 2D/3D CNN for visual feature extraction, LSTM for semantic feature extraction | LSTM |
Contributions: Semantic Feature Learning and Attention-Based Caption Generation for effective video captioning, utilizing both visual and semantic (dynamic and static) features. | | | | |
Shortcomings: Relatively inefficient when predicting consecutive words, demonstrating the model's ineffectiveness in generating prepositional and postpositional particles. Low BLEU@2, BLEU@3, and BLEU@4 scores. | | | | |
17 | Wang et al. (2018a) | 2018 | Reconstruction Network (RecNet) | CNN (Inception-V4) | LSTM with temporal attention |
Contributions: RecNet, with an encoder-decoder-reconstructor architecture for video captioning, exploits the bidirectional cues (video to sentence, i.e., forward, and sentence to video, i.e., backward) between the natural language description and the video content. Video global and local structures are restored by customized reconstructor networks. The forward likelihood and backward reconstruction losses are jointly modeled to train the proposed network. | | | | |
Shortcomings: RecNet-global under-performed compared to RecNet-local, owing to the temporal dynamic modeling and the use of mean pooling for video representation reproduction. This simple temporal attention mechanism cannot capture the internal relationships of key information (Ji et al. 2022). | | | | |
18 | Pan et al. (2017) | 2017 | Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA) | 2D/3D CNN | LSTM with high-level semantic attributes |
Contributions: Proposal of LSTM-TSA for addressing the issue of exploiting the mutual relationship between video representations and attributes for boosting video captioning. A transfer unit is designed to dynamically control the impacts of semantic attributes from the two sources (images and videos) on sentence generation. | |||||
Shortcomings: NA | |||||
19 | Shen et al. (2017) | 2017 | Lexical-FCN (lexical fully convolutional neural network) | CNN (Lexical-FCN model) | LSTM |
Contributions: Dense video captioning by weakly supervised learning, utilizing only video-level sentence annotations. The proposed approach models visual cues with the Lexical-FCN, discovers region sequences with submodular maximization, and decodes language outputs with sequence-to-sequence learning. | | | | |
Shortcomings: The evaluator network needs strengthening for result comparison with the oracle. The diversity score is slightly worse than the best of the clustered ground-truth sentences. | | | | |
20 | Zhang et al. (2017) | 2017 | Task-driven data fusion (TDDF) | 2D/3D CNN (VGG-19, GoogLeNet, C3D) | TDDF-based LSTM |
Contributions: To reduce ambiguity in video description, the proposed system adaptively chooses different fusion patterns according to the task status. The dynamic fusion model can attend to the visual cues that are most relevant to the current word. Appearance-centric, motion-centric, and correlation-centric fusion patterns are designed to support the recognition of visual entities. | | | | |
Shortcomings: The system failed to describe animation films due to different description context information during training and testing. | |||||
21 | Lowell et al. (2014) | 2015 | Translating videos into natural language using deep recurrent neural networks | CNN (Caffe) | LSTM with transfer learning |
Contributions: An end-to-end deep model for video-to-text generation that simultaneously learns a latent meaning state and a fluent grammatical model of the associated language. | | | | |
Shortcomings: The model is trained and evaluated on random frames from the video, not necessarily a key frame or the most representative frame. Moreover, training on images alone does not directly perform well on video frames, and a better representation is required to learn from videos. | | | | |
22 | Rivera-Soto and Ordóñez (2013) | 2013 | Sequence-to-sequence models for generating video captions | ResNet-50 Wang et al. (2018c), VGG16, LSTM | LSTM |
Contributions: In the sequence-to-sequence model, a pre-trained convolutional network extracts visual features from video frames, which are fed to an LSTM encoder for encoding; an LSTM decoder is employed to generate the natural language description. | | | | |
Shortcomings: The responsibility of both encoding the input features and decoding the natural language description in one set of weights complicates the convergence of the network. | |||||
23 | Yan et al. (2010) | 2010 | Crowd Video Captioning (CVC) | 2D/3D CNN (Inception, ResNet, C3D) | LSTM/GRU (S2VT) |
Contributions: The proposed model aims to generate captions for crowd videos, i.e., describing the off-site audience or visitor crowds. Created a dataset based on WorldExpo'10. | | | | |
Shortcomings: Small dataset with simple captions. The number of videos and the complexity of captions needs to be increased in the dataset. | |||||
B. RNN-RNN | |||||
24 | Zhang et al. (2021) | 2021 | RCG (Retrieve-Copy-Generate) | Bi-directional LSTM | att-LSTM + Lang LSTM |
Contributions: End-to-end trainable Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively, and a copy-mechanism generator is introduced to extract expressions from multi-retrieved sentences dynamically. | |||||
Shortcomings: NA | |||||
25 | Xiao and Shi (2019z) | 2019 | Diverse Captioning Model (DCM), Conditional GAN (a CNN as a generator/discriminator) | Bi-directional LSTM | Stacked LSTM |
Contributions: An efficient model for generating accurate descriptions with the aim to be consistent with human behaviour. A conditional GAN is proposed to explore the diversity of the descriptions. Diverse Captioning Evaluation (DCE) is also proposed to evaluate not only the differences among captions but also consider the rationality of the generated descriptions. | |||||
Shortcomings: Better accuracy assessment method required to decrease the gap between DCM and Ground Truth under DCE evaluation. | |||||
26 | Babariya and Tamaki (2020) | 2019 | Object Attention, Meaning (OAM)-guided LSTM ED model + metric learning | LSTM | LSTM with attention |
Contributions: The proposed approach can describe objects detected by object detection and generates captions whose meaning is similar to the correct captions. | | | | |
Shortcomings: Object detector errors (mistaken identification of objects) directly affect the encoder. | | | | |
27 | Wang et al. (2019a) | 2019 | GFN-POS (Controllable video captioning with POS sequence guidance based on a gated fusion network) | LSTM temporal encoder | 2-layer LSTM |
Contributions: A gated fusion network incorporating multiple feature streams, and a POS sequence generator predicting the global syntactic POS information of the generated sentence. Also proposed a cross gating (CG) strategy to effectively encode and fuse different representations. The global syntactic POS information is adaptively and dynamically incorporated into the decoder to guide it to produce descriptions that are more accurate in terms of both syntax and semantics. | | | | |
Shortcomings: Underperformed on the BLEU@4 score, owing to mainly optimizing the CIDEr metric with reinforcement learning. | | | | |
28 | Hammad et al. (2019) | 2019 | Effects and interaction of multi-modal features, seq-to-seq video description | Stacked LSTM | Stacked LSTM with attention |
Contributions: The model is based on S2VT (Sequence - Video to Text) and focuses on characterizing the impact of utilizing features from pre-trained models for video captioning. 2D object recognition features, scene recognition features, 3D action recognition features, audio features, and object recognition intermediate features are employed for abstract information about the different objects in the frame and their relations. | | | | |
Shortcomings: Concatenation techniques for size reduction of the multi-modal input data, along with increasing the model's capacity by adding more LSTM nodes and better regularization techniques, should be investigated. | | | | |
29 | Zhao et al. (2018) | 2018 | Tube features (Faster-RCNN) object detection + feature extraction + LSTM-based ED with attention | Bi-directional LSTM | LSTM with attention |
Contributions: A video caption generator conditioned on tube features, where tubes are formed by object trajectories. Each object tube is constructed from Faster-RCNN-detected objects and their corresponding regions in different frames. The edge between each pair of bounding boxes in adjacent frames is labeled with a similarity score. A bidirectional LSTM captures the dynamic information by encoding each tube. | | | | |
Shortcomings: Restricted performance due to the visual input to the LSTM being just the average pooling of frame features. | | | | |
30 | Donahue et al. (2017) | 2017 | Long-term Recurrent Convolutional Network (LRCN) | LSTM | LSTM with CRF (max or probabilities) |
Contributions: An end-to-end LRCN (Long-term Recurrent Convolutional Network), a class of recurrent-convolutional architectures for visual recognition and description that combines convolutional layers with long-range temporal recursion. The proposed model targets video activity recognition, image caption generation, and video description tasks. The recurrent convolutional models are doubly deep, in that they learn compositional representations in space and time. | | | | |
Shortcomings: NA | |||||
31 | Wang and Song (2017) | 2017 | S2VTK (S2VT with knowledge) | LSTM | LSTM |
Contributions: A video captioning approach aimed at fusing knowledge base information with frame features of the video. An LSTM-based caption generator is trained by maximizing the probability of the correct caption given a video. | | | | |
Shortcomings: Only BLEU and METEOR scores are reported. The proposed model is not evaluated for CIDEr and ROUGE scores. | |||||
32 | Venugopalan et al. (2015) | 2015 | S2VT (end-to-end sequence-to-sequence stacked LSTM) | Stacked LSTM | Stacked LSTM |
Contributions: A pioneering sequence-to-sequence model in video captioning. The proposed model learns to map a sequence of frames to a sequence of words directly. Optical flow is computed to model the temporal aspects of events in the video. | | | | |
Shortcomings: The model is evaluated only on the METEOR score (a single evaluation metric cannot guarantee the algorithm's superiority). | | | | |
33 | Cho et al. (2014) | 2014 | RNN ED for machine translation from English to French | RNN (translation model) | RNN |
Contributions: The proposed RNN Encoder–Decoder framework, with a hidden unit that adaptively remembers and forgets, is evaluated on the task of NMT (Neural Machine Translation) from English to French. | | | | |
Shortcomings: Enhanced performance could be achieved by using a neural network language model. | | | | |
C. CNN-CNN | |||||
34 | Chen et al. (2019b) | 2019 | TDConvED (CNN for both encoding and decoding with temporal attention) | CNN (VGGNet, ResNet) | CNN (VGGNet, C3D, ResNet) |
Contributions: The system contributed a fully convolutional sequence-learning architecture relying on a CNN-based encoder and decoder for video captioning. Moreover, it explored temporal deformable convolutions and a temporal attention mechanism to extend and utilize temporal dynamics across frames/clips. | | | | |
Shortcomings: NA |
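The CNN–RNN entries above share a common skeleton: a pretrained 2D/3D CNN encodes sampled frames into feature vectors, which condition an RNN language decoder that emits one word per step and is trained with cross-entropy (XE) under teacher forcing. The following minimal PyTorch-style sketch illustrates this standard ED skeleton; the class name, dimensions, and mean-pooling choice are illustrative assumptions, not drawn from any single surveyed model.

```python
import torch
import torch.nn as nn

class CNNEncoderLSTMDecoder(nn.Module):
    """Minimal CNN-RNN video captioner: pooled frame features initialize an LSTM."""
    def __init__(self, feat_dim=2048, hidden_dim=512, embed_dim=300, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)  # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # logits over the vocabulary

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim) from a pretrained 2D/3D CNN
        # captions:    (B, T_words) token ids, teacher-forced during training
        video = self.feat_proj(frame_feats.mean(dim=1))   # mean-pool over time
        h0 = video.unsqueeze(0)                           # init hidden state with video
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)                           # (B, T_words, vocab_size)

# XE (MLE) training step, as in the teacher-forced models catalogued above
model = CNNEncoderLSTMDecoder()
feats = torch.randn(2, 20, 2048)                 # 2 videos, 20 sampled frames each
caps = torch.randint(0, 10000, (2, 12))          # tokenized reference captions
logits = model(feats, caps[:, :-1])              # predict the next word at each step
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), caps[:, 1:].reshape(-1))
```

The RNN–RNN and CNN–CNN variants in the table swap the encoder (an LSTM over frame features) or the decoder (temporal convolutions) into this same skeleton.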
3.1.1 CNN–RNN
3.1.2 RNN–RNN
3.1.3 CNN–CNN
3.2 Discussion - ED based approaches
3.3 Attention mechanism
S/N | References | Year | Approach | Model (Visual) | Model (Language) |
---|---|---|---|---|---|
1 | Ji et al. (2022) | 2022 | ADL (Attention-based Dual Learning) | Inception-V4 | LSTM + MHDPA (multi-head dot product attention) |
Contributions: An attention-based dual learning approach (ADL) that narrows the semantic gap between raw videos and generated captions by minimizing the differences between the reproduced and the raw videos, thereby enhancing the quality of the generated video captions. | | | | |
Shortcomings: NA | |||||
2 | Peng et al. (2021) | 2021 | Global text combined with local attention enhancement (T-DL) | 2D/3D CNN, global attention | GRU |
Contributions: Proposed extraction of 2D/3D video features with a bidirectional time flow, global image attention, and a training method with local attention focusing on the important words of the text. Global dynamic attention is added to text training so that the description text is generated with reference to the context during training. | | | | |
Shortcomings: NA | |||||
3 | Ryu et al. (2021) | 2021 | SGN (Semantic Grouping Network) | 2D/3D CNN | LSTM + Semantic Attention |
Contributions: The proposed network encodes the video into semantic groups, in terms of relevant frames and the corresponding word phrases of the partially decoded caption, and adaptively decodes the next word based on the semantic groups. Moreover, a Contrastive Attention (CA) loss is proposed to provide labor-free supervision for the correct visual-textual alignment within each semantic group. | | | | |
Shortcomings: SGN's repeated grouping process reduces the inference speed by about 25%. | | | | |
4 | Chen et al. (2021) | 2021 | Scan2Cap | Mask R-CNN for 2D-3D projection, VoteNet for 3D-2D projection | Fusion GRU |
Contributions: An end-to-end trainable model capable of detecting and describing 3D objects and their relationships in RGB-D scans. | | | | |
Shortcomings: Difference in viewpoint, limited field of view and motion blur can cause poor performance. | |||||
5 | Chen and Jiang (2021) | 2021 | EC-SL (Event Captioner-Sentence Localizer) | C3D, ISAB employing multi-head attention | bi-LSTM |
Contributions: Integrates the temporal localization and description of events in untrimmed videos under the weakly supervised setting, where temporal boundary annotations are not available. Creates information communication channels between the tasks for better bridging and unification. | | | | |
Shortcomings: Without external training data, the concept learner cannot accurately detect concepts that are visually small, and it still suffers from the long-tail issue. | | | | |
6 | Perez-Martin et al. (2021b) | 2021 | Visual-Semantic-Syntactic Aligned Network (SemSynAN) - Temporal Attention based on Soft Attention | 2D/3D CNN | LSTM |
Contributions: Created visual-syntactic embeddings by exploiting the Part-of-Speech (POS) templates of video descriptions. The learning process is based on a match-and-rank strategy and ensures that videos and their corresponding captions are mapped close together in the common space. The input video is then mapped to the desired visual-syntactic embedding while producing features for the decoder. The proposed video captioning model integrates global semantic and syntactic representations of the input video, learning how to combine visual, semantic, and syntactic information in pairs. | | | | |
Shortcomings: NA | |||||
7 | Xu et al. (2020) | 2020 | Temporal-spatial and channel attention | Inception-V3 | LSTM |
Contributions: A video description model based on temporal-spatial and channel attention. The model fully utilizes the essential characteristics of the CNN and adds channel features into the attention mechanism, so it can use visual features more effectively and ensure consistency between visual features and sentence descriptions, enhancing the effect of the proposed model. | | | | |
Shortcomings: The model cannot give the correct word after the article "a." This may be due to the attention mechanism's inability to model abstract nouns that have no specific visual expression. Similarly, since articles like "a" are not very relevant to vision, all regions in the video are treated equally, leaving no salient regions. | | | | |
8 | Zhang et al. (2020) | 2020 | Spatial-Temporal attention | Appearance features: InceptionResNetV2 pretrained on ImageNet; Motion features: C3D-based ResNeXt-101 pretrained on Kinetics-400; Object features: ResNeXt-101-based Faster-RCNN pretrained on MSCOCO | 2-layered LSTM with a temporal-spatial attention module |
Contributions: The proposed model generates video descriptions with the assistance of an original training strategy. A learnable object relational graph is created to fully explore the spatial and temporal relationships between objects; object representations can be enhanced during relational reasoning. Partial and complete relational graphs are explored in this study. Teacher-enforced learning is also introduced to enhance the quality of the generated captions. | | | | |
Shortcomings: NA | |||||
9 | Yan et al. (2020) | 2020 | Spatial-Temporal Attention (STAT) | CNN (GoogleNet) + R-CNN (Faster R-CNN) | LSTM |
Contributions: A syntax-aware module is proposed that forms a self-attended scene representation to model the relationship among video objects and then decodes syntax components by setting different queries, targeting the action in video clips. An action-guided captioner that learns an attention distribution to dynamically fuse the information from the predicate and previously predicted words, avoiding wrong-action prediction in generated captions. | |||||
Shortcomings: The global temporal information provided by 3D CNNs is not always enough to learn finer actions in video clips. | |||||
10 | Gao et al. (2020) | 2020 | Fused GRU with Semantic-Temporal Attention (STA-FG) | 2D/3D CNN | GRU |
Contributions: An end-to-end framework incorporating high-level visual concept prediction into the CNN-RNN approach for video captioning. Nouns and verbs from the training sentences are used as concepts while training a multi-label CNN. Both low-level visual features and high-level semantic representations are fused, and a semantic and temporal attention mechanism in a fused GRU network is proposed for accurate video captioning. | | | | |
Shortcomings: CIDEr and ROUGE scores are not computed. Only BLEU and METEOR scores are demonstrated during the model evaluation. | |||||
11 | Liu et al. (2020) | 2020 | Soft Attention (SibNet) | GoogleNet + Inception | LSTM |
Contributions: A dual-branch architecture composed of a content (visual) branch and a semantic branch: the content branch encodes salient visual content information, while the semantic branch encodes high-level semantic information under the guidance of ground-truth captions brought by visual-semantic joint embedding. The content branch, the semantic branch, and the decoder are trained jointly by minimizing the proposed loss function. A TCB (Temporal Convolution Block) is proposed, providing more efficient video temporal encoding than an RNN with fewer parameters. | | | | |
Shortcomings: NA | |||||
12 | Pramanik et al. (2019) | 2020 | Self-Attention (OmniNet) | ResNet-152 | Transformer with a two-step attention mechanism (spatial & temporal) |
Contributions: Proposed an extended transformer towards a unified architecture, enabling a single model to support tasks with multiple input modalities and asynchronous multi-task learning. Image, text, and video peripherals are described as a direct conjunction of spatial and temporal phenomena in this work. The proposed model can process and store spatio-temporal representations for each of the input domains and then decode predictions across a multitude of tasks. | | | | |
Shortcomings: NA | |||||
13 | Yan et al. (2019) | 2019 | Multi-Granular Attn (GLMGIR) | VGG16 + Faster R-CNN | LSTM |
Contributions: Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR) for the fine-grained team-sports auto-narrative task. The framework features two key components. First, a multi-granular interaction modeling module is proposed to extract among-subjects' interactive actions in a progressive way, encoding both intra- and inter-team interactions. Second, based on these multi-granular representations, a dense multi-granular attention module is developed to handle spatio-temporal granular feature selection for generating action or event descriptions at multiple spatio-temporal resolutions. The outputs of both modules feed a decoding network, which generates the final descriptions. | | | | |
Shortcomings: The authors find that granularity is itself an important threshold: if the chosen granularity is too big, background noise can jeopardize the learned representations; if too small, it cannot provide enough information for generating useful descriptions. A wise selection of the granularity threshold is required. | | | | |
14 | Zhou et al. (2019) | 2019 | Grounded video description | Feature: ResNeXt-101 + LSTM | LSTM |
Contributions: The ActivityNet Entities dataset is created, grounding video descriptions at the noun-phrase level to bounding boxes. With this bounding-box supervision, a grounded video description model is proposed. | | | | |
Shortcomings: Extra context and region interactions introduced by self-attention confuse the region attention module; without any grounding supervision, it fails to properly attend to the right region. | | | | |
15 | Chen and Jiang (2019) | 2019 | Motion Guided Spatial Attention (MGSA) | Static features: CNN (GoogleNet, Inception-ResNet-V2); Motion information: C3D | LSTM |
Contributions: A novel video captioning framework, Motion Guided Spatial Attention (MGSA), which incorporates optical flow to guide spatial attention. Introduced recurrent relations between consecutive spatial attention maps, which boost captioning performance, and designed a recurrent unit called the Gated Attention Recurrent Unit (GARU) for this purpose. | | | | |
Shortcomings: NA | |||||
16 | Gao et al. (2019) | 2018 | Hierarchical LSTM with Adaptive Attention (hLSTMat) | CNN (ResNet-152 He et al. (2016)) | Hierarchical LSTM |
Contributions: The proposed hLSTMat framework, with the representation enrichment ability of LSTMs, automatically decides when and where to use visual information, and when and how to adopt the language model to generate the next word for visual captioning. Spatial and temporal attention decides where to look at visual information, while adaptive attention decides when to rely on language context information. At each time step, low-level visual information and high-level language context information are obtained and fused through the hierarchical LSTMs; connected sequentially, the second LSTM refines the output of the first. | | | | |
Shortcomings: Increased number of parameters and training time for the two streams of hierarchical LSTMs. | | | | |
17 | Chen et al. (2018b) | 2018 | Spatiotemporal Attention | CNN (VGG16) | LSTM |
Contributions: Visual saliency information is utilized for arranging visual concepts in frames and paying attention to the informative frames in the video. As salient objects point towards important and dominant visual concepts, a saliency-based spatiotemporal attention mechanism for video captioning is proposed. The model is capable of accurately aligning the visual information with the predicted words to form diverse captions. | | | | |
Shortcomings: There is a gap between the proposed model and other methods when evaluating METEOR and CIDEr scores, for two main reasons: the visual feature extractors used are weak, and fragment-level features are not taken into consideration. | | | | |
18 | Wang et al. (2018c) | 2018 | Hierarchically aligned cross-modal attention (HACA) | Image: ResNet, Audio: VGGish | LSTM |
Contributions: The proposed model learnt the attentive representations of multiple modalities along with the alignment and fusion of local and global contexts for video understanding and video captioning tasks. Deep audio and visual features are employed for description generation. | |||||
Shortcomings: NA | |||||
19 | Yu (2017) | 2018 | Gaze Encoding Attention Network (GEAN) | Scene: GoogleNet, Motion: C3D, Fovea: GoogleNet | GRU |
Contributions: This work studies the effect of supervision by human gaze data on attention mechanisms, particularly for video captioning. A dataset of movie clips with multiple annotations and human gaze-tracking labels is created. The proposed GEAN model efficiently incorporates spatial attention from the gaze prediction model with temporal attention in the language decoder. | | | | |
Shortcomings: NA | |||||
20 | Li et al. (2019b) | 2018 | Residual attention (Res-ATT) | Static: GoogLeNet, ResNet; Motion: C3D | LSTM |
Contributions: Res-ATT, an attention-based model considering sentence-internal information, which usually gets lost in the transmission process. Integration of residual mapping into a hierarchical LSTM network is proposed to solve the degradation problem. | | | | |
Shortcomings: NA |
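Most rows in this table are variants of soft attention: at each decoding step, the decoder state scores every frame (or region) feature, the scores are softmax-normalized, and the weighted sum forms a context vector for word prediction. A sketch of additive (Bahdanau-style) temporal attention under generic dimension assumptions follows; the module and parameter names are illustrative, not taken from any surveyed model.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    """Additive attention over frame features: the decoder state selects
    which frames matter for the next word, as in the surveyed SA-LSTM variants."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)      # score frame features
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)  # score decoder state
        self.v = nn.Linear(attn_dim, 1)                  # scalar energy per frame

    def forward(self, frame_feats, dec_hidden):
        # frame_feats: (B, T, feat_dim); dec_hidden: (B, hidden_dim)
        energy = self.v(torch.tanh(
            self.w_feat(frame_feats) + self.w_hidden(dec_hidden).unsqueeze(1)
        ))                                               # (B, T, 1)
        alpha = torch.softmax(energy, dim=1)             # per-frame attention weights
        context = (alpha * frame_feats).sum(dim=1)       # (B, feat_dim) context vector
        return context, alpha.squeeze(-1)

attn = SoftTemporalAttention()
context, weights = attn(torch.randn(2, 20, 2048), torch.randn(2, 512))
```

Spatial, channel, and semantic attention follow the same pattern, differing only in the axis (regions, feature channels, or detected concepts) over which the weights are normalized.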
3.4 Discussion—attention based approaches
3.5 Transformer mechanism
S/N | References | Year | Approach | Contributions | Shortcomings |
---|---|---|---|---|---|
1 | Wu (2022) | 2022 | NSVA (Vision Transformer + TimeSformer, Transformer Decoder) | An identity-aware NBA dataset for sports video analysis (NSVA) built on web data. Design of a unified approach to process raw videos into a stack of meaningful features with minimum labelling effort, demonstrating cross-modeling with a transformer architecture. | The bottleneck of the proposed model is player identification. |
2 | Im and Choi (2022) | 2022 | UAT: FEGs (feature extraction gates) + UEA (universal encoder attraction) | An end-to-end learnable transformer for video captioning. Proposed a feature extraction gate (FEG) that builds better features by fusing the CLS token and patch sequences, along with universal encoder layer attention (UEA) constructed to obtain more information from one feature type. | NA |
3 | Yuan et al. (2022) | 2022 | Vanilla Transformer + CMF (Cross-modal fusion) module | An enhanced 3D dense captioning method employing cross-modal knowledge transfer using a Transformer. Proposed captioning through knowledge distillation, enabled by a teacher-student framework. | Owing to the limited views of a single image, the performance of image-based dense captioning methods degrades significantly when directly transferred to 3D scenarios. |
4 | Vo et al. (2022) | 2022 | NOC-REK: pre-trained BERT to embed the definition of each word into the embedding | An end-to-end NOC-REK model that retrieves vocabulary from external knowledge and generates captions using shared-parameter transformers. | All potential objects may not be represented while extracting ROIs from a given image. In the case of a heavy vocabulary, training both image features and vocabulary embeddings is required. |
5 | Wang et al. (2021) | 2021 | PDVC (parallel decoding video captioning) | An end-to-end framework formulating dense video captioning as a parallel set prediction task, significantly simplifying the traditional captioning pipeline. An event counter to estimate the number of events in a video is introduced. | Employment of a transformer captioner is needed for high performance. |
6 | Estevam et al. (2021) | 2021 | BMT-V+Sm (Bi-Modal Transformer with visual and semantic descriptor) | Proposed an unsupervised descriptor that can be easily employed in video understanding tasks and adequately captures the visual similarity between seen and unseen clips. Visual similarity is employed to generate event proposals, replacing the audio signal. | Slightly lower performance when using the semantic descriptor instead of the audio modality, compared with BMT (Bi-Modal Transformer). |
7 | Liu et al. (2021) | 2021 | O2NA (Object-Oriented Non-Autoregressive Approach) | A controlled-content video captioning approach focused on practical value rather than syntactical variation. The proposed approach tackles the controllable video captioning problem by injecting strong control signals conditioned on selected objects, with the benefit of fast and fixed inference time, which is critical for real-time applications. | A more powerful object predictor may help solve the issue of incorrectly predicted objects. |
8 | Deng et al. (2021) | 2021 | SGR (Sketch, Ground, and Refine) Vanilla Transformer | SGR (Sketch, Ground, and Refine) reversed the predominant "detect-then-describe" fashion and proposed solving dense video captioning from a top-down perspective, i.e., generating a video-level story first and then grounding each sentence of the story to a video segment for detailed refinement. In this way, event segments are predicted based not only on the visual information but also on the semantic coherence of the text. | NA |
9 | Song et al. (2021) | 2021 | Vanilla transformer with dynamic-video-memory enhanced attention | The proposed model avoids the event detection stage and generates paragraphs directly. In the vanilla transformer, the standard attention module is replaced with dynamic-video-memory enhanced attention. | NA |
10 | Zhang et al. (2021) | 2021 | RSTNet (Relationship-Sensitive Transformer) - Transformer with Adaptive attention | Proposed RSTNet combined with GA and AA: a Grid-Augmented (GA) unit to integrate the spatial information of raw visual features extracted from images, and Adaptive Attention (AA) to facilitate fine-grained captioning by dynamically measuring the involvement of visual and textual signals in word prediction. | NA |
11 | Li et al. (2020) | 2020 | HERO: Hierarchical encoder for video+language omni-representation pre-training | A cross-modal transformer combining each subtitle sentence with its local video frames, followed by a temporal transformer to obtain a sequentially contextualized embedding for each video frame, proposed primarily for representation learning. | Unlike Multi-stream, which leverages fine-grained region-level features, HERO's results are reported on global frame-level features. It may therefore be difficult for HERO to capture inconsistencies between a hypothesis and the video content. |
12 | Ging et al. (2020) | 2020 | COOT: Cooperative hierarchical transformer for video-text representation learning | A hierarchical transformer architecture with an attention-aware feature aggregation layer and a contextual attention module. Semantic alignment between vision and text features in the joint embedding space is enforced through a cross-modal cycle-consistency loss. Both proposed components contribute jointly and individually to improved retrieval performance. | NA |
13 | Jin et al. (2020) | 2020 | SBAT: Video captioning with a sparse boundary-aware transformer | A boundary-aware pooling operation, which follows the preliminary scores of multi-head attention and selects features of different scenarios to reduce redundancy, is proposed to improve the vanilla transformer. A local correlation scheme is developed to compensate for the local information loss caused by the sparse operation; it can be applied synchronously with the boundary-aware strategy. | NA |
14 | Lei et al. (2020b) | 2020 | MMT: Multi-modal transformer | A multi-modal transformer following the vanilla transformer, taking both video and subtitles as encoder input and generating descriptions with the decoder. At inference, greedy decoding is employed instead of beam search. | Models with both appearance and motion features performed better than models using only one feature representation. |
15 | Lei et al. (2020a) | 2020 | MART: Memory-augmented recurrent transformer for coherent video paragraph captioning | A memory-augmented transformer-based paragraph description model conditioned on the given video with pre-defined event segments. The proposed model generates less redundant paragraphs while maintaining relevance to the videos. | Achieved best scores on the CIDEr-D and R@4 metrics, but not on B@4 and METEOR. |
16 | Zhu et al. (2018) | 2020 | Employing X-Lin+Tr Pan et al. (2020b) | Multi-view features and hybrid reward methods are proposed to address the variety of video content and the diversity of captions. The Weighted Ensemble method shows little improvement over the Average Ensemble, hence the former was selected. | NA |
17 | Fang et al. (2020) | 2020 | V2C transformer | Creation of a dataset annotated with captions and commonsense aspects. Proposed the V2C-Transformer architecture, which effectively generates relevant commonsense descriptions. | Not enough annotations per sample to compute a fair BLEU score for comparisons. |
18 | Iashin and Rahtu (2020) | 2020 | Multi-modal dense video captioning | A bi-modal transformer with a bi-modal multi-headed proposal generation module is proposed, demonstrating the use of audio and visual features for dense video captioning. | In the dense video captioning system, the audio-only model mostly picks up the signal of "talking"; this needs more attention. |
19 | Kitaev et al. (2020) | 2020 | Reformer | The Reformer combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences and with small memory use, even for models with a large number of layers. | The computational cost of the model grows with the number of hashes, so the hashing hyperparameter can be adjusted depending on the available compute budget. |
20 | Dai et al. (2020) | 2020 | Transformer-XL | An extra-long transformer is proposed to address the limitation of fixed-length context, introducing the notion of recurrence into a deep self-attention network. Hidden states computed for previous segments are reused as memory for the current segment, creating recurrent connections between segments. The authors also propose a simple yet effective positional encoding formulation. | NA |
21 | Luo et al. (2020) | 2020 | UniVL: Unified video and language pre-training model for multi-modal understanding and generation | Proposed a multi-modal video-language pre-training model trained on a large-scale instructional video dataset. The model is designed with four modules and five objectives, is capable of both video-language understanding and generation tasks, and learns a joint representation of video and language that adapts to down-stream multi-modal tasks. | The joint loss slightly degrades the generation task, although it performs well on the retrieval task. Excessive emphasis on coarse-grained matching can affect fine-grained description in the generation task. |
22 | Pan et al. (2020a) | 2020 | Auto-captions on GIF: A large-scale video-sentence dataset for vision-language pre-training | A large-scale, automatically generated pre-training dataset for generic video understanding. Designed a Transformer-based Encoder–Decoder structure for vision-language pre-training in the video domain. | NA |
23 | Bilkhu et al. (2019) | 2019 | Universal transformer | A generalized form of the standard transformer: a parallel-in-time recurrent self-attentive sequence model based on the ED architecture, applying recurrence to the representations of every position in both input and output sequences. The recurrence is over the depth, not over the position in the sequence; this depth recurrence is the main difference between the standard transformer and the UT. | NA |
24 | Sun et al. (2019b) | 2019 | VideoBERT: A joint model for video and language representation learning | A method to learn high-level video representations that capture semantically meaningful and temporally long-range structure. | Could explicitly model visual patterns at multiple temporal scales, instead of the proposed approach, which skips frames but builds a single vocabulary. |
25 | Lu et al. (2019) | 2019 | ViLBERT: Vision & language BERT | Extending the popular BERT, the authors developed a model and proxy tasks for learning joint visual-linguistic representations: a two-stream architecture with co-attentional transformer blocks that outperforms sensible ablations and exceeds the state of the art when transferred to multiple established vision-and-language tasks. | In training, language often identifies only the high-level semantics of visual content and is unlikely to reconstruct exact image features. Further, applying a regression loss could make it difficult to balance the losses incurred by masked image and text inputs. |
26 | Child et al. (2019) | 2019 | Sparse transformers | Introduced several sparse factorizations of the attention matrix, as well as restructured residual blocks, weight initialization for better training of deeper networks, and reduced memory usage. Unlike a standard transformer, where training with many layers is difficult, the sparse transformer facilitates hundreds of layers by using the pre-activation residual block. Instead of positional encodings, learned embeddings prove useful and efficient. | NA |
27 | Li and Qiu (2020) | 2019 | LSTM vs standard transformer | The authors explored attention over space and time with features extracted from a Pseudo-3D ResidualNet and compared neural network architectures using temporal attention over 2D-CNN features. They also compared the performance of LSTMs and transformers for video captioning. | Hyperparameter tuning during training is required. Spatio-temporal attention over P3D features did not improve performance. |
28 | Yang et al. (2019) | 2019 | NAVC (standard transformer) | Developed a non-autoregressive video captioning model (NAVC) with iterative refinement, exploiting external auxiliary scoring information to help the NAVC focus precisely on inappropriate words during refinement. The captioning decoder predicts target words with parallel generation of captions. | Explore an internal auxiliary scoring module to remove the external constraints. |
29 | Zhou et al. (2018) | 2018 | Masked transformer | An end-to-end transformer-based model employing a masking network to restrict attention to the proposal event over the encoding features. The proposed model employs a self-attention mechanism. | Small objects, such as utensils and ingredients, are hard to detect using global visual features but are crucial for describing a recipe. |
30 | Chen et al. (2018) | 2018 | Two-view transformer | TVT comprises a Transformer network backbone for sequential representation and two types of fusion blocks in the decoder layers for combining different modalities effectively, allowing parallel computing. | Other modalities could be incorporated into the TVT framework for better video captioning. |
31 | Vaswani et al. (2017) | 2017 | Standard transformer | A simple transduction architecture for sequence modeling based entirely on an attention mechanism, with the objectives of parallelization and reduced sequential computation. The commonly used recurrent layers in the ED architecture are replaced with multi-head self-attention layers, where self-attention computes a sequence representation by relating its different positions or parts. | NA |
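The common core of every transformer variant in this table is scaled dot-product self-attention, in which a sequence representation is computed by relating each position to every other position (row 31). A single-head sketch is given below; tensor shapes and names are illustrative, and real implementations add multiple heads, masking, residual connections, and layer normalization.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # (B, T, d_k) projections
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, T, T) pairwise scores
    weights = torch.softmax(scores, dim=-1)                    # attention over positions
    return weights @ v                                         # (B, T, d_k) representation

# Toy usage: 2 sequences of 10 frame features, model width 64
d_model = 64
x = torch.randn(2, 10, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
```

The variants then differ mainly in how this O(T²) attention is tamed: sparse factorizations (row 26), LSH bucketing in the Reformer (row 19), segment-level recurrence in Transformer-XL (row 20), depth recurrence in the UT (row 23), and proposal masking in the masked transformer (row 29).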
3.5.1 Standard/Vanilla transformer
3.5.2 Universal transformer
3.5.3 Masked transformer
3.5.4 Two-view transformer
3.5.5 Bidirectional transformer
3.5.6 Sparse transformer
3.5.7 Reformer (the efficient transformer)
3.5.8 Transformer-XL
3.6 Discussion—transformer based approaches
3.7 Deep reinforcement learning (DRL)
References | Year | Approach | DRL System Components | ||||
---|---|---|---|---|---|---|---|
Agent | Action | Environment | Reward | Goal | |||
Li and Qiu (2020) | 2019 | End-to-end video captioning with multitask RL | Captioning model | Predict next word | Input video with user-annotated captions | CIDEr score | Generate a proper sentence after observing the input video
He et al. (2019) | 2019 | Read, watch, and move: RL for temporally grounding natural language descriptions in videos | RL (actor-critic-based model) | One of 7 ways to adjust temporal boundaries | Video, the description, temporal grounding boundaries | 1, if temporal IoU is within a certain threshold; 0, otherwise | Extraction of well-matched video clips w.r.t. the provided query |
Zhang et al. (2019c) | 2019 | Reconstruction Network (RecNet) | ED model (RecNet) | Predict next word | Video content and ground-truth words | CIDEr score | Caption generation by metric optimization
Wang et al. (2018b) | 2018 | Video captioning via hierarchical RL | Manager + Worker + Critic | Selection of words from a dictionary | Textual and video context | Delta CIDEr | Maximize discounted return Rt |
Chen et al. (2018a) | 2018 | Less is more–picking informative frames for video captioning (PickNet) | PickNet Model | Frame picking | ED architecture | Sum of language reward and visual diversity reward | Select informative frames for the task of video captioning |
Pasunuru and Bansal (2017) | 2017 | Reinforced video captioning with entailment rewards | Baseline model | Word generation | Video and caption | CIDEnt (entailment corrected reward) | Minimize the negative expected reward |
Ren et al. (2017) | 2017 | Deep RL-based image captioning with embedding reward | Policy network + value network | Predict next word | Input image and predicted words | Visual-semantic embedding similarities | Generate caption similar to ground truth |
Phan et al. (2017) | 2017 | Consensus-based Sequence Training (CST) for video captioning | LSTM language model | Predict next word | Image/video features and words | CIDEr | Generate captions similar to reference captions |
Rennie et al. (2017) | 2017 | Self-critical sequence training (SCST) for image captioning | LSTM | Next word prediction | Words and image features | CIDEr | Generate image captions similar to reference captions |
Ranzato et al. (2016) | 2016 | Sequence-level training with recurrent neural networks (MIXER) | Generative model, RNN | Next word prediction | Context vector and words | BLEU, ROUGE-2 | Sequence generation
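All rows instantiate the same policy-gradient recipe: the captioner is the agent, emitting a word is the action, and a sentence-level metric (usually CIDEr) supplies the reward, optimized with REINFORCE. In self-critical sequence training (SCST, Rennie et al. 2017), the baseline is the reward of the greedily decoded caption, so sampled captions are reinforced only when they beat the greedy one. The sketch below assumes hypothetical `model.sample`, `model.greedy`, and `cider` interfaces; it illustrates the update rule, not a specific library.

```python
import torch

def scst_step(model, video_feats, refs, cider):
    """One self-critical (SCST) update: REINFORCE with a greedy-decode baseline.
    `model.sample`, `model.greedy`, and `cider` are assumed interfaces."""
    sampled_caps, log_probs = model.sample(video_feats)  # stochastic rollout + (B,) log-probs
    with torch.no_grad():
        greedy_caps = model.greedy(video_feats)          # baseline rollout (test-time decoding)
    r_sampled = cider(sampled_caps, refs)                # (B,) sentence-level rewards
    r_greedy = cider(greedy_caps, refs)
    advantage = r_sampled - r_greedy                     # centered reward reduces variance
    # Maximizing expected reward == minimizing -advantage-weighted log-likelihood
    return -(advantage * log_probs).mean()
```

The variants above change the baseline (a learned critic in actor-critic methods), the reward (delta CIDEr in HRL, CIDEnt's entailment-corrected reward), or the action space (frame picking in PickNet).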
3.8 Discussion—deep reinforcement learning (DRL)
4 Results comparison & discussion
4.1 Evaluation metrics
4.1.1 BLEU
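BLEU is corpus-level modified n-gram precision scaled by the brevity penalty (BP; see the abbreviation table). In the standard formulation,

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r, \\ e^{\,1 - r/c} & \text{if } c \le r, \end{cases}$$

where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights (N = 4 for the commonly reported B@4), $c$ is the candidate length, and $r$ is the effective reference length.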
4.1.2 METEOR
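METEOR aligns candidate and reference unigrams (exact, stemmed, and WordNet-synonym matches; cf. WN in the abbreviation table) and combines unigram precision $P$ and recall $R$ with a recall-weighted harmonic mean and a fragmentation penalty. In the original formulation,

$$F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad \mathrm{Penalty} = 0.5\left(\frac{\#\mathrm{chunks}}{\#\mathrm{matched\ unigrams}}\right)^{3}, \qquad \mathrm{METEOR} = F_{\mathrm{mean}}\,(1 - \mathrm{Penalty}),$$

where a chunk is a maximal run of matched unigrams that are adjacent in both sentences, so fewer chunks indicate better word order.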
4.1.3 ROUGE
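ROUGE-L, the variant reported in the result tables (column R), is the F-measure of the longest common subsequence (LCS) between a reference $X$ of length $m$ and a candidate $Y$ of length $n$:

$$R_{lcs} = \frac{\mathrm{LCS}(X, Y)}{m}, \qquad P_{lcs} = \frac{\mathrm{LCS}(X, Y)}{n}, \qquad F_{lcs} = \frac{(1 + \beta^2)\,R_{lcs}\,P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}},$$

with $\beta$ weighting recall over precision. ROUGE-N, ROUGE-W, and ROUGE-S (see the abbreviation table) replace the LCS statistic with n-gram co-occurrence, weighted LCS, and skip-bigram co-occurrence, respectively.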
4.1.4 CIDEr
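CIDEr represents the candidate $c$ and each reference $s_j$ as TF-IDF-weighted vectors $g^n(\cdot)$ of their n-grams, so n-grams common across the whole corpus count less, and averages the cosine similarity over the $M$ references and the n-gram orders:

$$\mathrm{CIDEr}_n(c, S) = \frac{1}{M} \sum_{j=1}^{M} \frac{g^n(c) \cdot g^n(s_j)}{\lVert g^n(c) \rVert\, \lVert g^n(s_j) \rVert}, \qquad \mathrm{CIDEr}(c, S) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{CIDEr}_n(c, S),$$

with $N = 4$ typically. Because it is consensus-based, CIDEr is the metric most often optimized directly by the DRL approaches of Sect. 3.7.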
4.2 Datasets for evaluation
4.2.1 MSVD—the microsoft video description dataset
References | Model | Year | B (BLEU) | M (METEOR) | R (ROUGE-L) | C (CIDEr)
---|---|---|---|---|---|---|
A. Encoder-decoder based approaches | ||||||
Gao et al. (2022) | vc-HRNAT | 2022 | 57.7 | 36.8 | 74.1 | 98.1 |
Madake (2022) | DVC | 2022 | 75 | 34.7 | – | – |
Perez-Martin et al. (2021a) | AVSSN | 2021 | 62.3 | 39.2 | 76.8 | 107.7 |
Zheng et al. (2020) | SAAT | 2020 | 46.5 | 33.5 | 69.4 | 81.0 |
Chen et al. (2020) | VNS-GRU | 2020 | 64.9 | 41.1 | 78.5 | 115 |
Hou et al. (2019) | JSRL-VCT | 2019 | 52.8 | 36.1 | 71.8 | 87.8 |
Wang et al. (2019a) | GFN-POS | 2019 | 53.9 | 34.9 | 72.1 | 91.0 |
Xiao and Shi (2019z) | DCM-ED(M) | 2019 | 53.3 | 35.6 | 71.2 | 83.1 |
Chen et al. (2019b) | TDConvED(R) | 2019 | 53.3 | 33.8 | – | 76.4 |
Babariya and Tamaki (2020) | OAM | 2019 | 43.5 | 31.6 | – | 64.9 |
Chen et al. (2019a) | SDN | 2019 | 61.8 | 37.8 | 76.8 | 103 |
Aafaq et al. (2019a) | GRU-EVE | 2019 | 47.9 | 35.0 | 71.5 | 78.1 |
Olivastri (2019) | EtENet-IRv2 | 2019 | 50.0 | 34.3 | 70.2 | 86.6 |
Zhang et al. (2019a) | OA-BTG | 2019 | 56.9 | 36.2 | – | 90.6 |
Lee et al. (2019) | LR-dep(NL) | 2019 | 49.7 | 33.7 | 71.7 | 84.5 |
Liu et al. (2020) | SibNet | 2018 | 54.2 | 34.8 | 71.7 | 88.2 |
Zhao et al. (2018) | Tubes | 2018 | 77.6 | 32.6 | 69.3 | 52.2 |
Lee and Kim (2018) | SeFLA | 2018 | 84.8 | – | – | 94.3 |
Wang et al. (2018a) | RecNet | 2018 | 52.3 | 34.1 | 69.8 | 80.3 |
Zhang et al. (2017) | TDDF | 2017 | 45.8 | 33.3 | 69.7 | 73.0 |
Pan et al. (2017) | LSTM-TSA | 2017 | 52.8 | 33.5 | – | 74.0 |
Wang and Song (2017) | S2VTK | 2017 | 42.5 | 31.0 | – | – |
Pan et al. (2016) | LSTM-E | 2016 | 45.3 | 31.0 | – | – |
Venugopalan et al. (2015) | S2VT | 2015 | – | 29.8 | – | – |
Lowell et al. (2014) | LSTM-YT | 2014 | 33.29 | 29.07 | – | – |
B. DRL approaches | ||||||
Li and Qiu (2020) | Multi-task RL | 2019 | 50.3 | 34.1 | 70.8 | 87.5 |
Chen et al. (2018a) | PickNet | 2018 | 49.9 | 33.1 | 69.3 | 76 |
Pasunuru and Bansal (2017) | CIDEnt | 2017 | 54.4 | 34.9 | 72.2 | 88.6 |
C. Transformer-based approaches | ||||||
Im and Choi (2022) | UAT-FEGs | 2022 | 56.5 | 36.4 | 72.8 | 92.8 |
Liu et al. (2021) | O2NA | 2021 | 55.4 | 37.4 | 74.5 | 96.4 |
Jin et al. (2020) | SBAT | 2020 | 53.5 | 35.3 | 72.3 | 89.5 |
Li and Qiu (2020) | 2D-CNN+Tr | 2019 | 40.8 | – | – | – |
Li and Qiu (2020) | P3D+CNN | 2019 | 35.4 | – | – | – |
Yang et al. (2019) | NAVC | 2019 | 53.1 | 35.5 | – | 89.4
Bilkhu et al. (2019) | I3D+UT | 2019 | 46.0 | – | – | –
Chen et al. (2018) | TVT | 2018 | 53.9 | 35.2 | 72.0 | 86.7 |
D. Attention-based approaches | ||||||
Ji et al. (2022) | ADL | 2022 | 54.1 | 35.7 | 70.4 | 81.6 |
Peng et al. (2021) | T-DL | 2021 | 55.1 | 36.4 | 72.2 | 85.7 |
Ryu et al. (2021) | SGN | 2021 | 52.8 | 35.5 | 72.9 | 94.3 |
Perez-Martin et al. (2021b) | SemSynAN | 2021 | 64.4 | 41.9 | 79.5 | 111.5 |
Zhang et al. (2020) | ORG-TRL | 2020 | 54.3 | 36.4 | 73.9 | 95.2 |
Yan et al. (2020) | STAT | 2020 | 52.0 | 33.3 | – | 73.8 |
Bin et al. (2019) | BiLSTM | 2019 | 37.3 | 30.3 | – | –
Xiao and Shi (2019b) | Attrib_Sel | 2019 | 56.5 | 35.4 | – | 86.1 |
Sun et al. (2019b) | MSAN | 2019 | 56.4 | 35.3 | – | 79.6 |
Li et al. (2019b) | Res-ATT | 2019 | 53.4 | 34.3 | – | 72.9 |
Gao et al. (2019) | hLSTMat | 2019 | 54.3 | 33.6 | – | 73.8 |
Chen et al. (2018b) | SSTA-R | 2018 | 45.3 | 30.3 | – | 59.2 |
Gao et al. (2017) | aLSTMs | 2017 | 50.8 | 33.3 | – | 74.8
Li et al. (2017) | MAM-RNN | 2017 | 41.3 | 32.9 | 68.8 | 53.9 |
Xu et al. (2017) | MA-LSTM | 2017 | 52.3 | 33.6 | – | 70.4 |
Hori et al. (2017) | A.F | 2017 | 53.9 | 32.2 | – | 68.8 |
Chen and Jiang (2019) | MGSA | 2017 | 53.4 | 35.0 | – | 86.7 |
Laokulrat et al. (2016) | S2S-TA | 2016 | 43.7 | 32.6 | 68.1 | 75.0 |
4.2.2 MSR-VTT—Microsoft Research Video to Text
Performance comparison on the MSR-VTT dataset; metric abbreviations as above.
References | Model | Year | B | M | R | C
---|---|---|---|---|---|---|
A. Encoder–decoder based approaches | ||||||
Gao et al. (2022) | vc-HRNAT | 2022 | 43 | 28.2 | 61.7 | 49.6 |
Zhao et al. (2022) | Tr-LSTM-RL | 2022 | 42 | 28.8 | 62 | 54.2 |
Zhang et al. (2021) | RCG | 2021 | 43.1 | 29.3 | 61.9 | 52.9 |
Perez-Martin et al. (2021a) | AVSSN | 2021 | 45.5 | 31.4 | 64.3 | 50.6 |
Zheng et al. (2020) | SAAT | 2020 | 40.5 | 28.2 | 60.9 | 49.1 |
Chen et al. (2020) | VNS-GRU | 2020 | 46.0 | 29.5 | 63.3 | 52.0 |
Hou et al. (2019) | JSRL-VCT | 2019 | 42.3 | 29.7 | 62.8 | 49.1 |
Wang et al. (2019a) | GFN-POS | 2019 | 42.0 | 28.2 | 61.6 | 48.7 |
Xiao and Shi (2019a) | DCM-Best1(M) | 2019 | 43.8 | 34.2 | 65.8 | 47.6
Chen et al. (2019b) | TDConvED(R) | 2019 | 39.5 | 27.5 | – | 42.8 |
Olivastri (2019) | EtENet-IRv2 | 2019 | 40.5 | 27.7 | 60.6 | 47.6 |
Chen et al. (2019a) | SDN | 2019 | 43.8 | 28.9 | 62.4 | 51.4 |
Aafaq et al. (2019a) | GRU-EVE | 2019 | 38.3 | 28.4 | 60.7 | 48.1 |
Hammad et al. (2019) | MM-features | 2019 | 39.2 | 27.8 | 59.8 | 45.7 |
Zhang et al. (2019a) | OA-BTG | 2019 | 41.4 | 28.2 | – | 46.9 |
Liu et al. (2020) | SibNet | 2018 | 40.9 | 27.5 | 60.2 | 47.5 |
Lee and Kim (2018) | SeFLA | 2018 | 41.8 | – | – | – |
Wang et al. (2018a) | RecNet | 2018 | 39.1 | 26.6 | 59.3 | 42.7 |
Shen et al. (2017) | Lexical-FCN | 2017 | 41.4 | 28.3 | 61.1 | 48.9 |
Zhang et al. (2017) | TDDF | 2017 | 37.3 | 27.8 | 59.2 | 43.8 |
B. DRL approaches | ||||||
Li and Qiu (2020) | E2E(MT-RL) | 2019 | 40.4 | 27 | 61 | 48.3 |
Wang et al. (2018b) | HRL | 2018 | 41.3 | 28.7 | 61.7 | 48 |
Chen et al. (2018a) | PickNet | 2018 | 38.9 | 27.2 | 59.5 | 42.1 |
Pasunuru and Bansal (2017) | CIDEnt | 2017 | 40.5 | 28.4 | 61.4 | 51.7 |
Phan et al. (2017) | CST | 2017 | 42.2 | 28.9 | 62.3 | 54.2 |
C. Transformer-based approaches | ||||||
Im and Choi (2022) | UAT-FEGs | 2022 | 43 | 27.8 | 60.9 | 49.7 |
Liu et al. (2021) | O2NA | 2021 | 41.6 | 28.5 | 62.4 | 51.1 |
Jin et al. (2020) | SBAT | 2020 | 42.9 | 28.9 | 61.5 | 51.6 |
Yang et al. (2019) | NRVC | 2019 | 42.50 | 28.0 | – | 49.40 |
Chen et al. (2018) | TVT | 2018 | 42.46 | 28.29 | 61.07 | 48.53 |
D. Attention-based approaches | ||||||
Ji et al. (2022) | ADL | 2022 | 40.2 | 26.6 | 60.2 | 44 |
Peng et al. (2021) | T-DL | 2021 | 42.3 | 28.9 | 61.7 | 49.2 |
Ryu et al. (2021) | SGN | 2021 | 40.8 | 28.3 | 60.8 | 49.5 |
Perez-Martin et al. (2021b) | SemSynAN | 2021 | 46.4 | 30.4 | 64.7 | 51.9 |
Zhang et al. (2020) | ORG-TRL | 2020 | 43.6 | 28.8 | 62.1 | 50.9 |
Yan et al. (2020) | STAT | 2020 | 39.3 | 27.1 | – | 44.0 |
Bin et al. (2019) | BiLSTM | 2019 | 33.9 | 26.2 | – | –
Xiao and Shi (2019b) | Attrib_Sel | 2019 | 40.1 | 27.2 | – | 45.5 |
Sun et al. (2019b) | MSAN | 2019 | 46.8 | 29.5 | – | 52.4 |
Li et al. (2019b) | Res-ATT | 2019 | 37 | 26.9 | – | 40.7 |
Gao et al. (2019) | hLSTMat | 2019 | 39.7 | 27 | – | 43.4 |
Wang et al. (2018c) | HACA | 2018 | 43.4 | 29.5 | 61.8 | 49.7 |
Gao et al. (2017) | aLSTMs | 2017 | 38 | 26.1 | – | 43.2
Xu et al. (2017) | MA-LSTM | 2017 | 36.5 | 26.5 | 59.8 | 41 |
Hori et al. (2017) | A.F | 2017 | 39.7 | 25.5 | – | 40.4 |
Chen and Jiang (2019) | MGSA | 2017 | 45.4 | 28.6 | – | 50.1 |
4.2.3 ActivityNet Captions
Dense video captioning results on the ActivityNet Captions dataset; metric abbreviations as above.
References | Model | Year | B | M | R | C |
---|---|---|---|---|---|---|
A. Standard ED approaches | ||||||
Seo et al. (2022) | MV-GPT | 2022 | 6.84 | 12.31 | – | – |
Aafaq et al. (2022) | VSJM-Net | 2022 | 3.97 | 12.89 | 25.37 | 26.52 |
Hosseinzadeh et al. (2021) | VC-FF | 2021 | 2.76 | 7.02 | 18.16 | 26.55 |
Hou et al. (2019) | JSRL-VCT | 2019 | 1.9 | 11.30 | 22.40 | 44.20 |
Li et al. (2018) | DVC | 2018 | 1.62 | 10.33 | – | 25.24 |
Xu et al. (2019) | JEDDi-Net | 2018 | 1.63 | 8.58 | 19.63 | 19.88 |
B. Transformer-based approaches | ||||||
Wang et al. (2021) | PDVC | 2021 | 10.29 | 15.8 | – | 20.45 |
Estevam et al. (2021) | BMT-V+sm | 2021 | 2.55 | 8.65 | 13.62 | 13.48 |
Deng et al. (2021) | SGR | 2021 | 1.67 | 9.07 | – | 22.12 |
Song et al. (2021) | TR-Dyn-mem | 2021 | 12.2 | 16.1 | – | 27.36 |
Ging et al. (2020) | COOT | 2020 | 17.43 | 15.99 | 31.45 | 28.19 |
Iashin and Rahtu (2020) | MDVC | 2020 | 5.83 | 11.72 | – | – |
Lei et al. (2020a) | MART | 2020 | 9.78 | 15.57 | – | 22.16 |
Bilkhu et al. (2019) | I3D+UT | 2019 | 49* | – | – | – |
Zhou et al. (2018) | E2E-MskTr | 2018 | 2.23 | 9.56 | – | – |
C. Attention-based approaches | ||||||
Chen and Jiang (2021) | EC-SL | 2021 | 1.33 | 7.49 | 13.02 | 21.21 |
Krishna et al. (2017) | DenseCap | 2017 | 3.98 | 9.5 | – | 24.6
4.2.4 YouCookII
Results on the YouCookII dataset; metric abbreviations as above.
References | Model | Year | B | M | R | C
---|---|---|---|---|---|---|
A. Standard ED approaches | ||||||
Seo et al. (2022) | MV-GPT | 2022 | 21.88 | 27.09 | 49.38 | 2.21 |
Aafaq et al. (2022) | VSJM-Net | 2022 | 1.09 | 4.31 | 10.51 | 9.07 |
B. Transformer-based approaches | ||||||
Wang et al. (2021) | PDVC | 2021 | 0.89 | 4.74 | – | 23.07 |
Deng et al. (2021) | SGR | 2021 | – | 4.35 | – | – |
Ging et al. (2020) | COOT | 2020 | 17.97 | 19.85 | 37.94 | 57.24 |
Lei et al. (2020a) | MART | 2020 | 8.0 | 15.9 | – | 35.74 |
Sun et al. (2019b) | VideoBERT | 2019 | 4.33 | 11.94 | 28.8 | 0.55 |
Zhou et al. (2018) | E2E-MskTr | 2018 | 1.13 | 5.9 | – | – |
4.2.5 TVC—TV show Caption
4.2.6 VATEX—Video And TEXt
Results on the VATEX dataset (En: English captions; Ch: Chinese captions); metric abbreviations as above.
References | Model (dataset) | Year | B | M | R | C
---|---|---|---|---|---|---|
A. Standard ED approaches | ||||||
Gao et al. (2022) | vc-HRNAT | 2022 | 32.1 | 21.9 | 48.4 | 48.5 |
Zhang et al. (2021) | RCG | 2021 | 33.9 | 23.7 | 50.2 | 57.5 |
B. Transformer-based approaches | ||||||
Zhu et al. (2018) | X-Lin+Tr(VATEX-En) | 2020 | 40.7 | 25.8 | 53.7 | 81.4 |
| X-Lin+Tr(VATEX-Ch) | | 32.6 | 32.1 | 56.5 | 59.5
C. Attention-based approaches | ||||||
Zhang et al. (2020) | ORG-TRL(VATEX-En) | 2020 | 32.1 | 22.2 | 48.9 | 49.7 |
Lin et al. (2020) | FAtt(VATEX-En) | 2020 | 39.2 | 25.0 | 52.7 | 76.0 |
| FAtt(VATEX-Ch) | | 33.1 | 30.3 | 49.7 | 50.4
Wang et al. (2019b) | ML-Vatex(VATEX-En) | 2019 | 28.4 | 21.7 | 47.0 | 45.1 |
| ML-Vatex(VATEX-Ch) | | 24.9 | 29.8 | 51.7 | 35.0
Results reported on other datasets (the benchmark used is given in parentheses after each model); metric abbreviations as above.
References | Model (dataset) | Year | B | M | R | C
---|---|---|---|---|---|---|
A. Standard ED approaches | ||||||
Hammoudeh et al. (2022) | Soccer-Cap(Soccer-Dataset) | 2022 | 14.9 | 27 | 35 | 0.99 |
Zhao et al. (2018) | Tubes(Charades) | 2018 | 31.5 | 19.1 | – | 18 |
Pan et al. (2017) | LSTM-TSA(MPII-MD) | 2017 | – | 8 | – | – |
Pan et al. (2017) | LSTM-TSA(M-VAD) | 2017 | – | 7.2 | – | – |
Donahue et al. (2017) | LRCN(TACoS-ML) | 2015 | 28.8 | – | – | – |
Venugopalan et al. (2015) | S2VT(MPII-MD) | 2015 | – | 7.1 | – | – |
Venugopalan et al. (2015) | S2VT(M-VAD) | 2015 | – | 6.7 | – | – |
Yan et al. (2010) | CVC(WExpo10) | 2010 | 95.7 | 67.8 | 95.8 | 81.1 |
B. DRL Approaches | ||||||
Ren et al. (2017) | RL-EmbRewd(MS-COCO) | 2017 | 30.4 | 25.1 | 52.5 | 93.7 |
Rennie et al. (2017) | SCST(MS-COCO) | 2017 | 31.9 | 25.5 | 54.3 | 106 |
Wang et al. (2018b) | HRL(Charades) | 2018 | 18.8 | 19.5 | 41.4 | 23.6
C. Transformer-based approaches | ||||||
Wu (2022) | NSVA (NSVA) | 2022 | 24.3 | 24.3 | 50.8 | 113.9 |
Yuan et al. (2022) | X-Trans2Cap(ScanRefer) | 2022 | 49.07 | 32.25 | 65.54 | 106.11 |
Yuan et al. (2022) | X-Trans2Cap(Nr3D) | 2022 | 40.51 | 31.36 | 68.84 | 85.4 |
Vo et al. (2022) | NOC-REK(MSCOCO) | 2022 | – | 32.8 | – | 138.4 |
Vo et al. (2022) | NOC-REK(NOCap) | 2022 | – | – | – | 93 |
Song et al. (2021) | TR-Dyn-mem(CharadesCap) | 2021 | 20.34 | 20.05 | – | 27.54 |
D. Attention-based approaches | ||||||
Chen et al. (2021) | Scan2Cap(ScanRefer) | 2021 | 41.49 | 29.23 | 63.66 | 67.95 |
Gao et al. (2019) | hLSTMat(LSMDC) | 2019 | 7.0 | 5.8 | 15.0 | 10.4 |
Zhou et al. (2019) | GVD(ANet Ent) | 2019 | 2.35 | 11.0 | – | 45.5 |
Li et al. (2017) | MAM-RNN(Charades) | 2017 | 13.3 | 19.1 | – | 18.3 |
Yu (2017) | GEAN(VAS) | 2017 | – | 8.4 | 22.9 | 8.4 |
Yu (2017) | GEAN(LSMDC) | 2017 | – | 7.2 | 15.6 | 9.3 |
Laokulrat et al. (2016) | S2S-TA(M-VAD) | 2016 | 0.8 | 7.2 | 15.9 | 8.8 |