Abstract
Video captioning is an emerging research topic in computer vision that aims to generate a natural-language sentence correctly reflecting the visual content of a video. The well-established approach relies on the encoder-decoder paradigm, learning to encode the input video and decode a variable-length output sentence in a sequence-to-sequence manner. Nevertheless, these approaches often fail to produce sentences as complex and descriptive as those written by humans, since the models cannot memorize all of the visual content and syntactic structures in the human-annotated video-sentence pairs. In this article, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. Specifically, for each query video, a video-sentence retrieval model first fetches semantically relevant sentences from the training sentence pool, together with their corresponding training videos. RAM then writes the retrieved video-sentence pairs into memory and reads the memorized visual content and syntactic structures from memory to facilitate word prediction at each timestep. Furthermore, we present the Retrieval Augmented Convolutional Encoder-Decoder Network (R-ConvED), which integrates RAM into a convolutional encoder-decoder structure to boost video captioning. Extensive experiments on the MSVD, MSR-VTT, ActivityNet Captions, and VATEX datasets validate the superiority of our proposals and demonstrate quantitatively compelling results.
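The retrieve-write-read flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the 4-d embeddings, the cosine-similarity retriever, and the softmax-attention memory read are all simplifying assumptions standing in for the learned video-sentence retrieval model and the RAM read operator.

```python
import numpy as np

def retrieve_top_k(query, sentence_embs, k=2):
    # Fetch the k most semantically relevant entries from the training
    # sentence pool via cosine similarity (a stand-in for the learned
    # video-sentence retrieval model).
    sims = sentence_embs @ query / (
        np.linalg.norm(sentence_embs, axis=1) * np.linalg.norm(query) + 1e-8)
    return np.argsort(-sims)[:k]

def memory_read(hidden, memory):
    # Attention-style read: softmax over memory slots keyed by the
    # decoder state, returning a weighted sum of memorized entries.
    scores = memory @ hidden
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

# Toy pool of 4-d sentence embeddings (hypothetical values).
pool = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])   # embedding of the query video

idx = retrieve_top_k(query, pool, k=2)   # retrieval step
memory = pool[idx]                       # write retrieved pairs into memory
context = memory_read(query, memory)     # read memory at one decoding step
```

In the full model, `context` would be fused with the decoder's hidden state to condition the next-word prediction at each timestep; here it simply demonstrates that the read operation blends the retrieved entries according to their relevance to the current state.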
Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning