Abstract
Video captioning is an emerging research topic in computer vision that aims to generate a natural-language sentence correctly reflecting the visual content of a video. The well-established approach relies on the encoder-decoder paradigm, learning to encode the input video and decode a variable-length output sentence in a sequence-to-sequence manner. Nevertheless, these approaches often fail to produce sentences as complex and descriptive as those written by humans, since the models cannot memorize all of the visual content and syntactic structures in the human-annotated video-sentence pairs. In this article, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. Specifically, for each query video, a video-sentence retrieval model first fetches semantically relevant sentences from the training sentence pool, together with their corresponding training videos. RAM then writes the retrieved video-sentence pairs into memory and reads the memorized visual content and syntactic structures from memory to facilitate word prediction at each timestep. Furthermore, we present the Retrieval Augmented Convolutional Encoder-Decoder Network (R-ConvED), which integrates RAM into a convolutional encoder-decoder structure to boost video captioning. Extensive experiments on the MSVD, MSR-VTT, ActivityNet Captions, and VATEX datasets validate the superiority of our proposals and demonstrate quantitatively compelling results.
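The retrieve-write-read flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the 4-d embeddings, the cosine-similarity retriever, and the softmax-attention memory read are all simplifying assumptions standing in for the learned video-sentence retrieval model and the RAM read operator.

```python
import numpy as np

def retrieve_top_k(query, sentence_embs, k=2):
    # Fetch the k most semantically relevant entries from the training
    # sentence pool via cosine similarity (a stand-in for the learned
    # video-sentence retrieval model).
    sims = sentence_embs @ query / (
        np.linalg.norm(sentence_embs, axis=1) * np.linalg.norm(query) + 1e-8)
    return np.argsort(-sims)[:k]

def memory_read(hidden, memory):
    # Attention-style read: softmax over memory slots keyed by the
    # decoder state, returning a weighted sum of memorized entries.
    scores = memory @ hidden
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

# Toy pool of 4-d sentence embeddings (hypothetical values).
pool = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])   # embedding of the query video

idx = retrieve_top_k(query, pool, k=2)   # retrieval step
memory = pool[idx]                       # write retrieved pairs into memory
context = memory_read(query, memory)     # read memory at one decoding step
```

In the full model, `context` would be fused with the decoder's hidden state to condition the next-word prediction at each timestep; here it simply demonstrates that the read operation blends the retrieved entries according to their relevance to the current state.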
Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning