
Retrieval Augmented Convolutional Encoder-decoder Networks for Video Captioning

Published: 23 January 2023

Abstract

Video captioning has emerged as an important research topic in computer vision; it aims to generate a natural sentence that correctly reflects the visual content of a video. The well-established approach relies on the encoder-decoder paradigm, learning to encode the input video and decode the variable-length output sentence in a sequence-to-sequence manner. Nevertheless, such models often fail to produce complex and descriptive sentences as natural as those written by humans, since they cannot memorize all of the visual content and syntactic structures in the human-annotated video-sentence pairs. In this article, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. Specifically, for each query video, a video-sentence retrieval model first fetches semantically relevant sentences from the training sentence pool, together with their corresponding training videos. RAM then writes the retrieved video-sentence pairs into memory and reads the memorized visual content and syntactic structures from memory to facilitate word prediction at each timestep. Furthermore, we present the Retrieval Augmented Convolutional Encoder-Decoder Network (R-ConvED), which integrates RAM into a convolutional encoder-decoder structure to boost video captioning. Extensive experiments on the MSVD, MSR-VTT, ActivityNet Captions, and VATEX datasets validate the superiority of our proposals and demonstrate quantitatively compelling results.
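
The abstract describes RAM at a high level: retrieve related video-sentence pairs, write them into a memory, and read from that memory at every decoding step to inform word prediction. As a rough illustration only, the sketch below shows one way such a write/read cycle could be wired up in PyTorch; the module name `RetrievalAugmentedMemory`, the dot-product attention read, the feature dimensions, and the fusion-by-concatenation classifier are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a retrieval-augmented decoding step, loosely following the
# abstract's description of RAM. Module names, dimensions, and the use of
# scaled dot-product attention are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalAugmentedMemory(nn.Module):
    """Writes retrieved video/sentence features into a memory bank and reads
    from it with attention to augment each word-prediction step."""

    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)   # maps decoder state to a read query
        self.key_proj = nn.Linear(d_model, d_model)     # keys over memory slots
        self.value_proj = nn.Linear(d_model, d_model)   # values over memory slots
        self.classifier = nn.Linear(2 * d_model, vocab_size)

    def write(self, retrieved_video_feats, retrieved_sent_feats):
        # Concatenate the retrieved video-sentence pairs along the slot axis:
        # (batch, 2*k, d_model) memory, one slot per retrieved feature.
        self.memory = torch.cat([retrieved_video_feats, retrieved_sent_feats], dim=1)

    def read(self, decoder_state):
        # Attention read: score each memory slot against the current decoder state.
        q = self.query_proj(decoder_state).unsqueeze(1)           # (batch, 1, d)
        k = self.key_proj(self.memory)                            # (batch, slots, d)
        v = self.value_proj(self.memory)                          # (batch, slots, d)
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                              # (batch, d)

    def forward(self, decoder_state):
        # Fuse the memory read-out with the decoder state before word prediction.
        read_out = self.read(decoder_state)
        return self.classifier(torch.cat([decoder_state, read_out], dim=-1))


if __name__ == "__main__":
    batch, k, d = 2, 5, 512
    ram = RetrievalAugmentedMemory(d_model=d)
    # Features of the top-k retrieved training videos and their sentences
    # (in practice produced by a video-sentence retrieval model).
    ram.write(torch.randn(batch, k, d), torch.randn(batch, k, d))
    next_word_logits = ram(torch.randn(batch, d))   # one decoding timestep
    print(next_word_logits.shape)                   # torch.Size([2, 10000])
```

In a full captioning model, such a read would be repeated at every timestep of the convolutional decoder, with the memory written once per query video from the retrieval results.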


      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s
        February 2023
        504 pages
        ISSN: 1551-6857
        EISSN: 1551-6865
        DOI: 10.1145/3572859
        • Editor: Abdulmotaleb El Saddik

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 23 January 2023
        • Online AM: 26 May 2022
        • Accepted: 16 May 2022
        • Revised: 12 April 2022
        • Received: 3 December 2021
        Published in TOMM Volume 19, Issue 1s


        Qualifiers

        • research-article
        • Refereed
