ABSTRACT
Automatic generation of video captions is a challenging task, as video is an information-intensive medium with complex variations. Most existing methods, whether based on language templates or sequence learning, treat video as a flat data sequence and ignore its intrinsic multimodal nature. Observing that different modalities (e.g., frame, motion, and audio streams), as well as the elements within each modality, contribute differently to sentence generation, we present a novel deep framework that boosts video captioning by learning Multimodal Attention Long Short-Term Memory networks (MA-LSTM). The proposed MA-LSTM fully exploits both multimodal streams and temporal attention to selectively focus on specific elements during sentence generation. Moreover, we design a novel child-sum fusion unit in the MA-LSTM to effectively combine the different encoded modalities into the initial decoding states. Unlike existing approaches that employ the same LSTM structure for all modalities, we train modality-specific LSTMs to capture the intrinsic representation of each modality. Experiments on two benchmark datasets (MSVD and MSR-VTT) show that our MA-LSTM significantly outperforms state-of-the-art methods, achieving 52.3 BLEU@4 and 70.4 CIDEr-D on the MSVD dataset.
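The abstract names two architectural pieces: per-modality temporal attention and a child-sum fusion unit that initializes the decoder from the modality-specific encoder states. The following is a minimal PyTorch sketch of how such components could look; it is not the authors' implementation, and all module names, dimensions, and the gated-sum formulation of the fusion are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChildSumFusion(nn.Module):
    """Hypothetical fusion unit: merge the final states of modality-specific
    encoders (e.g., frame, motion, audio) into one decoder initial state by
    summing gated projections, in the spirit of a child-sum tree-LSTM node."""
    def __init__(self, modality_dims, hidden_dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, hidden_dim) for d in modality_dims)
        self.gate = nn.ModuleList(nn.Linear(d, hidden_dim) for d in modality_dims)

    def forward(self, states):  # states: list of (B, dim_m) encoder states
        fused = 0
        for proj, gate, s in zip(self.proj, self.gate, states):
            fused = fused + torch.sigmoid(gate(s)) * torch.tanh(proj(s))
        return fused  # (B, hidden_dim), used as the decoder's initial state


class TemporalAttention(nn.Module):
    """Additive temporal attention over one modality's time steps,
    conditioned on the current decoder hidden state."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hidden_dim, bias=False)
        self.w_hid = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, feats, h):  # feats: (B, T, feat_dim), h: (B, hidden_dim)
        e = self.score(torch.tanh(self.w_feat(feats) + self.w_hid(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)        # attention weights over the T steps
        return (alpha * feats).sum(dim=1)  # attended context, (B, feat_dim)


class MADecoderStep(nn.Module):
    """One decoding step: attend to each modality independently, then feed
    the concatenated contexts and the previous word embedding to an LSTM."""
    def __init__(self, modality_dims, embed_dim, hidden_dim):
        super().__init__()
        self.attn = nn.ModuleList(
            TemporalAttention(d, hidden_dim) for d in modality_dims)
        self.cell = nn.LSTMCell(embed_dim + sum(modality_dims), hidden_dim)

    def forward(self, word_emb, modality_feats, h, c):
        contexts = [a(f, h) for a, f in zip(self.attn, modality_feats)]
        return self.cell(torch.cat([word_emb] + contexts, dim=1), (h, c))
```

In this reading, each modality keeps its own encoder LSTM and its own attention weights, so the frame, motion, and audio streams are weighted separately at every generated word; the paper itself should be consulted for the exact fusion and attention equations.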