SibNet: Sibling Convolutional Encoder for Video Captioning

ABSTRACT
Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it using natural language. Unlike previous work, which encodes video information using a single flow, we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning that utilizes a two-branch architecture to collaboratively encode videos. The first, content branch encodes the visual content of the video via an autoencoder, while the second, semantic branch encodes the video's semantic information via visual-semantic joint embedding. The outputs of both branches are then combined through a soft-attention mechanism and fed into an RNN decoder to generate captions. By explicitly capturing both content and semantic information, SibNet can better represent the rich information in videos. Extensive experiments on the YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.
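The core fusion step described above — weighting the content and semantic branch outputs with soft attention before decoding — can be sketched as follows. This is a minimal illustration only, not SibNet's actual implementation: the function names, the use of scalar per-branch relevance scores, and the plain-Python vectors are all assumptions made for clarity.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention_fuse(content_feat, semantic_feat, scores):
    # Fuse the two branch outputs into one vector for the RNN decoder,
    # using attention weights derived from per-branch relevance scores.
    # content_feat / semantic_feat: equal-length feature vectors;
    # scores: [content_score, semantic_score], e.g. produced by a small
    # scoring network conditioned on the decoder state (hypothetical here).
    w = softmax(scores)
    fused = [w[0] * c + w[1] * s
             for c, s in zip(content_feat, semantic_feat)]
    return fused, w
```

With equal relevance scores the fusion reduces to the mean of the two branch vectors; as one score dominates, the fused representation leans toward that branch, which is the intuition behind letting the decoder attend to content or semantics as needed.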