ABSTRACT
This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of the nonlinearity transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirroring are proposed to prevent overfitting when training deep models. We visualize the evolution of the bidirectional LSTM's internal states over time and qualitatively analyze how our models "translate" images to sentences. Our proposed models are evaluated on caption generation and image-sentence retrieval tasks on three benchmark datasets: Flickr8K, Flickr30K, and MSCOCO. We demonstrate that bidirectional LSTM models achieve performance highly competitive with state-of-the-art results on caption generation, even without integrating additional mechanisms (e.g., object detection or attention models), and significantly outperform recent methods on retrieval tasks.
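To make the architecture described above concrete, the following is a minimal PyTorch sketch of a captioner in which a CNN feature vector feeds two separate LSTMs that read the caption in opposite directions. This is an illustrative reconstruction under stated assumptions, not the authors' original Caffe implementation: the class and method names, the layer sizes (`feat_dim`, `embed_dim`, `hidden_dim`), and the choice of separate output heads are all assumptions.

```python
# Hedged sketch of a bidirectional-LSTM captioner: two separate LSTMs
# conditioned on CNN image features, reading the caption forward and backward.
import torch
import torch.nn as nn

class BiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=4096, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # project CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two separate LSTMs: one reads the caption left-to-right,
        # the other right-to-left.
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fwd_out = nn.Linear(hidden_dim, vocab_size)
        self.bwd_out = nn.Linear(hidden_dim, vocab_size)

    def _run(self, lstm, head, img_feat, tokens):
        # Prepend the projected image feature as the first "word",
        # then predict a token distribution at every time step.
        img = self.img_proj(img_feat).unsqueeze(1)       # (B, 1, E)
        x = torch.cat([img, self.embed(tokens)], dim=1)  # (B, T+1, E)
        h, _ = lstm(x)
        return head(h)                                   # (B, T+1, V) logits

    def forward(self, img_feat, tokens):
        fwd_logits = self._run(self.fwd_lstm, self.fwd_out, img_feat, tokens)
        # The backward LSTM sees the caption reversed in time.
        bwd_logits = self._run(self.bwd_lstm, self.bwd_out, img_feat,
                               torch.flip(tokens, dims=[1]))
        return fwd_logits, bwd_logits
```

Likewise, the augmentation scheme the abstract names (multi-crop, multi-scale, vertical mirror) could be expressed with standard torchvision transforms as below; the crop size, scale range, and flip probability are assumed values, not ones taken from the paper.

```python
from torchvision import transforms

# Random crops at random scales approximate a multi-crop / multi-scale
# scheme; RandomVerticalFlip plays the role of the vertical mirror.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),  # multi-crop, multi-scale (assumed range)
    transforms.RandomVerticalFlip(p=0.5),                 # vertical mirror (assumed probability)
    transforms.ToTensor(),
])
```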