Research Article
DOI: 10.1145/2964284.2964299

Image Captioning with Deep Bidirectional LSTMs

Published: 01 October 2016

ABSTRACT

This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by exploiting history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of the nonlinearity transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting when training deep models. We visualize the evolution of the bidirectional LSTM's internal states over time and qualitatively analyze how our models "translate" an image into a sentence. The proposed models are evaluated on caption generation and image-sentence retrieval tasks with three benchmark datasets: Flickr8K, Flickr30K and MSCOCO. We demonstrate that bidirectional LSTM models achieve performance highly competitive with the state of the art on caption generation, even without integrating additional mechanisms (e.g., object detection or attention models), and significantly outperform recent methods on the retrieval task.
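The following is a minimal sketch, not the authors' released code, of the architecture the abstract describes: a deep CNN encodes the image, and two separate LSTMs process the caption in forward and backward order so that each word is conditioned on both history and future context. The module choices here (a VGG-16 backbone, 512-dimensional embeddings, per-direction output heads, and the way the two directions would be combined at inference) are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class BiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CNN encoder: a pretrained VGG-16 up to its penultimate layer (4096-d
        # feature). The specific backbone is an assumption for this sketch.
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.cnn = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten(),
                                 *list(vgg.classifier[:-1]))
        self.img_proj = nn.Linear(4096, embed_dim)

        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Two separate LSTMs: one reads the caption left-to-right,
        # the other right-to-left.
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Each direction predicts the next (resp. previous) word.
        self.fwd_out = nn.Linear(hidden_dim, vocab_size)
        self.bwd_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) word indices
        img_feat = self.img_proj(self.cnn(images)).unsqueeze(1)       # (B, 1, E)
        fwd_in = torch.cat([img_feat, self.embed(captions)], dim=1)   # image as first "token"
        bwd_in = torch.cat([img_feat, self.embed(captions.flip(1))], dim=1)
        fwd_h, _ = self.fwd_lstm(fwd_in)
        bwd_h, _ = self.bwd_lstm(bwd_in)
        # Per-direction word logits; at inference the two directions' sentence
        # scores could be combined to select the final caption.
        return self.fwd_out(fwd_h), self.bwd_out(bwd_h)
```

The data augmentation named in the abstract (multi-crop, multi-scale, vertical mirror) can be approximated with standard torchvision transforms; the crop size, scale, and the interpretation of "vertical mirror" as a flip about the vertical axis are assumptions, not the paper's reported settings.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),                 # multi-scale: vary this size across samples
    transforms.RandomCrop(224),             # multi-crop: random crops of the resized image
    transforms.RandomHorizontalFlip(p=0.5), # mirror about the vertical axis (assumed reading of "vertical mirror")
    transforms.ToTensor(),
])
```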


Published in

MM '16: Proceedings of the 24th ACM International Conference on Multimedia
October 2016, 1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284
Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions, 22%. Overall acceptance rate: 995 of 4,171 submissions, 24%.
