
Weakly-supervised image captioning based on rich contextual information

Published in Multimedia Tools and Applications

Abstract

Automatic generation of image descriptions is a challenging task that attracts broad attention in artificial intelligence. Inspired by methods from computer vision and natural language processing, different approaches have been proposed to solve the problem. However, captions generated by existing approaches lack sufficient contextual information to describe the corresponding images completely, because the labeled captions in training sets describe images only at a basic level and carry little contextual annotation. In this paper, we propose a Weakly-supervised Image Captioning Approach (WICA) to generate captions containing rich contextual information, without requiring complete annotations of contextual information in the datasets. We utilize encoder-decoder neural networks to extract basic captioning features and leverage object detection networks to identify contextual features. We then encode the two levels of features with a phrase-based language model to generate captions with rich contextual information. Comprehensive experimental results reveal that the proposed model outperforms existing baselines in terms of the richness and reasonableness of contextual information in image captioning.
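To make the described pipeline concrete, the sketch below outlines the two-stream design summarized above: a CNN encoder supplies basic captioning features, an object detector supplies contextual features, and the fused representation conditions a sequence decoder. Every concrete choice here (ResNet-50 encoder, Faster R-CNN detector, an LSTM standing in for the phrase-based language model, a bag-of-labels context summary) is an assumption made for illustration only, not the authors' implementation.

```python
# Minimal, illustrative sketch of a two-stream captioning pipeline:
# encoder-decoder caption features + object-detection context features,
# fused before a sequence decoder. Module choices are assumptions.
import torch
import torch.nn as nn
import torchvision

class ContextualCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stream 1: global image features from a CNN encoder (assumed ResNet-50).
        backbone = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # -> (B, 2048, 1, 1)
        # Stream 2: object/context features from a detector (assumed Faster R-CNN).
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        self.detector.eval()  # used for detections only; not fine-tuned in this sketch
        # Fuse both streams into one conditioning vector for the decoder.
        self.fuse = nn.Linear(2048 + 91, hidden_dim)  # 91 = COCO label space (assumption)
        # A plain LSTM stands in for the phrase-based language model.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: list of (3, H, W) tensors of equal size; captions: (B, T) token ids
        global_feats = self.encoder(torch.stack(images)).flatten(1)  # (B, 2048)
        with torch.no_grad():
            detections = self.detector(images)  # list of dicts with "labels", "boxes", ...
        # Summarise detections as a bag-of-labels histogram per image (assumption).
        context = torch.zeros(len(images), 91)
        for i, det in enumerate(detections):
            for label in det["labels"]:
                context[i, label] += 1.0
        # Fused features initialise the decoder state.
        h0 = torch.tanh(self.fuse(torch.cat([global_feats, context], dim=1)))
        state = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))
        hidden, _ = self.lstm(self.embed(captions), state)
        return self.out(hidden)  # (B, T, vocab_size) token scores

if __name__ == "__main__":
    model = ContextualCaptioner(vocab_size=1000)
    imgs = [torch.rand(3, 224, 224) for _ in range(2)]
    caps = torch.randint(0, 1000, (2, 12))
    print(model(imgs, caps).shape)  # torch.Size([2, 12, 1000])
```

In the paper itself, the phrase-based language model would take the place of the plain LSTM decoder, and the detector outputs would presumably be turned into contextual phrases rather than the label histogram used in this sketch.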





Acknowledgements

This research is supported by the National Natural Science Foundation of China (Grant No. 61375054), the Natural Science Foundation of Guangdong Province (Grant No. 2014A030313745), the Basic Scientific Research Program of Shenzhen City (Grant No. JCYJ20160331184440545), and the Cross Fund of the Graduate School at Shenzhen, Tsinghua University (Grant No. JC20140001).

Author information

Corresponding author

Correspondence to Zhe Wang.


Cite this article

Zheng, HT., Wang, Z., Ma, N. et al. Weakly-supervised image captioning based on rich contextual information. Multimed Tools Appl 77, 18583–18599 (2018). https://doi.org/10.1007/s11042-017-5236-2

