Abstract
Automatic generation of image descriptions is a challenging task that attracts broad attention in artificial intelligence. Inspired by methods from computer vision and natural language processing, different approaches have been proposed to solve the problem. However, captions generated by existing approaches lack sufficient contextual information to describe the corresponding images completely, because the labeled captions in training sets describe images only at a basic level and carry few contextual annotations. In this paper, we propose a Weakly-supervised Image Captioning Approach (WICA) that generates captions containing rich contextual information without requiring complete contextual annotations in the datasets. We utilize encoder-decoder neural networks to extract basic captioning features and leverage object detection networks to identify contextual features. We then encode the two levels of features with a phrase-based language model to generate captions with rich contextual information. Comprehensive experimental results show that the proposed model outperforms existing baselines in terms of the richness and reasonableness of contextual information for image captioning.
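The abstract describes a two-branch pipeline: an encoder-decoder branch producing a basic caption, an object-detection branch supplying contextual entities, and a phrase-based language model fusing the two. The following is a minimal, illustrative Python sketch of that data flow only, assuming the structure stated above; it is not the authors' implementation. Every name in it (extract_caption_features, detect_context_objects, PhraseLanguageModel) is a hypothetical placeholder, and the fusion step is reduced to naive phrase appending.

```python
# Minimal sketch of the WICA pipeline as outlined in the abstract.
# All names below are hypothetical placeholders, not the paper's code.

from dataclasses import dataclass
from typing import List

@dataclass
class ContextObject:
    label: str         # e.g. "grass"
    confidence: float  # detector score in [0, 1]

def extract_caption_features(image) -> List[str]:
    """Stand-in for the encoder-decoder branch: a CNN encoder feeding an
    RNN decoder that emits a basic caption as a token sequence."""
    return ["a", "dog", "running"]  # placeholder output

def detect_context_objects(image) -> List[ContextObject]:
    """Stand-in for the object-detection branch that supplies contextual
    entities the basic caption misses."""
    return [ContextObject("grass", 0.91), ContextObject("frisbee", 0.84)]

class PhraseLanguageModel:
    """Stand-in for the phrase-based language model that fuses the two
    feature levels; here reduced to appending a context phrase."""
    def __init__(self, min_confidence: float = 0.5):
        self.min_confidence = min_confidence

    def compose(self, base_tokens: List[str],
                context: List[ContextObject]) -> str:
        kept = [c.label for c in context if c.confidence >= self.min_confidence]
        phrase = " with " + " and ".join(kept) if kept else ""
        return " ".join(base_tokens) + phrase

def caption_image(image) -> str:
    base = extract_caption_features(image)   # level 1: basic caption
    context = detect_context_objects(image)  # level 2: contextual objects
    return PhraseLanguageModel().compose(base, context)

print(caption_image(image=None))  # -> "a dog running with grass and frisbee"
```

The sketch is meant only to make the two feature levels concrete; in the paper the fusion is learned by the phrase-based language model rather than performed by a fixed template.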
Acknowledgements
This research is supported by the National Natural Science Foundation of China (Grant No. 61375054), the Natural Science Foundation of Guangdong Province (Grant No. 2014A030313745), the Basic Scientific Research Program of Shenzhen City (Grant No. JCYJ20160331184440545), and the Cross Fund of the Graduate School at Shenzhen, Tsinghua University (Grant No. JC20140001).
Cite this article
Zheng, HT., Wang, Z., Ma, N. et al. Weakly-supervised image captioning based on rich contextual information. Multimed Tools Appl 77, 18583–18599 (2018). https://doi.org/10.1007/s11042-017-5236-2