Abstract
Automatic generation of image descriptions is a challenging task that attracts broad attention in artificial intelligence. Inspired by methods from computer vision and natural language processing, different approaches have been proposed to solve the problem. However, captions generated by existing approaches lack sufficient contextual information to describe the corresponding images completely, because the labeled captions in training sets describe images only at a basic level and carry few contextual annotations. In this paper, we propose a Weakly-supervised Image Captioning Approach (WICA) that generates captions containing rich contextual information without requiring complete contextual annotations in the datasets. We utilize encoder-decoder neural networks to extract basic captioning features and leverage object detection networks to identify contextual features. We then encode the two levels of features with a phrase-based language model to generate captions with rich contextual information. Comprehensive experimental results show that the proposed model outperforms existing baselines in terms of the richness and reasonableness of contextual information for image captioning.
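The abstract describes a two-branch pipeline: an encoder-decoder branch producing a basic caption, an object-detection branch supplying contextual entities, and a phrase-based language model fusing the two. The following is a minimal, illustrative Python sketch of that data flow only, assuming the structure stated above; it is not the authors' implementation. Every name in it (extract_caption_features, detect_context_objects, PhraseLanguageModel) is a hypothetical placeholder, and the fusion step is reduced to naive phrase appending.

```python
# Minimal sketch of the WICA pipeline as outlined in the abstract.
# All names below are hypothetical placeholders, not the paper's code.

from dataclasses import dataclass
from typing import List

@dataclass
class ContextObject:
    label: str         # e.g. "grass"
    confidence: float  # detector score in [0, 1]

def extract_caption_features(image) -> List[str]:
    """Stand-in for the encoder-decoder branch: a CNN encoder feeding an
    RNN decoder that emits a basic caption as a token sequence."""
    return ["a", "dog", "running"]  # placeholder output

def detect_context_objects(image) -> List[ContextObject]:
    """Stand-in for the object-detection branch that supplies contextual
    entities the basic caption misses."""
    return [ContextObject("grass", 0.91), ContextObject("frisbee", 0.84)]

class PhraseLanguageModel:
    """Stand-in for the phrase-based language model that fuses the two
    feature levels; here reduced to appending a context phrase."""
    def __init__(self, min_confidence: float = 0.5):
        self.min_confidence = min_confidence

    def compose(self, base_tokens: List[str],
                context: List[ContextObject]) -> str:
        kept = [c.label for c in context if c.confidence >= self.min_confidence]
        phrase = " with " + " and ".join(kept) if kept else ""
        return " ".join(base_tokens) + phrase

def caption_image(image) -> str:
    base = extract_caption_features(image)   # level 1: basic caption
    context = detect_context_objects(image)  # level 2: contextual objects
    return PhraseLanguageModel().compose(base, context)

print(caption_image(image=None))  # -> "a dog running with grass and frisbee"
```

The sketch is meant only to make the two feature levels concrete; in the paper the fusion is learned by the phrase-based language model rather than performed by a fixed template.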
Acknowledgements
This research is supported by the National Natural Science Foundation of China (Grant No. 61375054), the Natural Science Foundation of Guangdong Province (Grant No. 2014A030313745), the Basic Scientific Research Program of Shenzhen City (Grant No. JCYJ20160331184440545), and the Cross Fund of the Graduate School at Shenzhen, Tsinghua University (Grant No. JC20140001).
Cite this article
Zheng, HT., Wang, Z., Ma, N. et al. Weakly-supervised image captioning based on rich contextual information. Multimed Tools Appl 77, 18583–18599 (2018). https://doi.org/10.1007/s11042-017-5236-2