A Comprehensive Survey of Deep Learning for Image Captioning

Published: 04 February 2019

Abstract

Generating a description of an image is called image captioning. Image captioning requires recognizing the important objects in an image, along with their attributes and the relationships among them, and it must also produce syntactically and semantically correct sentences. Deep-learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey article, we present a comprehensive review of existing deep-learning-based image captioning techniques. We discuss the foundations of these techniques and analyze their performance, strengths, and limitations. We also discuss the datasets and evaluation metrics popularly used in deep-learning-based automatic image captioning.
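Many of the deep-learning captioning methods covered in this survey follow an encoder-decoder design: a convolutional network encodes the image into a feature vector, and a recurrent network decodes that vector into a word sequence. As orientation only, the sketch below outlines that generic pipeline, assuming PyTorch and torchvision; the class names, hyperparameters, and the choice of ResNet-50 and LSTM are illustrative assumptions, not code from this survey or from any specific paper it reviews.

# Minimal sketch of a generic CNN-encoder / LSTM-decoder captioner (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Encodes an image into a fixed-length feature vector with a pretrained CNN."""
    def __init__(self, embed_dim: int):
        super().__init__()
        # torchvision >= 0.13 API; older versions use pretrained=True instead.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                     # keep the CNN frozen in this sketch
            feats = self.backbone(images).flatten(1)
        return self.fc(feats)                     # (batch, embed_dim)


class DecoderRNN(nn.Module):
    """Generates a caption word-by-word, conditioned on the image feature."""
    def __init__(self, embed_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: the image feature acts as the first input "token".
        word_embeds = self.embed(captions)                           # (batch, T, embed_dim)
        inputs = torch.cat([image_feat.unsqueeze(1), word_embeds], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                      # (batch, T+1, vocab_size)


# Usage sketch (random tensors stand in for real data; vocabulary handling is omitted):
encoder = EncoderCNN(embed_dim=256)
decoder = DecoderRNN(embed_dim=256, hidden_dim=512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10000, (4, 15))
logits = decoder(encoder(images), captions)      # train with cross-entropy against shifted captions

The surveyed methods differ mainly in how they refine this template, for example by adding visual attention over spatial CNN features or by replacing the cross-entropy objective with reinforcement-learning rewards based on caption metrics.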

Published in

ACM Computing Surveys, Volume 51, Issue 6 (November 2019), 786 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3303862
Editor: Sartaj Sahni
Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 April 2018
• Revised: 1 October 2018
• Accepted: 1 October 2018
• Published: 4 February 2019
