Published in: Artificial Intelligence Review 5/2022

29-11-2021

Neural attention for image captioning: review of outstanding methods

Authors: Zanyar Zohourianshahzadi, Jugal K. Kalita

Abstract

Image captioning is the task of automatically generating sentences that describe an input image as accurately as possible. The most successful recent techniques for this task use attentive deep learning models, which vary in how their attention mechanisms are designed. In this survey, we review the literature on attentive deep learning models for image captioning. Rather than offering a comprehensive review of all prior work on deep image captioning models, we explain the various types of attention mechanisms these models use. The most successful deep learning models for image captioning follow the encoder-decoder architecture, although they differ in how they employ attention. By analyzing the performance results of different attentive deep models for image captioning, we aim to identify the most successful types of attention mechanisms. Soft attention, bottom-up attention, and multi-head attention are the mechanisms most widely used in state-of-the-art attentive deep learning models for image captioning. At present, the best results are achieved by variants of multi-head attention combined with bottom-up attention.
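To make the soft attention mentioned above concrete: it computes a context vector as a softmax-weighted average of feature vectors (e.g. image region features), with weights derived from query-key similarity. The sketch below is a minimal pedagogical illustration under our own naming and toy dimensions; it is not the implementation of any particular model surveyed.

```python
import math

def softmax(scores):
    # Numerically stable softmax: shift by the max, exponentiate, normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(query, keys, values):
    # Soft (deterministic) attention: score each key against the query,
    # turn the scores into a probability distribution, and return the
    # corresponding weighted average of the value vectors.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights

# Toy example: two "image regions"; the query is more similar to region 0,
# so region 0 receives the larger attention weight.
context, weights = soft_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[2.0, 0.0], [0.0, 2.0]],
)
print(weights)  # the two weights sum to 1, with more mass on region 0
```

Multi-head attention runs several such attention computations in parallel on learned projections of the inputs and concatenates the resulting context vectors, which is the variant the best-performing models in this survey build on.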


go back to reference Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In: Computer vision: ECCV 2018, pp 711–727 Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In: Computer vision: ECCV 2018, pp 711–727
go back to reference Yao T, Pan Y, Li Y, Mei T (2019) Hierarchy parsing for image captioning. In: The IEEE international conference on computer vision (ICCV) Yao T, Pan Y, Li Y, Mei T (2019) Hierarchy parsing for image captioning. In: The IEEE international conference on computer vision (ICCV)
go back to reference You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: The IEEE conference on computer vision and pattern recognition (CVPR) You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: The IEEE conference on computer vision and pattern recognition (CVPR)
go back to reference Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2:67–78CrossRef Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2:67–78CrossRef
go back to reference Zhou Y, Wang M, Liu D, Hu Z, Zhang H (2020) More grounded image captioning by distilling image-text matching model. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR) Zhou Y, Wang M, Liu D, Hu Z, Zhang H (2020) More grounded image captioning by distilling image-text matching model. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR)
go back to reference Zhou L, Xu C, Koch P, Corso JJ (2017) Watch what you just said: image captioning with text-conditional attention. In: Proceedings of the on thematic workshops of ACM multimedia 2017, pp 305–313. Association for Computing Machinery, New York, NY. https://doi.org/10.1145/3126686.3126717 Zhou L, Xu C, Koch P, Corso JJ (2017) Watch what you just said: image captioning with text-conditional attention. In: Proceedings of the on thematic workshops of ACM multimedia 2017, pp 305–313. Association for Computing Machinery, New York, NY. https://​doi.​org/​10.​1145/​3126686.​3126717
go back to reference Zitnick CL, Dollár P (2014) Edge boxes: Locating object proposals from edges. In: Computer vision: ECCV 2014, pp 391–405. Springer, New York Zitnick CL, Dollár P (2014) Edge boxes: Locating object proposals from edges. In: Computer vision: ECCV 2014, pp 391–405. Springer, New York
Metadata
Title: Neural attention for image captioning: review of outstanding methods
Authors: Zanyar Zohourianshahzadi, Jugal K. Kalita
Publication date: 29-11-2021
Publisher: Springer Netherlands
Published in: Artificial Intelligence Review, Issue 5/2022
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI: https://doi.org/10.1007/s10462-021-10092-2
