Published in: International Journal of Multimedia Information Retrieval 2/2022

12-04-2022 | Regular Paper

A local representation-enhanced recurrent convolutional network for image captioning

Authors: Xiaoyi Wang, Jun Huang



Abstract

Image captioning is a challenging task that aims to generate a natural-language description for an image. Word prediction depends on local linguistic context and fine-grained visual information, and is also guided by previously generated linguistic tokens. However, current captioning models do not fully exploit local visual and linguistic information and therefore produce coarse or incorrect descriptions. Moreover, recent captioning decoders have rarely been built on convolutional neural networks (CNNs), despite their strength in feature extraction. To address these problems, we propose a local representation-enhanced recurrent convolutional network (Lore-RCN). Specifically, we propose a visual convolutional network that produces an enhanced local linguistic context by incorporating selected local visual information and modeling short-term neighboring correlations. We further propose a linguistic convolutional network that produces an enhanced linguistic representation by explicitly modeling long- and short-term correlations, thereby leveraging guiding information from previous linguistic tokens. Experiments on the COCO and Flickr30k datasets verify the superiority of the proposed recurrent CNN-based model.
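The abstract's core idea, fusing local visual features into the linguistic stream and modeling short-term neighboring context with a convolution that respects generation order, can be illustrated with a minimal sketch. This is not the paper's actual architecture (its layer sizes, fusion scheme, and selection mechanism are not given here); all dimensions and the `causal_conv1d` helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv1d(x, w):
    """Causal 1D convolution over a token sequence.

    x: (T, d_in) per-token features; w: (k, d_in, d_out) kernel.
    Each output position t sees only positions <= t, so the
    decoder cannot peek at future words.
    """
    k, d_in, d_out = w.shape
    T = x.shape[0]
    # Left-pad with k-1 zero rows so the output length equals T.
    xp = np.concatenate([np.zeros((k - 1, d_in)), x], axis=0)
    out = np.empty((T, d_out))
    for t in range(T):
        window = xp[t:t + k]                  # the k tokens ending at t
        out[t] = np.einsum('ki,kio->o', window, w)
    return out

# Hypothetical sizes: 5 previous words, 8-dim embeddings,
# one attended visual feature vector per decoding step.
T, d_emb, d_vis, d_hid, k = 5, 8, 8, 16, 3
words = rng.normal(size=(T, d_emb))
visual = rng.normal(size=(T, d_vis))          # e.g. selected region features

# Fuse local visual information into the linguistic stream, then
# model short-term neighboring context with a causal convolution.
fused = np.concatenate([words, visual], axis=1)       # (T, d_emb + d_vis)
w = rng.normal(size=(k, d_emb + d_vis, d_hid)) * 0.1
context = np.maximum(causal_conv1d(fused, w), 0.0)    # ReLU, (T, d_hid)
print(context.shape)  # (5, 16)
```

Stacking such layers widens the receptive field over past tokens, which is one way a convolutional decoder can capture both short- and longer-term correlations without recurrence at each step.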


Metadata
Title: A local representation-enhanced recurrent convolutional network for image captioning
Authors: Xiaoyi Wang, Jun Huang
Publication date: 12-04-2022
Publisher: Springer London
Published in: International Journal of Multimedia Information Retrieval, Issue 2/2022
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI: https://doi.org/10.1007/s13735-022-00231-y
