Published in: International Journal of Multimedia Information Retrieval 2/2022

12-04-2022 | Regular Paper

A local representation-enhanced recurrent convolutional network for image captioning

Authors: Xiaoyi Wang, Jun Huang



Abstract

Image captioning is a challenging task that aims to generate a natural-language description for an image. Word prediction depends on local linguistic context and fine-grained visual information, and is also guided by previously generated linguistic tokens. However, current captioning models do not fully exploit local visual and linguistic information and therefore produce coarse or incorrect descriptions. Moreover, recent captioning decoders have rarely been built on convolutional neural networks (CNNs), despite their strength in feature extraction. To address these problems, we propose a local representation-enhanced recurrent convolutional network (Lore-RCN). Specifically, we propose a visual convolutional network that produces an enhanced local linguistic context by incorporating selected local visual information and modeling short-term neighboring correlations. We further propose a linguistic convolutional network that produces an enhanced linguistic representation by explicitly modeling long- and short-term correlations, thereby leveraging guiding information from previous linguistic tokens. Experiments on the COCO and Flickr30k datasets verify the superiority of the proposed recurrent CNN-based model.
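The abstract's core idea, fusing local visual features into the linguistic stream and modeling short-term neighboring context with a convolution that respects generation order, can be illustrated with a minimal sketch. This is not the paper's actual architecture (its layer sizes, fusion scheme, and selection mechanism are not given here); all dimensions and the `causal_conv1d` helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv1d(x, w):
    """Causal 1D convolution over a token sequence.

    x: (T, d_in) per-token features; w: (k, d_in, d_out) kernel.
    Each output position t sees only positions <= t, so the
    decoder cannot peek at future words.
    """
    k, d_in, d_out = w.shape
    T = x.shape[0]
    # Left-pad with k-1 zero rows so the output length equals T.
    xp = np.concatenate([np.zeros((k - 1, d_in)), x], axis=0)
    out = np.empty((T, d_out))
    for t in range(T):
        window = xp[t:t + k]                  # the k tokens ending at t
        out[t] = np.einsum('ki,kio->o', window, w)
    return out

# Hypothetical sizes: 5 previous words, 8-dim embeddings,
# one attended visual feature vector per decoding step.
T, d_emb, d_vis, d_hid, k = 5, 8, 8, 16, 3
words = rng.normal(size=(T, d_emb))
visual = rng.normal(size=(T, d_vis))          # e.g. selected region features

# Fuse local visual information into the linguistic stream, then
# model short-term neighboring context with a causal convolution.
fused = np.concatenate([words, visual], axis=1)       # (T, d_emb + d_vis)
w = rng.normal(size=(k, d_emb + d_vis, d_hid)) * 0.1
context = np.maximum(causal_conv1d(fused, w), 0.0)    # ReLU, (T, d_hid)
print(context.shape)  # (5, 16)
```

Stacking such layers widens the receptive field over past tokens, which is one way a convolutional decoder can capture both short- and longer-term correlations without recurrence at each step.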


Metadata
Title: A local representation-enhanced recurrent convolutional network for image captioning
Authors: Xiaoyi Wang, Jun Huang
Publication date: 12-04-2022
Publisher: Springer London
Published in: International Journal of Multimedia Information Retrieval, Issue 2/2022
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI: https://doi.org/10.1007/s13735-022-00231-y
