Published in: International Journal of Multimedia Information Retrieval 2/2023

01-12-2023 | Regular Paper

PSNet: position-shift alignment network for image caption

Authors: Lixia Xue, Awen Zhang, Ronggui Wang, Juan Yang

Abstract

Recently, Transformer-based models have gained increasing popularity in the field of image captioning. The global attention mechanism of the Transformer facilitates the integration of region and grid features, leading to a significant improvement in accuracy. However, directly fusing the two feature types can introduce semantic noise, caused by the non-synergistic relationship between region and grid features; moreover, the additional detector required to extract region features reduces the efficiency of the model. In this paper, we introduce a novel position-shift alignment network (PSNet) to exploit the advantages of both feature types. Concretely, we embed a simple DETR detector into the model and extract region features from the grid features, improving model efficiency. Moreover, we propose a P-shift alignment module to suppress the semantic noise caused by the non-synergistic relationship between region and grid features. To validate our model, we conduct extensive experiments and visualizations on the MS-COCO dataset; the results show that PSNet is qualitatively competitive with existing models under comparable experimental conditions.
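The efficiency gain described above comes from deriving region features from the grid feature map with a DETR-style query decoder, rather than running a separate Faster R-CNN pass. As a rough illustration (not the authors' implementation — the function name, dimensions, and single-head attention are illustrative assumptions), region queries can cross-attend over flattened grid features to produce region-level features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_from_grid(grid_feats, queries):
    """DETR-style cross-attention sketch: learned region queries attend
    over grid features, so region features are derived directly from the
    grid map instead of an extra detection backbone pass."""
    d = grid_feats.shape[1]
    attn = softmax(queries @ grid_feats.T / np.sqrt(d))  # (Q, G)
    return attn @ grid_feats                             # (Q, d)

rng = np.random.default_rng(0)
grid = rng.standard_normal((49, 64))     # 7x7 grid of 64-d features
queries = rng.standard_normal((10, 64))  # 10 learned region queries
regions = region_from_grid(grid, queries)
print(regions.shape)  # (10, 64)
```

Because both feature sets now come from the same backbone, the subsequent alignment (the P-shift module in the paper) only has to correct positional mismatch, not reconcile two independent feature extractors.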


Metadata
Title
PSNet: position-shift alignment network for image caption
Authors
Lixia Xue
Awen Zhang
Ronggui Wang
Juan Yang
Publication date
01-12-2023
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 2/2023
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-023-00307-3
