Published in: International Journal of Multimedia Information Retrieval 2/2023

01-12-2023 | Regular Paper

PSNet: position-shift alignment network for image caption

Authors: Lixia Xue, Awen Zhang, Ronggui Wang, Juan Yang

Abstract

Recently, Transformer-based models have gained increasing popularity in the field of image captioning. The global attention mechanism of the Transformer facilitates the integration of region and grid features, leading to a significant improvement in accuracy. However, directly fusing the two feature types can introduce semantic noise, caused by the non-synergistic relationship between region and grid features; moreover, the additional detector required to extract region features reduces the efficiency of the model. In this paper, we introduce a novel position-shift alignment network (PSNet) to exploit the advantages of both feature types. Concretely, we embed a simple DETR detector into the model and extract region features from the grid features, improving model efficiency. Moreover, we propose a P-shift alignment module to suppress the semantic noise caused by the non-synergistic relationship between region and grid features. To validate our model, we conduct extensive experiments and visualizations on the MS-COCO dataset; the results show that PSNet is qualitatively competitive with existing models under comparable experimental conditions.
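The efficiency gain described above comes from deriving region features from the grid feature map with a DETR-style query decoder, rather than running a separate Faster R-CNN pass. As a rough illustration (not the authors' implementation — the function name, dimensions, and single-head attention are illustrative assumptions), region queries can cross-attend over flattened grid features to produce region-level features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_from_grid(grid_feats, queries):
    """DETR-style cross-attention sketch: learned region queries attend
    over grid features, so region features are derived directly from the
    grid map instead of an extra detection backbone pass."""
    d = grid_feats.shape[1]
    attn = softmax(queries @ grid_feats.T / np.sqrt(d))  # (Q, G)
    return attn @ grid_feats                             # (Q, d)

rng = np.random.default_rng(0)
grid = rng.standard_normal((49, 64))     # 7x7 grid of 64-d features
queries = rng.standard_normal((10, 64))  # 10 learned region queries
regions = region_from_grid(grid, queries)
print(regions.shape)  # (10, 64)
```

Because both feature sets now come from the same backbone, the subsequent alignment (the P-shift module in the paper) only has to correct positional mismatch, not reconcile two independent feature extractors.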


Metadata
Title
PSNet: position-shift alignment network for image caption
Authors
Lixia Xue
Awen Zhang
Ronggui Wang
Juan Yang
Publication date
01-12-2023
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 2/2023
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-023-00307-3
