Published in: International Journal of Multimedia Information Retrieval 4/2022

06-10-2022 | Regular Paper

Prototype local–global alignment network for image–text retrieval

Authors: Lingtao Meng, Feifei Zhang, Xi Zhang, Changsheng Xu


Abstract

Image–text retrieval is a challenging task because it requires thorough multimodal understanding and precise discovery of inter-modality relationships. Most previous approaches perform only global image–text alignment and neglect fine-grained correspondence; those that do explore local region–word alignment usually suffer from a heavy computational burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval that jointly performs fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter perceives hierarchical global semantics by exploring multi-scale global correlations between the image and text. Within the unified model, the local and global alignment modules mutually boost each other's performance. Quantitative and qualitative experimental results on the Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods.
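The efficiency claim for the prototype-based local alignment module can be illustrated with a minimal sketch: instead of comparing every region with every word directly (O(N·M) pairwise scores), both modalities are compared against a small set of K shared prototypes (O((N+M)·K)) and the resulting prototype-affinity profiles are matched. The function name, the pooling choice, and the feature shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2norm(x):
    # Row-wise L2 normalization.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def prototype_local_similarity(regions, words, prototypes):
    """Hypothetical sketch of prototype-mediated local matching.

    regions:    (N, D) region features from an image
    words:      (M, D) word features from a caption
    prototypes: (K, D) shared prototype vectors (K << N*M)
    """
    regions, words, prototypes = map(l2norm, (regions, words, prototypes))
    # Cosine affinities of each modality to the shared prototypes.
    r2p = regions @ prototypes.T   # (N, K) region-prototype alignment
    w2p = words @ prototypes.T     # (M, K) word-prototype alignment
    # Pool each modality's affinities into a single K-dim profile
    # (max-pooling is an assumption; other poolings are possible).
    img_profile = l2norm(r2p.max(axis=0, keepdims=True))  # (1, K)
    txt_profile = l2norm(w2p.max(axis=0, keepdims=True))  # (1, K)
    # Local similarity = cosine between the two prototype profiles.
    return float(img_profile @ txt_profile.T)
```

Because regions and words never interact pairwise, the cost scales linearly in the number of regions plus words rather than in their product, which is the intuition behind the module's efficiency over direct region–word attention.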


Metadata
Title
Prototype local–global alignment network for image–text retrieval
Authors
Lingtao Meng
Feifei Zhang
Xi Zhang
Changsheng Xu
Publication date
06-10-2022
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 4/2022
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-022-00258-1
