Published in: International Journal of Multimedia Information Retrieval 4/2022

06-10-2022 | Regular Paper

Prototype local–global alignment network for image–text retrieval

Authors: Lingtao Meng, Feifei Zhang, Xi Zhang, Changsheng Xu


Abstract

Image–text retrieval is a challenging task because it requires thorough multimodal understanding and precise discovery of inter-modality relationships. Most previous approaches perform only global image–text alignment and neglect fine-grained correspondence; those that do explore local region–word alignment usually suffer from a heavy computational burden. In this paper, we propose a prototype local–global alignment (PLGA) network for image–text retrieval that jointly performs fine-grained local alignment and high-level global alignment. Specifically, our PLGA contains two key components: a prototype-based local alignment module and a multi-scale global alignment module. The former enables efficient fine-grained local matching by combining region–prototype alignment and word–prototype alignment, and the latter perceives hierarchical global semantics by exploring multi-scale global correlations between the image and text. Within the unified model, the local and global alignment modules mutually boost each other's performance. Quantitative and qualitative experimental results on the Flickr30K and MS-COCO benchmarks demonstrate that our proposed approach performs favorably against state-of-the-art methods.
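The efficiency claim for the prototype-based local alignment module can be illustrated with a minimal sketch: instead of comparing every region with every word directly (O(N·M) pairwise scores), both modalities are compared against a small set of K shared prototypes (O((N+M)·K)) and the resulting prototype-affinity profiles are matched. The function name, the pooling choice, and the feature shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def l2norm(x):
    # Row-wise L2 normalization.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def prototype_local_similarity(regions, words, prototypes):
    """Hypothetical sketch of prototype-mediated local matching.

    regions:    (N, D) region features from an image
    words:      (M, D) word features from a caption
    prototypes: (K, D) shared prototype vectors (K << N*M)
    """
    regions, words, prototypes = map(l2norm, (regions, words, prototypes))
    # Cosine affinities of each modality to the shared prototypes.
    r2p = regions @ prototypes.T   # (N, K) region-prototype alignment
    w2p = words @ prototypes.T     # (M, K) word-prototype alignment
    # Pool each modality's affinities into a single K-dim profile
    # (max-pooling is an assumption; other poolings are possible).
    img_profile = l2norm(r2p.max(axis=0, keepdims=True))  # (1, K)
    txt_profile = l2norm(w2p.max(axis=0, keepdims=True))  # (1, K)
    # Local similarity = cosine between the two prototype profiles.
    return float(img_profile @ txt_profile.T)
```

Because regions and words never interact pairwise, the cost scales linearly in the number of regions plus words rather than in their product, which is the intuition behind the module's efficiency over direct region–word attention.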


Metadata
Title
Prototype local–global alignment network for image–text retrieval
Authors
Lingtao Meng
Feifei Zhang
Xi Zhang
Changsheng Xu
Publication date
06-10-2022
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 4/2022
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-022-00258-1
