
01.03.2024 | Regular Paper

Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts

Authors: Huaying Zhang, Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama

Published in: International Journal of Multimedia Information Retrieval | Issue 1/2024


Abstract

A novel cross-modal image retrieval method realized by parameter-efficiently tuning a pre-trained cross-modal model is proposed in this study. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of image-text pairs annotated specifically for training, which may be unavailable for specific databases. One approach to reducing this dependency on the amount and quality of training data is to fine-tune a pre-trained model, which can improve retrieval accuracy on a specific personal image database. However, this approach is parameter-inefficient because a separate model must be trained and retained for each database. We therefore propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close in the common embedding space, the proposed method improves retrieval accuracy while updating only a few parameters. Experimental results demonstrate that the proposed method is effective in improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
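The mechanism the abstract describes, trainable textual and visual prompt vectors concatenated with the inputs of a frozen pre-trained dual encoder and optimized so that paired texts and images move closer in the common embedding space, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the frozen encoders are stand-in transformer layers rather than an actual pre-trained cross-modal model such as CLIP, a symmetric InfoNCE loss is assumed as one standard way to pull paired embeddings together, and all names, prompt lengths, and dimensions (PromptedRetriever, contrastive_loss, dim=512, etc.) are illustrative.

```python
# Minimal sketch of prompt tuning for cross-modal retrieval (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedRetriever(nn.Module):
    def __init__(self, n_txt_prompts=4, n_vis_prompts=4, dim=512):
        super().__init__()
        # Trainable prompts: multi-dimensional vectors concatenated with the
        # token/patch embeddings. These are the ONLY parameters updated.
        self.txt_prompt = nn.Parameter(torch.randn(n_txt_prompts, dim) * 0.02)
        self.vis_prompt = nn.Parameter(torch.randn(n_vis_prompts, dim) * 0.02)
        # Stand-ins for the frozen pre-trained text/image encoders.
        self.txt_encoder = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.vis_encoder = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        for enc in (self.txt_encoder, self.vis_encoder):
            for p in enc.parameters():
                p.requires_grad_(False)

    def encode_text(self, tok_emb):      # tok_emb: (B, L, dim) token embeddings
        p = self.txt_prompt.expand(tok_emb.size(0), -1, -1)
        h = self.txt_encoder(torch.cat([p, tok_emb], dim=1))
        return F.normalize(h.mean(dim=1), dim=-1)

    def encode_image(self, patch_emb):   # patch_emb: (B, N, dim) patch embeddings
        p = self.vis_prompt.expand(patch_emb.size(0), -1, -1)
        h = self.vis_encoder(torch.cat([p, patch_emb], dim=1))
        return F.normalize(h.mean(dim=1), dim=-1)

def contrastive_loss(txt, img, tau=0.07):
    # Symmetric InfoNCE: paired text/image embeddings are pulled together
    # in the common space, unpaired ones pushed apart.
    logits = txt @ img.t() / tau
    labels = torch.arange(txt.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

model = PromptedRetriever()
# Only the two prompt tensors are passed to the optimizer.
opt = torch.optim.AdamW([model.txt_prompt, model.vis_prompt], lr=1e-3)
txt_in, img_in = torch.randn(8, 16, 512), torch.randn(8, 49, 512)
loss = contrastive_loss(model.encode_text(txt_in), model.encode_image(img_in))
loss.backward()
opt.step()
```

Because the encoders stay frozen, only the prompt vectors (a few thousand parameters here) need to be stored per database, which is the parameter-efficiency argument the abstract makes against keeping a separately fine-tuned model for each database.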


Metadata
Title
Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts
Authors
Huaying Zhang
Rintaro Yanagi
Ren Togo
Takahiro Ogawa
Miki Haseyama
Publication date
01.03.2024
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2024
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-024-00322-y
