
01.03.2024 | Regular Paper

Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts

Authors: Huaying Zhang, Rintaro Yanagi, Ren Togo, Takahiro Ogawa, Miki Haseyama

Published in: International Journal of Multimedia Information Retrieval | Issue 1/2024


Abstract

A novel cross-modal image retrieval method realized by parameter-efficiently tuning a pre-trained cross-modal model is proposed in this study. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of image-text pairs annotated specifically for training, which may be unavailable for specific databases. One approach to reducing this dependency on the amount and quality of training data is to fine-tune a pre-trained model, which can improve retrieval accuracy on a specific personal image database. However, this approach is parameter-inefficient because a separate model must be trained and retained for each database. We therefore propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close in the common embedding space, the proposed method improves retrieval accuracy while updating only a few parameters. Experimental results demonstrate that the proposed method is effective in improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
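The mechanism the abstract describes, trainable textual and visual prompt vectors concatenated with the inputs of a frozen pre-trained dual encoder and optimized so that paired texts and images move closer in the common embedding space, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the frozen encoders are stand-in transformer layers rather than an actual pre-trained cross-modal model such as CLIP, a symmetric InfoNCE loss is assumed as one standard way to pull paired embeddings together, and all names, prompt lengths, and dimensions (PromptedRetriever, contrastive_loss, dim=512, etc.) are illustrative.

```python
# Minimal sketch of prompt tuning for cross-modal retrieval (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedRetriever(nn.Module):
    def __init__(self, n_txt_prompts=4, n_vis_prompts=4, dim=512):
        super().__init__()
        # Trainable prompts: multi-dimensional vectors concatenated with the
        # token/patch embeddings. These are the ONLY parameters updated.
        self.txt_prompt = nn.Parameter(torch.randn(n_txt_prompts, dim) * 0.02)
        self.vis_prompt = nn.Parameter(torch.randn(n_vis_prompts, dim) * 0.02)
        # Stand-ins for the frozen pre-trained text/image encoders.
        self.txt_encoder = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.vis_encoder = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        for enc in (self.txt_encoder, self.vis_encoder):
            for p in enc.parameters():
                p.requires_grad_(False)

    def encode_text(self, tok_emb):      # tok_emb: (B, L, dim) token embeddings
        p = self.txt_prompt.expand(tok_emb.size(0), -1, -1)
        h = self.txt_encoder(torch.cat([p, tok_emb], dim=1))
        return F.normalize(h.mean(dim=1), dim=-1)

    def encode_image(self, patch_emb):   # patch_emb: (B, N, dim) patch embeddings
        p = self.vis_prompt.expand(patch_emb.size(0), -1, -1)
        h = self.vis_encoder(torch.cat([p, patch_emb], dim=1))
        return F.normalize(h.mean(dim=1), dim=-1)

def contrastive_loss(txt, img, tau=0.07):
    # Symmetric InfoNCE: paired text/image embeddings are pulled together
    # in the common space, unpaired ones pushed apart.
    logits = txt @ img.t() / tau
    labels = torch.arange(txt.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

model = PromptedRetriever()
# Only the two prompt tensors are passed to the optimizer.
opt = torch.optim.AdamW([model.txt_prompt, model.vis_prompt], lr=1e-3)
txt_in, img_in = torch.randn(8, 16, 512), torch.randn(8, 49, 512)
loss = contrastive_loss(model.encode_text(txt_in), model.encode_image(img_in))
loss.backward()
opt.step()
```

Because the encoders stay frozen, only the prompt vectors (a few thousand parameters here) need to be stored per database, which is the parameter-efficiency argument the abstract makes against keeping a separately fine-tuned model for each database.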


Metadata
Title
Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts
Authors
Huaying Zhang
Rintaro Yanagi
Ren Togo
Takahiro Ogawa
Miki Haseyama
Publication date
01.03.2024
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 1/2024
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-024-00322-y
