Published in: International Journal of Computer Vision 9/2022

31.07.2022

Learning to Prompt for Vision-Language Models

Authors: Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

Abstract

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time on word tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while the pre-trained parameters are kept entirely fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
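
As the abstract describes, CoOp keeps the pre-trained vision-language model fixed and trains only the prompt's context vectors, which are concatenated with the class-name token embeddings and encoded into classification weights that are matched against image features. The PyTorch sketch below illustrates the unified-context idea under simplifying assumptions: the text and image encoders are random stand-in modules rather than CLIP's, and all module names, shapes, and hyperparameters are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of CoOp-style context optimization (unified context).
# The encoders are random stand-ins for CLIP's frozen image/text encoders;
# every name, dimension, and hyperparameter here is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes, prepended to
    fixed class-name token embeddings: [V_1, ..., V_M, CLASS]."""

    def __init__(self, class_name_embeds, n_ctx=16, dim=512):
        super().__init__()
        # The context vectors are the only trainable parameters.
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))
        # Class-name token embeddings stay fixed (stored as a buffer).
        self.register_buffer("cls_embeds", class_name_embeds)  # (n_cls, t, dim)

    def forward(self):
        n_cls = self.cls_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)       # (n_cls, n_ctx, dim)
        # A class-specific variant would learn an independent ctx per class.
        return torch.cat([ctx, self.cls_embeds], dim=1)         # (n_cls, n_ctx + t, dim)


class CoOpClassifier(nn.Module):
    """Frozen encoders + learnable prompts; logits are scaled cosine
    similarities between image and text features."""

    def __init__(self, text_encoder, image_encoder, prompt_learner, logit_scale=100.0):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.prompt_learner = prompt_learner
        self.logit_scale = logit_scale
        # Freeze the entire backbone; gradients still flow through the
        # text encoder back to the context vectors.
        for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
            p.requires_grad_(False)

    def forward(self, images):
        prompts = self.prompt_learner()                          # (n_cls, L, dim)
        text_feat = self.text_encoder(prompts).mean(dim=1)       # stand-in pooling
        img_feat = self.image_encoder(images)                    # (B, dim)
        text_feat = F.normalize(text_feat, dim=-1)
        img_feat = F.normalize(img_feat, dim=-1)
        return self.logit_scale * img_feat @ text_feat.t()       # (B, n_cls)


# Toy usage with random tensors in place of CLIP components and real data.
dim, n_cls, n_cls_tokens = 512, 10, 3
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
text_encoder = nn.TransformerEncoder(layer, num_layers=2)
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
prompt_learner = PromptLearner(torch.randn(n_cls, n_cls_tokens, dim), n_ctx=16, dim=dim)
model = CoOpClassifier(text_encoder, image_encoder, prompt_learner)

optimizer = torch.optim.SGD(model.prompt_learner.parameters(), lr=0.002)
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, n_cls, (8,))
loss = F.cross_entropy(model(images), labels)  # cross-entropy on the few-shot batch
loss.backward()
optimizer.step()
```

Because the backbone stays frozen, only the small set of context parameters is updated, which is consistent with the abstract's claim that one or two shots per class can already beat hand-crafted prompts.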


Footnotes

1. CoOp is pronounced as /ku:p/.

3. We find that the negative results on Food101, for learning-based models including CoOp and linear probe, are caused by the noisy training data with “intense colors and sometimes wrong labels” (Bossard et al., 2014).
 
References
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258
Bossard, L., Guillaumin, M., & Van Gool, L. (2014). Food-101 – mining discriminative components with random forests. In ECCV
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In ICML
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In CVPR
Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR
Desai, K., & Johnson, J. (2021). VirTex: Learning visual representations from textual annotations. In CVPR
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR
Elhoseiny, M., Saleh, B., & Elgammal, A. (2013). Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV
Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR-W
Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In NeurIPS
Fürst, A., Rumetshofer, E., Tran, V., Ramsauer, H., Tang, F., Lehner, J., Kreil, D., Kopp, M., Klambauer, G., Bitto-Nemling, A., et al. (2021). CLOOB: Modern Hopfield networks with InfoLOOB outperform CLIP. arXiv preprint arXiv:2110.11316
Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., & Qiao, Y. (2021). CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544
Gao, T., Fisch, A., & Chen, D. (2020). Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723
Gomez, L., Patel, Y., Rusiñol, M., Karatzas, D., & Jawahar, C. (2017). Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR
Helber, P., Bischke, B., Dengel, A., & Borth, D. (2019). EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Hénaff, O. J., Srinivas, A., Fauw, J. D., Razavi, A., Doersch, C., Eslami, S. M. A., & van den Oord, A. (2020). Data-efficient image recognition with contrastive predictive coding. In ICML
Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., & Gilmer, J. (2021a). The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., & Song, D. (2021b). Natural adversarial examples. In CVPR
Jia, C., Yang, Y., Xia, Y., Chen, Y. T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In ICML
Jia, M., Tang, L., Chen, B. C., Cardie, C., Belongie, S., Hariharan, B., & Lim, S. N. (2022). Visual prompt tuning. arXiv preprint arXiv:2203.12119
Jiang, Z., Xu, F. F., Araki, J., & Neubig, G. (2020). How can we know what language models know? In ACL
Joulin, A., Van Der Maaten, L., Jabri, A., & Vasilache, N. (2016). Learning visual features from large weakly supervised data. In ECCV
Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D object representations for fine-grained categorization. In ICCV-W
Lei Ba, J., Swersky, K., Fidler, S., et al. (2015). Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV
Lester, B., Al-Rfou, R., & Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691
Li, A., Jabri, A., Joulin, A., & van der Maaten, L. (2017). Learning visual n-grams from web data. In ICCV
Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., & Yan, J. (2021). Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021a). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021b). GPT understands, too. arXiv preprint arXiv:2103.10385
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., & Vedaldi, A. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151
Nilsback, M. E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In ICVGIP
Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. (2012). Cats and dogs. In CVPR
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language models as knowledge bases? In EMNLP
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In ICML
Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do ImageNet classifiers generalize to ImageNet? In ICML
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In ACL
Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., & Singh, S. (2020). AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP
Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., & Kiela, D. (2021). FLAVA: A foundational language and vision alignment model. arXiv preprint arXiv:2112.04482
Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Manning, C. D., & Ng, A. Y. (2013). Zero-shot learning through cross-modal transfer. In NeurIPS
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., & Schmidt, L. (2020). Measuring robustness to natural distribution shifts in image classification. In NeurIPS
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., & Isola, P. (2020). Rethinking few-shot image classification: A good embedding is all you need? In ECCV
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS
Wang, D., Shelhamer, E., Liu, S., Olshausen, B., & Darrell, T. (2020). Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726
Wang, H., Ge, S., Lipton, Z., & Xing, E. P. (2019). Learning robust global representations by penalizing local predictive power. In NeurIPS
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR
Yuan, L., Chen, D., Chen, Y. L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432
Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., & Langlotz, C. P. (2020). Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747
Zhong, Z., Friedman, D., & Chen, D. (2021). Factual probing is [MASK]: Learning vs. learning to recall. In NAACL
Zhou, K., Liu, Z., Qiao, Y., Xiang, T., & Loy, C. C. (2021). Domain generalization in vision: A survey. arXiv preprint arXiv:2103.02503
Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. arXiv preprint arXiv:2203.05557
Metadata
Title
Learning to Prompt for Vision-Language Models
Authors
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
Publication date
31.07.2022
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 9/2022
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-022-01653-1
