1 Introduction
Hand-engineered templates like "a photo of a {cls_name}" or "a black and white photo of a {cls_name}" were passed through the text encoder of the V&L model to create class-specific weights for category cls_name that can be used for zero-shot recognition. Following research in NLP (Lester et al., 2021; Li & Liang, 2021), subsequent work (Zhou et al., 2022, 2022a) proposed replacing the manually picked templates with a sequence of learnable vectors, also coined soft prompts, which are fed as input to the text encoder along with the class name cls_name.
The soft prompts are learned from a few training examples, with the parameters of the entire V&L model kept frozen. The whole process can be seen as parameter-efficient fine-tuning of the V&L model on a small training dataset.
- We propose, for the first time, language-only optimization for vision-language adaptation. Specifically, we propose a novel text-to-text cross-entropy loss that maximizes the probability of the learned prompts being correctly classified with respect to the hand-engineered ones, and we show its effectiveness in alleviating base-class overfitting.
- To increase the representation capacity of the prompts, and inspired by grouped convolution and multi-head attention, we propose a grouped language-aware prompt representation where each group of prompts specializes to a different subset of the pre-defined manual templates.
- We identify a visual-language misalignment introduced by prompt learning and LASP which impacts generalization. More importantly, we propose a re-calibration mechanism based on (a) Layer Normalization fine-tuning and (b) learning a class-agnostic bias to address it.
- Thanks to our language-only learning framework, we propose training LASP with virtual classes by including, during training, class names for which no visual samples are available. Importantly, we show that this further increases the robustness of the learned prompts.
- Finally, capitalizing on our language-only optimization framework, we present a zero-shot variant of LASP in which no visual samples at all are available for the downstream adaptation task, and we show its superiority over CLIP with prompt engineering. Effectively, this accomplishes vision-language adaptation without vision data.
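To make the soft-prompt setup concrete, the following is a minimal NumPy sketch of CoOp-style soft prompting as described above. All names, shapes, and the stand-in encoder are illustrative assumptions, not CLIP's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512   # token embedding width
M = 16    # number of learnable context vectors (the "soft prompts")
C = 10    # number of downstream classes

soft_prompts = rng.normal(0.0, 0.02, size=(M, D))       # learnable
class_name_emb = rng.normal(0.0, 0.02, size=(C, 1, D))  # frozen class-name tokens

def text_encoder(tokens):
    """Stand-in for the frozen text encoder g_T(.): a fixed random
    projection of the mean-pooled token sequence, L2-normalized."""
    W = np.random.default_rng(1).normal(size=(D, D))
    feat = tokens.mean(axis=0) @ W
    return feat / np.linalg.norm(feat)

# Class weight t_c = g_T([p_1, ..., p_M, class_name_c]); only the soft
# prompts would receive gradients during few-shot training.
class_weights = np.stack([
    text_encoder(np.concatenate([soft_prompts, class_name_emb[c]]))
    for c in range(C)
])

print(class_weights.shape, soft_prompts.size)  # (10, 512) 8192
```

Note how few parameters are trained (M x D = 8192 here) relative to the frozen encoder, which is what makes the process parameter-efficient.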
2 Related Work
3 Method
3.1 Background
A hand-engineered prompt \(h_c\), e.g. "a photo of a \(\{\texttt{class\_name}_c\}\)", is passed through the V&L model's text encoder \(g_T(\cdot)\) to compute the class-specific text feature (weight) \(\textbf{t}^h_c = g_T(h_c)\). Moreover, an image \(\textbf{x}\) to be classified is passed through the V&L model's image encoder \(g_I(\cdot)\) to compute the image-specific feature \(\textbf{f} = g_I(\textbf{x})\). A probability distribution over the class labels is given by
\[
p(c \mid \textbf{x}) = \frac{\exp\big(\cos(\textbf{t}^h_c, \textbf{f})/\tau\big)}{\sum_{c'=1}^{C}\exp\big(\cos(\textbf{t}^h_{c'}, \textbf{f})/\tau\big)},
\]
where \(\tau\) is a temperature and \(C\) the number of classes.

3.2 Language-Aware Soft Prompting (LASP)
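The text-to-text idea can be sketched in a few lines: features of the learned prompts are classified against features of the hand-engineered templates with a cross-entropy over classes. Dimensions and the temperature value below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, tau = 5, 64, 0.07   # classes, feature dim, temperature

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

t_hand = l2norm(rng.normal(size=(C, D)))  # g_T(hand-crafted prompt), per class
t_soft = l2norm(rng.normal(size=(C, D)))  # g_T(learned soft prompt), per class

# logits[c, c'] = cos(t_soft_c, t_hand_c') / tau; the target for row c is c,
# i.e. each learned prompt should be recognized as its own class.
logits = t_soft @ t_hand.T / tau
loss_tt = -np.mean(np.diag(log_softmax(logits)))
print(loss_tt > 0)  # True: random prompts sit far from their class weights
```

No image is involved in this loss, which is what later enables training with virtual classes and the fully zero-shot variant.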
3.3 Grouped LASP
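An illustrative sketch of the grouped representation introduced in the contributions: G groups of learned prompts, each scored only against its own subset of the L manual templates, analogous to grouped convolution. All names and shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, L, G = 5, 64, 8, 2   # classes, feature dim, templates, groups

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Hand-crafted template features, split into G groups of L // G templates.
t_hand = l2norm(rng.normal(size=(L, C, D))).reshape(G, L // G, C, D)
# One learned prompt feature per (group, class); each group specializes.
t_soft = l2norm(rng.normal(size=(G, C, D)))

loss = 0.0
for g in range(G):
    w_g = l2norm(t_hand[g].mean(axis=0))        # group's class weights (C, D)
    logits = t_soft[g] @ w_g.T / 0.07           # (C, C)
    loss += -np.mean(np.diag(log_softmax(logits))) / G
print(loss > 0)
```

Each group thus sees only its own template subset, which increases the representation capacity of the prompts without any cross-group interaction.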
3.4 Re-aligning LASP
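A sketch of the two re-calibration ingredients named in the contributions: (a) the visual encoder's Layer Normalization affine parameters are treated as trainable, and (b) a single class-agnostic bias vector is added to every text class weight. All shapes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 5, 64

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

gamma, beta = np.ones(D), np.zeros(D)   # fine-tuned LN affine parameters
b = np.zeros(D)                         # learnable bias, shared by all classes

f = layer_norm(rng.normal(size=(D,)), gamma, beta)  # image feature
t = rng.normal(size=(C, D))                          # text class weights

scores = (t + b) @ f   # the bias shifts all classes equally in embedding space
print(scores.shape)    # (5,)
```

Because b is shared across classes, it can only correct a global shift between the visual and textual embeddings, not per-class behavior, which is what makes it suitable for re-alignment without overfitting to the base classes.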
3.5 LASP with Virtual Classes (LASP-V)
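Because the text-to-text loss needs no images, extra class names with zero visual samples can simply enlarge that loss's label space. A minimal sketch (shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C, V, D = 5, 3, 64   # real classes, virtual classes, feature dim

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hand-crafted and learned-prompt text features for real + virtual names;
# the virtual names contribute targets/negatives purely on the text side.
t_hand = l2norm(rng.normal(size=(C + V, D)))
t_soft = l2norm(rng.normal(size=(C + V, D)))

logits = t_soft @ t_hand.T / 0.07
print(logits.shape)  # (8, 8)
```

The image-side objective is untouched; only the text-to-text classification problem grows from C to C + V classes.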
4 Zero-Shot LASP (LASP-Z)
large, small, rotated, pixelated, colorful, etc.). This explores, in essence, class-generic appearance variations directly in the text domain, analogous to the image ones. Moreover, \(f_{post}(\textbf{t}) = \textbf{t} + \textbf{x},\ \textbf{x} \sim {\mathcal {N}}(\mu, \sigma^2)\) adds to the text feature descriptor \(\textbf{t}\) a noise vector sampled from a normal distribution. Depending on its magnitude, this allows the model to explore the immediate vicinity of the prompt in the CLIP embedding space, increasing the chance of matching points located in the proximity of true visual samples and mitigating, to some extent, the domain gap.

5 Experiments
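The feature-space augmentation \(f_{post}\) from Sec. 4 can be sketched as follows; the values of \(\mu\), \(\sigma\), and the number of noisy copies are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, mu, sigma = 64, 0.0, 0.15

t = rng.normal(size=(D,))
t = t / np.linalg.norm(t)   # CLIP text features are L2-normalized

def f_post(t, n_samples=8):
    # Draw several noisy copies around the clean feature so training can
    # explore the prompt's immediate vicinity in the embedding space.
    noise = rng.normal(mu, sigma, size=(n_samples, t.shape[0]))
    return t + noise

aug = f_post(t)
print(aug.shape)  # (8, 64)
```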
5.1 Comparison with State-of-the-Art
- Conclusion 1: In terms of harmonic mean, LASP outperforms all methods by a large margin. It outperforms, on average, the second best (ProDA) by \(>2\%\). The improvement on specific datasets is even bigger (e.g. \(>3\%\) on Flowers102, \(>11\%\) on EuroSAT, \(>3\%\) on UCF101).
- Conclusion 2: On the novel classes, LASP outperforms all methods by a large margin. It is the first reported method to outperform CLIP on novel classes, by 0.68% (though note that CLIP performs very poorly on the base classes). It also outperforms ProDA (the third best) by \(>2.5\%\). Again, compared to ProDA, the improvement on specific datasets is even bigger (e.g. \(>5\%\) on Flowers102, \(>3\%\) on Food101, \(>11\%\) on EuroSAT, \(>6\%\) on UCF101).
- Conclusion 3: On new classes, LASP with virtual classes has a significant impact for specific datasets. These include datasets with informative class names, like EuroSAT and DTD, where the improvement over LASP is \(\sim 5.5\%\) and \(\sim 4.0\%\), respectively.
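The H column reported throughout is the harmonic mean of base- and new-class accuracy, which rewards balanced performance on both splits:

```python
def harmonic_mean(base, new):
    return 2 * base * new / (base + new)

# CoCoOp's average row (Base 72.46, New 64.77) gives H close to the
# reported 68.39 (up to rounding).
print(harmonic_mean(72.46, 64.77))
```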
(a) Average over 11 datasets

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 72.46 | 64.77 | 68.39 |
| LASP | 76.59 | 67.55 | 71.78 |
| LASP-V | 77.23 | 68.52 | 72.61 |

(b) ImageNet

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 71.90 | 67.50 | 69.63 |
| LASP | 72.00 | 67.33 | 69.51 |
| LASP-V | 71.90 | 68.00 | 69.78 |

(c) Caltech101

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 95.20 | 90.67 | 92.87 |
| LASP | 94.87 | 92.20 | 93.51 |
| LASP-V | 95.54 | 92.78 | 94.13 |

(d) OxfordPets

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 91.01 | 93.10 | 92.04 |
| LASP | 91.53 | 92.87 | 92.19 |
| LASP-V | 92.23 | 93.17 | 92.69 |

(e) StanfordCars

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 67.26 | 69.43 | 68.33 |
| LASP | 72.27 | 68.73 | 70.45 |
| LASP-V | 71.00 | 68.50 | 69.27 |

(f) Flowers102

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 86.73 | 64.63 | 74.06 |
| LASP | 90.97 | 68.80 | 78.34 |
| LASP-V | 92.20 | 69.93 | 79.53 |

(g) Food101

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 85.73 | 85.50 | 85.61 |
| LASP | 87.53 | 87.17 | 87.34 |
| LASP-V | 87.73 | 87.17 | 87.45 |

(h) FGVCAircraft

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 24.50 | 25.93 | 25.19 |
| LASP | 24.33 | 27.03 | 25.61 |
| LASP-V | 28.77 | 27.80 | 28.27 |

(i) SUN397

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 71.13 | 67.76 | 69.40 |
| LASP | 72.60 | 67.21 | 69.80 |
| LASP-V | 72.55 | 69.11 | 70.79 |

(j) DTD

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 59.33 | 42.70 | 49.65 |
| LASP | 67.53 | 46.93 | 55.37 |
| LASP-V | 65.67 | 49.90 | 56.71 |

(k) EuroSAT

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 69.20 | 39.23 | 50.14 |
| LASP | 89.38 | 54.87 | 67.99 |
| LASP-V | 90.80 | 56.80 | 69.88 |

(l) UCF101

| Method | Base | New | H |
|---|---|---|---|
| CoCoOp | 75.16 | 66.10 | 70.34 |
| LASP | 79.57 | 70.00 | 74.47 |
| LASP-V | 81.20 | 70.60 | 75.52 |
5.2 Zero-Shot Adaptation Setting
- Conclusion 4: Zero-shot LASP (LASP-Z) significantly outperforms CLIP in the zero-shot adaptation setting. For this purpose, the proposed language-based augmentations are necessary.
| Dataset | Set | Baseline (Zhou et al., 2022) | Text-to-Text | +Grouped | +Align (LASP) | +Virtual (LASP-V) |
|---|---|---|---|---|---|---|
| Average | Base | 82.69 | 81.26 | 81.87 | 82.70 | 83.18 |
| | New | 63.22 | 71.54 | 73.48 | 74.90 | 76.11 |
| | H | 71.66 | 76.09 | 77.44 | 78.61 | 79.48 |
| ImageNet | Base | 76.47 | 75.97 | 76.20 | 76.20 | 76.25 |
| | New | 67.88 | 70.31 | 70.70 | 70.95 | 71.17 |
| | H | 71.92 | 73.03 | 73.34 | 73.48 | 73.62 |
| Caltech101 | Base | 98.00 | 97.70 | 97.97 | 98.10 | 98.17 |
| | New | 89.91 | 94.08 | 94.27 | 94.24 | 94.33 |
| | H | 93.73 | 95.85 | 96.08 | 96.16 | 96.21 |
| OxfordPets | Base | 93.67 | 95.13 | 95.63 | 95.90 | 95.73 |
| | New | 95.29 | 96.23 | 97.87 | 97.93 | 97.87 |
| | H | 94.47 | 95.68 | 96.73 | 96.90 | 96.79 |
| StanfordCars | Base | 78.12 | 72.46 | 73.50 | 75.17 | 75.23 |
| | New | 60.40 | 71.80 | 72.10 | 71.60 | 71.77 |
| | H | 68.13 | 72.19 | 72.93 | 73.34 | 73.46 |
| Flowers102 | Base | 97.60 | 96.47 | 96.80 | 97.00 | 97.17 |
| | New | 59.67 | 70.70 | 74.00 | 74.00 | 73.53 |
| | H | 74.06 | 81.59 | 83.87 | 83.95 | 83.71 |
| Food101 | Base | 88.33 | 90.30 | 91.00 | 91.20 | 91.20 |
| | New | 82.26 | 90.73 | 90.87 | 91.70 | 91.90 |
| | H | 85.19 | 90.51 | 90.93 | 91.44 | 91.54 |
| FGVCAircraft | Base | 40.44 | 32.63 | 33.05 | 34.53 | 38.05 |
| | New | 22.30 | 30.46 | 31.80 | 30.57 | 33.20 |
| | H | 28.75 | 31.57 | 32.41 | 32.43 | 35.46 |
| SUN397 | Base | 80.60 | 80.20 | 80.55 | 80.70 | 80.70 |
| | New | 65.89 | 75.56 | 77.11 | 78.60 | 79.30 |
| | H | 72.51 | 77.81 | 78.79 | 79.63 | 80.00 |
| DTD | Base | 79.44 | 79.13 | 80.50 | 81.40 | 81.10 |
| | New | 41.18 | 52.10 | 56.20 | 58.60 | 62.57 |
| | H | 54.24 | 62.82 | 66.19 | 68.14 | 70.64 |
| EuroSAT | Base | 92.19 | 91.23 | 91.90 | 94.60 | 95.00 |
| | New | 54.74 | 63.16 | 66.37 | 77.78 | 83.37 |
| | H | 68.90 | 74.64 | 77.07 | 85.36 | 88.86 |
| UCF101 | Base | 84.69 | 82.70 | 83.47 | 84.77 | 85.53 |
| | New | 56.05 | 71.80 | 77.07 | 78.03 | 78.20 |
| | H | 67.46 | 76.86 | 80.14 | 81.26 | 81.70 |
| Method | ImageNet (source) | Caltech101 | OxfordPets | StanfordCars | Flowers102 | Food101 | FGVCAircraft | SUN397 | DTD | EuroSAT | UCF101 | Average (targets) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CoOp | 71.51 | 93.70 | 89.14 | 64.51 | 68.71 | 85.30 | 18.47 | 64.15 | 41.92 | 46.39 | 66.55 | 63.88 |
| CoCoOp | 71.02 | 94.43 | 90.14 | 65.32 | 71.88 | 86.06 | 22.94 | 67.36 | 45.73 | 45.37 | 68.21 | 65.74 |
| LASP | 71.30 | 94.50 | 89.36 | 66.20 | 71.74 | 86.40 | 23.03 | 67.00 | 45.54 | 48.50 | 68.24 | 66.52 |
5.3 Ablation Studies
- Conclusion 5: Our idea in its plain form (the Text-to-Text loss) outperforms its direct baseline (CoOp) by a large margin. Specifically, it improves upon CoOp by \(\sim 4.5\%\) on average, demonstrating its effectiveness.
- Conclusion 6: All components are needed to obtain high accuracy.
The single-template variant uses only "a photo of {}". Random templates are produced by sampling grammatically plausible random sentences that contain incoherent words, with lengths between 5 and 20 words. The class names are inserted at the end of these random templates. All variations use the same training scheduler and hyperparameters, except for the case of random templates, where \(\alpha _{TT}=5\).

- Conclusion 7: The exact choice of templates might not be so significant for the few-shot setting.
- Conclusion 8: For the case of novel classes, both the number and the content of the templates are important to obtain high accuracy.
- Conclusion 9: The proposed CE-loss-based formulation outperforms other losses for LASP.
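A sketch of how such random templates could be generated: incoherent sentences of 5-20 words with the class-name slot appended at the end. The word pool below is an invented stand-in.

```python
import random

WORDS = ("the a quiet green photo idea sleeps with near very "
         "furious small table cloud of runs old bright").split()

def random_template(rng):
    n_words = rng.randint(5, 20)            # inclusive bounds
    sentence = " ".join(rng.choice(WORDS) for _ in range(n_words))
    return sentence + " {}"                 # class name inserted at the end

rng = random.Random(0)
templates = [random_template(rng) for _ in range(3)]
print(templates[0].endswith("{}"))  # True
```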
| Method | Learnable? | ImageNet (source) | ImageNetV2 | ImageNet-Sketch | ImageNet-A | ImageNet-R |
|---|---|---|---|---|---|---|
| CLIP | ✗ | 66.73 | 60.83 | 46.15 | 47.77 | 73.96 |
| CoOp | \(\checkmark \) | 71.51 | 64.20 | 47.99 | 49.71 | 75.21 |
| CoCoOp | \(\checkmark \) | 71.02 | 64.07 | 48.75 | 50.63 | 76.18 |
| LASP | \(\checkmark \) | 71.10 | 63.96 | 49.01 | 50.70 | 77.07 |
- Conclusion 10: The models are somewhat robust to out-of-domain distractors. Specifically, the drop in accuracy is moderate (typically 1-2%). The exception is EuroSAT, where the number of classes increases \(25\times \). Importantly, LASP-V manages to largely recover the lost accuracy.
- Conclusion 11: In-domain distractors significantly increase the problem difficulty. Specifically, the drop in accuracy is large (4-7%). LASP-V manages to recover part of the lost accuracy.
| Set | CE | \(L_1\) | \(L_2\) |
|---|---|---|---|
| Base | 81.26 | 81.50 | 81.47 |
| New | 71.54 | 66.01 | 65.80 |
| H | 76.09 | 73.54 | 72.80 |
(a) EuroSAT

| Method | Base (w/o distr.) | New (w/o distr.) | H (w/o distr.) | Base (w/ distr.) | New (w/ distr.) | H (w/ distr.) |
|---|---|---|---|---|---|---|
| LASP | 86.25 | 64.63 | 73.89 | 86.00 | 55.80 | 67.68 |
| LASP-V | 90.00 | 65.73 | 75.97 | 90.80 | 59.87 | 72.16 |

(b) Food101

| Method | Base (w/o distr.) | New (w/o distr.) | H (w/o distr.) | Base (w/ distr.) | New (w/ distr.) | H (w/ distr.) |
|---|---|---|---|---|---|---|
| LASP | 87.17 | 87.53 | 87.34 | 87.01 | 86.90 | 86.95 |
| LASP-V | 87.17 | 87.63 | 87.39 | 86.99 | 87.10 | 87.04 |

(c) Flowers102

| Method | Base (w/o distr.) | New (w/o distr.) | H (w/o distr.) | Base (w/ distr.) | New (w/ distr.) | H (w/ distr.) |
|---|---|---|---|---|---|---|
| LASP | 90.97 | 67.80 | 77.69 | 90.00 | 67.10 | 76.68 |
| LASP-V | 93.20 | 69.93 | 79.90 | 92.05 | 69.08 | 78.92 |

(d) OxfordPets

| Method | Base (w/o distr.) | New (w/o distr.) | H (w/o distr.) | Base (w/ distr.) | New (w/ distr.) | H (w/ distr.) |
|---|---|---|---|---|---|---|
| LASP | 92.53 | 94.20 | 91.52 | 91.53 | 92.60 | 92.06 |
| LASP-V | 92.25 | 93.97 | 93.10 | 92.23 | 93.17 | 92.69 |
(a) Food101

| Method | Base (w/o distr.) | New (w/o distr.) | H (w/o distr.) | Base (w/ distr.) | New (w/ distr.) | H (w/ distr.) |
|---|---|---|---|---|---|---|
| LASP | 87.17 | 87.53 | 87.34 | 82.70 | 83.47 | 83.08 |
| LASP-V | 87.17 | 87.63 | 87.39 | 83.11 | 83.95 | 83.52 |

(b) Flowers102

| Method | Base (w/o distr.) | New (w/o distr.) | H (w/o distr.) | Base (w/ distr.) | New (w/ distr.) | H (w/ distr.) |
|---|---|---|---|---|---|---|
| LASP | 90.97 | 67.80 | 77.69 | 80.16 | 62.50 | 70.23 |
| LASP-V | 93.20 | 69.93 | 79.90 | 83.95 | 65.31 | 73.47 |
(a) DTD

| #Templates | 1 | 34 | 100 |
|---|---|---|---|
| Text-to-Text (R) | 49.02 | 51.63 | 52.64 |
| Text-to-Text | 50.73 | 52.10 | 56.53 |

(b) EuroSAT

| #Templates | 1 | 34 | 100 |
|---|---|---|---|
| Text-to-Text (R) | 55.01 | 59.90 | 62.10 |
| Text-to-Text | 56.97 | 63.16 | 65.13 |

(c) UCF101

| #Templates | 1 | 34 | 100 |
|---|---|---|---|
| Text-to-Text (R) | 67.50 | 68.60 | 70.03 |
| Text-to-Text | 71.36 | 71.80 | 72.77 |
| s | 0 | 0.05 | 0.15 | 0.3 |
|---|---|---|---|---|
| LASP-Z | 70.81 | 71.90 | 72.84 | 72.70 |

| s | 0 | 10 | 15 | 20 |
|---|---|---|---|---|
| LASP-Z | 69.70 | 71.73 | 72.40 | 72.45 |
| Dataset | Text Augm. | Base (clean) | New (clean) | H (clean) | Base (rotated) | New (rotated) | H (rotated) |
|---|---|---|---|---|---|---|---|
| OxfordPets | ✗ | 95.73 | 97.87 | 96.79 | 71.20 | 72.41 | 71.79 |
| OxfordPets | \(\checkmark \) | 95.64 | 97.85 | 96.73 | 72.14 | 72.70 | 72.41 |
- Conclusion 12: Text-based augmentations are a viable solution for increased robustness.