1 Introduction
- We find that the visualized position encodings obtained with the concatenation and addition methods are similar. However, the concatenation method reduces the attention distance in the final few layers.
- We observe that the concatenation method can learn horizontal, vertical, and angle information, whereas the addition method is limited in this regard.
- We discover that the concatenation method is more robust than the addition method, yielding Top-1 gains of +0.1% to +0.5% on most of the existing models evaluated.
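To make the distinction between the two methods concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that addition sums a position encoding of the full model width into each token, while concatenation appends a narrower position encoding along the channel axis so that the final width is unchanged. The sizes `N`, `D`, and `D_POS` and all function names are hypothetical, and random values stand in for learned embeddings.

```python
import random


def random_matrix(rows, cols):
    """Stand-in for learned embeddings: rows x cols Gaussian values."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def add_position(tokens, pos):
    """Addition: tokens and pos share one width; sum them element-wise."""
    return [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]


def concat_position(tokens, pos):
    """Concatenation: append each position vector to its token's channels."""
    return [tok + pe for tok, pe in zip(tokens, pos)]


random.seed(0)
N, D, D_POS = 4, 8, 2  # hypothetical: 4 tokens, model width 8, 2 position channels

# Addition: both inputs have width D, and position is mixed into every channel.
tokens_add = random_matrix(N, D)
pos_add = random_matrix(N, D)
x_add = add_position(tokens_add, pos_add)

# Concatenation (assumed variant): tokens use D - D_POS channels and the
# position encoding fills the remaining D_POS channels.
tokens_cat = random_matrix(N, D - D_POS)
pos_cat = random_matrix(N, D_POS)
x_cat = concat_position(tokens_cat, pos_cat)

print(len(x_add), len(x_add[0]))
print(len(x_cat), len(x_cat[0]))
```

Both outputs have shape `N x D`, but in the concatenated version the content and position channels stay separate, so later layers can read position without first disentangling it from the token content.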
2 Related Work
2.1 Vision Transformers
2.2 Self-attention
2.3 Absolute Position Encoding
3 Methodology
3.1 Theoretical Analysis and Assumption
3.2 Efficient Implementation
4 Experiment
Models | Param. (M)\(\downarrow \) | FLOPs (G)\(\downarrow \) | Mem. (G)\(\downarrow \) | Speed (im/s)\(\uparrow \) | Top-1 (%) (Add./Concat.)\(\uparrow \) |
---|---|---|---|---|---|
DeiT-S [23] | 22.0/21.6 | 4.23/4.22 | 6.77/6.96 | 986.37/916.29 | 79.8/80.0(+0.2) |
CaiT-XXS36 [24] | 17.2/17.3 | 3.24/3.22 | 14.12/14.13 | 522.84/543.09 | 79.1/79.3(+0.2) |
ConViT-S [25] | 27.7/27.5 | 5.36/5.32 | 12.26/12.25 | 600.84/620.23 | 81.3/81.6(+0.3) |
VoLo-D1 [38] | 26.6/26.7 | 6.52/6.53 | 14.01/14.03 | 511.35/526.45 | 84.2/84.2 |
TnT-S [39] | 23.8/23.6 | 4.85/4.83 | 13.47/13.45 | 466.92/467.39 | 81.5/81.4(\(-\)0.1) |
PvT-S [9] | 24.5/23.9 | 3.69/3.64 | 10.50/9.99 | 854.09/864.26 | 79.8/80.3(+0.5) |
LocalViT-PvT [28] | 13.5/12.6 | 1.96/1.91 | 17.89/17.52 | 768.09/773.06 | 78.2/78.5(+0.3) |
DeiT-B [23] | 86.6/85.8 | 16.86/16.81 | 14.30/14.29 | 310.78/311.10 | 81.8/82.2(+0.4) |
CaiT-XS36 [24] | 38.5/38.4 | 7.25/7.23 | 22.16/21.36 | 330.80/344.26 | 82.6/82.6 |
ConViT-B [25] | 86.4/86.1 | 16.81/16.75 | 17.51/17.50 | 256.20/262.41 | 82.2/82.2 |
VoLo-D2 [38] | 58.6/58.7 | 13.61/13.63 | 23.98/23.98 | 266.24/273.19 | 85.2/85.3(+0.1) |
TnT-B [39] | 65.4/65.1 | 13.44/13.40 | 21.68/21.65 | 264.99/265.12 | 82.9/82.8(\(-\)0.1) |
PvT-M [9] | 44.2/43.6 | 6.47/6.41 | 14.54/14.26 | 542.86/546.84 | 81.2/81.5(+0.3) |
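
The Top-1 differences in the last column can be recomputed directly from the listed accuracies. The short script below is only an arithmetic check on the numbers in the table above; the variable names are illustrative.

```python
# (model, Top-1 with addition, Top-1 with concatenation), copied from the table.
top1 = [
    ("DeiT-S", 79.8, 80.0),
    ("CaiT-XXS36", 79.1, 79.3),
    ("ConViT-S", 81.3, 81.6),
    ("VoLo-D1", 84.2, 84.2),
    ("TnT-S", 81.5, 81.4),
    ("PvT-S", 79.8, 80.3),
    ("LocalViT-PvT", 78.2, 78.5),
    ("DeiT-B", 81.8, 82.2),
    ("CaiT-XS36", 82.6, 82.6),
    ("ConViT-B", 82.2, 82.2),
    ("VoLo-D2", 85.2, 85.3),
    ("TnT-B", 82.9, 82.8),
    ("PvT-M", 81.2, 81.5),
]

# Round to one decimal to remove floating-point noise from the subtraction.
deltas = {name: round(cat - add, 1) for name, add, cat in top1}

improved = [name for name, d in deltas.items() if d > 0]
unchanged = [name for name, d in deltas.items() if d == 0]
regressed = [name for name, d in deltas.items() if d < 0]

print("improved:", improved)
print("unchanged:", unchanged)
print("regressed:", regressed)
print("largest gain:", max(deltas.values()))
```

On these numbers, eight models improve (by +0.1 to +0.5), three are unchanged, and the two TnT variants drop by 0.1.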
4.1 Image Classification
4.2 Transfer Learning
Datasets | Domain | Input size | Train size | Test size | Classes |
---|---|---|---|---|---|
ImageNet-1k [19] | Mixed | Various | 1,281,167 | 50,000 | 1000 |
Cifar100 [34] | Mixed | \(32\times 32\) | 50,000 | 10,000 | 100 |
Cifar10 [33] | Mixed | \(32\times 32\) | 50,000 | 10,000 | 10 |
Flowers102 [35] | Flowers | Various | 2,040 | 6,149 | 102 |
Cars [37] | Cars | Various | 8,144 | 8,041 | 196 |
Pets [36] | Dogs and cats | Various | 3,698 | 3,695 | 37 |
Models | Methods | Param.(M) | ImageNet-1k | Cifar10 | Cifar100 | Flowers102 | Cars | Pets |
---|---|---|---|---|---|---|---|---|
DeiT-S [23] | Add | 22.0 | 79.8 | 98.6 | 87.5 | 99.2 | 92.0 | 95.5 |
DeiT-S [23] | Concat | 21.6 | 80.0 | 98.6 | 87.8 | 99.1 | 92.1 | 95.7 |
CaiT-XXS36 [24] | Add | 17.2 | 79.1 | 98.8 | 88.8 | 99.1 | 92.3 | 95.8 |
CaiT-XXS36 [24] | Concat | 17.3 | 79.3 | 98.9 | 89.0 | 99.3 | 92.5 | 96.0 |
ConViT-S [25] | Add | 27.7 | 81.3 | 98.5 | 89.1 | 99.2 | 92.2 | 95.8 |
ConViT-S [25] | Concat | 27.5 | 81.6 | 98.6 | 89.3 | 99.3 | 92.5 | 96.1 |
PvT-S [9] | Add | 24.5 | 79.8 | 98.3 | 81.7 | 99.1 | 91.5 | 95.8 |
PvT-S [9] | Concat | 23.9 | 80.3 | 98.4 | 82.0 | 98.9 | 91.8 | 96.0 |
TnT-S [39] | Add | 23.8 | 81.5 | 98.6 | 88.6 | 99.2 | 92.8 | 95.9 |
TnT-S [39] | Concat | 23.6 | 81.4 | 98.6 | 88.6 | 99.1 | 92.6 | 95.8 |
VoLo-D1 [38] | Add | 26.6 | 84.2 | 98.9 | 89.8 | 99.4 | 93.4 | 96.7 |
VoLo-D1 [38] | Concat | 26.7 | 84.2 | 98.9 | 89.9 | 99.4 | 93.5 | 96.7 |