In this section, we first conduct an ablation study of our approach SCE to find the best hyperparameters for videos. Then, we compare SCE with the state of the art after pretraining on Kinetics400 and assess its generalization on various tasks.
4.2.1 Ablation study
Pretraining dataset To conduct the ablation study, we perform pretraining experiments on Mini-Kinetics200 [106], hereafter called Kinetics200 for simplicity. It is a subset of Kinetics400 [107], so the two datasets have close distributions while Kinetics200 requires fewer resources to train on. Kinetics400 is composed of 216k videos for training and 18k for validation across 400 action classes. However, it was collected from YouTube and some videos have since been deleted. We use the dataset hosted by the CVD foundation.
Evaluation datasets To study the quality of our pretrained representation, we perform linear evaluation classification on the Kinetics200 dataset. We also finetune on the first split of the UCF101 [108] and HMDB51 [109] datasets. UCF101 is an action classification dataset that contains 13.3k videos for 101 classes and has 3 different training and validation splits. HMDB51 is also an action classification dataset; it contains 6.7k videos from 51 classes with 3 different splits.
Pretraining implementation details We use the ResNet3D-18 network [110] following the Slow path of Feichtenhofer et al. [111]. We keep hyperparameters close to the ones used for ImageNet in Sect. 4.1.3; more details can be found in Appendix D.3. We pretrain for 200 epochs with a batch size of 512. The loss is symmetrized. To form two different views from a video, we follow Feichtenhofer et al. [4] and randomly sample two clips of 2.56 seconds each from the video, keeping only 8 frames per clip.
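As an illustration, below is a minimal sketch of this two-clip temporal sampling, assuming the video is already decoded into a frame tensor and the frame rate is known; the function name and exact decoding pipeline are assumptions, and the real preprocessing is described in Appendix D.3.

```python
import torch

def sample_two_clips(video, fps, clip_duration=2.56, num_frames=8):
    """Randomly sample two clips of `clip_duration` seconds from `video`
    (tensor of shape [T, C, H, W]) and keep `num_frames` evenly spaced
    frames from each clip. Sketch only; the actual pipeline may differ."""
    total_frames = video.shape[0]
    clip_len = int(round(clip_duration * fps))         # frames covered by one clip
    clips = []
    for _ in range(2):                                  # two independent temporal crops
        start = torch.randint(0, max(total_frames - clip_len, 0) + 1, (1,)).item()
        # evenly spaced frame indices inside the temporal crop
        idx = torch.linspace(start, start + clip_len - 1, num_frames).long()
        idx = idx.clamp(max=total_frames - 1)
        clips.append(video[idx])                        # [num_frames, C, H, W]
    return clips[0], clips[1]
```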
Linear evaluation and finetuning evaluation protocols We follow Feichtenhofer et al. [4]; details can be found in Appendix D.3. For finetuning on UCF101 and HMDB51, we only use the first split in the ablation study.
Table 11
Object detection and instance segmentation on COCO [94] training a Mask R-CNN [95]

Method | AP\(^{bb}\) | AP\(^{mk}\)
Random | 35.6 | 31.4
Supervised | 40.0 | 34.7
 | 40.0 | 35.0
 | 40.0 | 34.9
 | 39.4 | 34.5
 | 40.9 | 35.5
 | 40.9 | 35.5
 | 39.6 | 34.6
 | 40.3 | 35.1
SCE (ours) | 41.6 | 36.0
 | 41.7 | 36.2
Table 12
Comparison of our baseline and supervised training on the Kinetics200, UCF101 and HMDB51 Top-1 accuracy

Method | Kinetics200 | UCF101 | HMDB51
SCE baseline | 63.9 | 86.3 | 57.0
Supervised | \(\mathbf {72.0}\) | \(\mathbf {87.5}\) | \(\mathbf {60.1}\)
Baseline and supervised learning We define an SCE baseline which uses the hyperparameters \(\lambda =0.5\), \(\tau =0.1\) and \(\tau _m=0.07\). We report the performance of our SCE baseline as well as supervised training in Table 12. Our baseline has lower results than supervised learning, with \(-8.1\) p.p. on Kinetics200, \(-1.2\) p.p. on UCF101 and \(-3.1\) p.p. on HMDB51, which shows that our representation has a large margin for improvement.
Table 13
Effect of varying \(\lambda \) on the Kinetics200, UCF101 and HMDB51 Top-1 accuracy

\(\lambda \) | Kinetics200 | UCF101 | HMDB51
0.000 | 64.2 | 86.2 | \(\underline{57.5}\)
0.125 | \(\mathbf {64.8}\) | \(\mathbf {86.9}\) | \(\mathbf {58.2}\)
0.250 | 64.3 | \(\underline{86.7}\) | \(\mathbf {58.2}\)
0.375 | \(\underline{64.7}\) | 86.3 | 56.8
0.500 | 63.9 | 86.3 | 57.0
0.625 | 63.4 | 86.2 | 55.7
0.750 | 63.1 | 85.8 | 56.2
0.875 | 62.1 | 85.7 | 55.3
1.000 | 61.9 | 85.0 | 55.4
Leveraging contrastive and relational learning As in the image study, we vary \(\lambda \) from Eq. (8) in the set \(\{0, 0.125, \dots , 0.875, 1\}\) to observe the effect of leveraging the relational and contrastive aspects, and report results in Table 13. Using relations during pretraining improves the results compared with only optimizing a contrastive learning objective. The performance on Kinetics200, UCF101 and HMDB51 consistently increases when decreasing \(\lambda \) from 1 to 0.25, and the best \(\lambda \) obtained is 0.125. Moreover, \(\lambda =0\) performs better than \(\lambda =1\). These results suggest that for video pretraining with standard image contrastive learning augmentations, relational learning performs better than contrastive learning, and leveraging both further improves the quality of the representation.
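To make the roles of \(\lambda \), \(\tau \) and \(\tau _m\) concrete, here is a minimal sketch of a soft contrastive objective in the spirit of Eq. (8), assuming L2-normalized online and target embeddings of the two views. The exact masking and symmetrization used in SCE may differ, so this is an illustration rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def sce_loss(online, target, lam=0.5, tau=0.1, tau_m=0.07):
    """Sketch of a soft contrastive objective mixing a one-hot (contrastive)
    target and a relational target built from inter-instance similarities.
    `online` and `target` are L2-normalized embeddings of the two views,
    shape [N, D]; names and masking choices are assumptions."""
    n = online.shape[0]
    # inter-instance relations computed on the target branch, sharpened by tau_m;
    # the diagonal (self-similarity) is excluded from the relational distribution
    sim_tt = target @ target.t() / tau_m
    sim_tt.fill_diagonal_(float('-inf'))
    relations = F.softmax(sim_tt, dim=1)
    # mixed target: lambda * one-hot + (1 - lambda) * relations
    mixed = lam * torch.eye(n, device=online.device) + (1.0 - lam) * relations
    # online similarity distribution against the target embeddings, temperature tau
    logits = online @ target.t() / tau
    log_p = F.log_softmax(logits, dim=1)
    return -(mixed * log_p).sum(dim=1).mean()
```

With \(\lambda =1\) the mixed target reduces to the one-hot contrastive target, while \(\lambda =0\) keeps only the relational term; \(\tau _m < \tau \) sharpens the target distribution relative to the online one.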
Target temperature variation We study the effect of varying the target temperature with values \(\tau _m \in \{0.03, 0.04, \dots , 0.08\}\) while maintaining the online temperature \(\tau = 0.1\). We report results in Table 14. The best temperature is \(\tau _m=0.05\), indicating that a sharper target distribution is required for video pretraining. We also observe that varying \(\tau _m\) has a lower impact on performance than varying \(\lambda \).
Table 14
Effect of varying \(\tau _m\) on the Top-1 accuracy on Kinetics200, UCF101 and HMDB51 while maintaining \(\tau =0.1\)

\(\tau _m\) | Kinetics200 | UCF101 | HMDB51
0.03 | 63.4 | 86.1 | 56.9
0.04 | 63.8 | \(\mathbf {86.6}\) | 56.6
0.05 | \(\mathbf {64.3}\) | \(\underline{86.4}\) | \(\mathbf {57.1}\)
0.06 | \(\underline{64.1}\) | 86.2 | 56.4
0.07 | 63.9 | 86.3 | \(\underline{57.0}\)
0.08 | 63.8 | 85.9 | 55.8
Table 15
Effect of the color jittering strength for the strong-\(\alpha \) and strong-\(\beta \) augmentations on the Kinetics200, UCF101 and HMDB51 Top-1 accuracy

Strength | Kinetics200 | UCF101 | HMDB51
0.50 | 63.9 | 86.3 | 57.0
0.75 | \(\underline{64.6}\) | \(\underline{86.8}\) | \(\underline{57.8}\)
1.00 | \(\mathbf {64.8}\) | \(\mathbf {87.0}\) | \(\mathbf {58.1}\)
Spatial and temporal augmentations We tested varying and adding data augmentations that generate the pairs of views. As we are dealing with videos, these augmentations can be either spatial or temporal. We define the jitter augmentation that jitters the duration of a clip by a factor, reverse that randomly reverses the order of frames, and diff that randomly applies RGB difference on the frames. RGB difference consists of converting the frames to grayscale and subtracting them over time to approximate the magnitude of optical flow. In the literature, it is often used as a modality that provides better representation quality than RGB frames [5, 61, 70]. In this work, we instead treat RGB difference as a data augmentation that is randomly applied during pretraining only; evaluation only sees RGB frames.
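A minimal sketch of the diff augmentation as described above, assuming clips are tensors of shape [T, 3, H, W]; the grayscale weights and the padding used to keep the temporal length are our own choices, not taken from the paper.

```python
import torch

def rgb_difference(clip, p=0.2):
    """With probability `p`, replace a clip's RGB frames by grayscale frame
    differences, a cheap proxy for optical-flow magnitude. `clip` has shape
    [T, 3, H, W]; the output keeps the same shape by repeating the last
    difference (padding choice is an assumption)."""
    if torch.rand(1).item() >= p:
        return clip
    # ITU-R BT.601 luma weights for the grayscale conversion
    weights = torch.tensor([0.299, 0.587, 0.114], device=clip.device).view(1, 3, 1, 1)
    gray = (clip * weights).sum(dim=1, keepdim=True)   # [T, 1, H, W]
    diff = gray[1:] - gray[:-1]                        # temporal differences
    diff = torch.cat([diff, diff[-1:]], dim=0)         # pad to keep T frames
    return diff.repeat(1, 3, 1, 1)                     # back to 3 channels
```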
We tested increasing the color jittering strength in Table 15. Using a strength of 1.0 improved our performance on all benchmarks, suggesting that video pretraining requires harder spatial augmentations than image pretraining.
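For reference, color jittering is commonly parameterized by a single strength factor following the SimCLR-style convention sketched below; the 0.8/0.2 ratios and the application probability are assumptions, and the exact values used here are given in Appendix D.3.

```python
from torchvision import transforms

def color_jitter(strength=1.0, p=0.8):
    """Color jittering scaled by a single strength factor and applied with
    probability `p`; the 0.8/0.2 ratios follow the common SimCLR-style
    convention and are an assumption, not taken from Appendix D.3."""
    jitter = transforms.ColorJitter(0.8 * strength, 0.8 * strength,
                                    0.8 * strength, 0.2 * strength)
    return transforms.RandomApply([jitter], p=p)
```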
Table 16
Effect of the temporal augmentations: clip duration jittering (jitter), randomly reversing the order of frames (reverse) and randomly using RGB difference (diff), on the Kinetics200, UCF101 and HMDB51 Top-1 accuracy

jitter | reverse | diff | Kinetics200 | UCF101 | HMDB51
0.0 | 0.0 | 0.0 | 63.9 | 86.3 | 57.0
0.2 | 0.0 | 0.0 | 64.2 | 86.4 | 56.9
0.0 | 0.2 | 0.0 | 64.0 | 85.7 | 55.4
0.0 | 0.0 | 0.2 | \(\underline{65.4}\) | \(\mathbf {88.3}\) | \(\mathbf {61.4}\)
0.0 | 0.0 | 0.5 | 64.1 | \(\underline{87.7}\) | \(\underline{60.8}\)
Supervised | | | \(\mathbf {72.0}\) | 87.5 | 60.1
We tested our temporal augmentations with a jitter factor of 0.2, meaning clips are sampled with durations between \(0.80\times 2.56\) and \(1.20 \times 2.56\) seconds, reverse randomly applied with probability 0.2, and diff randomly applied with probability 0.2 or 0.5. We report results in Table 16. Varying the clip duration had no noticeable impact on our benchmarks, but reversing the order of frames decreased the performance on UCF101 and HMDB51. This can be explained by the fact that this augmentation can prevent the model from correctly representing the arrow of time. Finally, applying diff with probability 0.2 considerably improved our performance over the baseline, with \(\mathbf {+1.5}\) p.p. on Kinetics200, \(\mathbf {+2.0}\) p.p. on UCF101 and \(\mathbf {+4.4}\) p.p. on HMDB51. It outperforms supervised learning for generalization, with \(\mathbf {+0.8}\) p.p. on UCF101 and \(\mathbf {+1.3}\) p.p. on HMDB51. Applying diff more often decreases performance. These results show that SCE benefits from views that are more biased towards motion than appearance. We believe that it is particularly efficient to model relations based on motion.
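The clip duration jittering and frame reversal can be sketched as follows, assuming a hypothetical sample_clip helper that extracts a clip of the requested duration from a decoded video; the diff augmentation from the previous sketch would then be applied to the resulting clip.

```python
import torch

def temporal_augment(sample_clip, video, base_duration=2.56,
                     jitter=0.2, p_reverse=0.2):
    """Sketch of the temporal augmentations: jitter the clip duration by a
    random factor in [1 - jitter, 1 + jitter], then randomly reverse the
    frame order with probability `p_reverse`. `sample_clip(video, duration)`
    is a hypothetical helper returning a clip tensor of shape [T, C, H, W]."""
    factor = 1.0 + (2 * torch.rand(1).item() - 1.0) * jitter   # e.g. in [0.8, 1.2]
    clip = sample_clip(video, base_duration * factor)
    if torch.rand(1).item() < p_reverse:
        clip = torch.flip(clip, dims=[0])                      # reverse the time axis
    return clip
```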
Bringing all together So far, we studied varying one hyperparameter at a time from our baseline and how it affects performance. In this final study, we combine our baseline with the different best hyperparameters found, which are \(\lambda =0.125\), \(\tau _m=0.05\), color strength \(=1.0\) and diff applied with probability 0.2. We report results in Table 17 and find that using harder augmentations increases the optimal \(\lambda \) value, as \(\lambda =0.5\) performs better than \(\lambda =0.125\). This indicates that relational learning by itself cannot learn a better representation from positive views that share less mutual information; the contrastive aspect of our approach proves efficient for such harder positives. We take as best configuration \(\lambda =0.5\), \(\tau _m=0.05\), diff applied with probability 0.2 and color strength \(=1.0\), as it provides the best or second-best results on all our benchmarks. It improves our baseline by \(\mathbf {+2.1}\) p.p. on Kinetics200 and UCF101, and \(\mathbf {+5.0}\) p.p. on HMDB51. It outperforms our supervised baseline by \(\mathbf {+0.9}\) p.p. on UCF101 and \(\mathbf {+1.9}\) p.p. on HMDB51.
4.2.2 Comparison with the state of the art
Pretraining dataset To compare SCE with the state of the art, we perform pretraining on Kinetics400 [107], introduced in Sect. 4.2.1.
Evaluation datasets UCF101 [108] and HMDB51 [109] have been introduced in Sect. 4.2.1.
AVA (v2.2) [112] is a dataset for spatiotemporal localization of human actions composed of 211k training videos and 57k validation videos for 60 different classes. Bounding box annotations are used as targets, and we report the mean Average Precision (mAP) for evaluation.
Something-Something V2 (SSv2) [113] is a dataset composed of human-object interactions for 174 different classes. It contains 169k training and 25k validation videos.
Pretraining implementation details We use the ResNet3D-18 and ResNet3D-50 networks [110], and more specifically the Slow path of Feichtenhofer et al. [111]. We keep the best hyperparameters from Sect. 4.2.1, which are \(\lambda =0.5\), \(\tau _m=0.05\), RGB difference with probability 0.2, and color strength \(=1.0\) on top of the strong-\(\alpha \) and strong-\(\beta \) augmentations. From the randomly sampled clips, we specify whether we keep 8 or 16 frames.
Table 17
Effect of combining the best hyperparameters found in the ablation study, which are \(\lambda =0.125\), \(\tau _m=0.05\), color strength \(=1.0\) and randomly applying RGB difference (diff), on the Kinetics200, UCF101 and HMDB51 Top-1 accuracy

\(\lambda \) | \(\tau _m\) | diff | Strength | Kinetics200 | UCF101 | HMDB51
0.125 | 0.05 | 0.2 | 1.0 | 65.0 | 87.4 | \(\underline{61.1}\)
0.125 | 0.07 | 0.2 | 1.0 | 64.7 | 88.2 | 60.6
0.500 | 0.05 | 0.2 | 1.0 | \(\underline{66.0}\) | \(\underline{88.4}\) | \(\mathbf {62.0}\)
0.500 | 0.07 | 0.2 | 1.0 | 65.4 | \(\mathbf {88.6}\) | 61.0
SCE baseline | | | | 63.9 | 86.3 | 57.0
Supervised | | | | \(\mathbf {72.0}\) | 87.5 | 60.1
Table 18
Performance of SCE for the linear evaluation protocol on Kinetics400 and finetuning on the three splits of UCF101 and HMDB51
Table 19
Performance of SCE for video retrieval on the first split of UCF101 and HMDB51
Action recognition We compare SCE on the linear evaluation protocol on Kinetics400 and finetuning on UCF101 and HMDB51. We keep the same implementation details as in Sect. 4.2.1. We compare our results with the state of the art in Table 18 on various architectures. To propose a fair comparison, we indicate for each approach the pretraining dataset, as well as the number of frames and the resolution used during pretraining and evaluation. For unknown parameters, we leave the cell empty. We also compare with approaches that use the other visual modalities optical flow and RGB difference, and the different convolutional backbones S3D [116] and R(2+1)D-18 [117].
On ResNet3D-18, even when comparing with methods using several modalities, using \(8 \times 224^2\) frames we obtain state-of-the-art results on the three benchmarks, with \(\mathbf {59.8\%}\) accuracy on Kinetics400, \(\mathbf {90.9\%}\) on UCF101 and \(\mathbf {65.7\%}\) on HMDB51. Using \(16 \times 112^2\) frames, which is commonly used with this network, improved the results by \(+0.9\) p.p. on HMDB51 and decreased them by \(-3.2\) p.p. on Kinetics400 and \(-1.8\) p.p. on UCF101, keeping state-of-the-art results on all benchmarks except UCF101, with \(-0.5\) p.p. compared with Duan et al. [5] using the RGB and RGB difference modalities.
On ResNet3D-50, using \(16 \times 224^2\) frames we obtain state-of-the-art results on HMDB51 with \(\mathbf {74.7\%}\) accuracy, even when comparing with methods using several modalities. On UCF101, with \(\mathbf {95.3}\)% SCE is on par with the state of the art, \(-0.2\) p.p. below Feichtenhofer et al. [4], but on Kinetics400 it is \(-1.9\) p.p. lower with \(\mathbf {69.6\%}\). Our computational budget is the same, as they use 4 views for pretraining. Using 8 frames decreased performance by \(-2.0\) p.p., \(-1.2\) p.p. and \(-4.2\) p.p. on Kinetics400, UCF101 and HMDB51, but the results still outperform \(\rho \)MoCo and \(\rho \)BYOL with 2 views on the three benchmarks. This suggests that SCE is more efficient with fewer resources than these methods. Comparing our best configuration with approaches on the S3D backbone, which better fits smaller datasets, SCE has slightly lower performance than the state of the art: \(-1.0\) p.p. on UCF101 and \(-0.3\) p.p. on HMDB51.
Table 20
Performance of SCE in comparison with Feichtenhofer et al. [4] for linear evaluation on Kinetics400 and finetuning on the first split of UCF101, AVA and SSv2
Video retrieval We performed video retrieval with our pretrained backbones on the first split of UCF101 and HMDB51. To perform this task, we extract the features from the training and testing splits using the same 30-crop procedure as for action recognition, detailed in Appendix D.3. For each video in the testing split, we query the N nearest neighbors (\(N\in \{1,5,10\}\)) in the training split using cosine similarities. We report the recall R@N for the different N in Table 19.
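A minimal sketch of this retrieval evaluation, assuming per-video features have already been extracted and averaged over the 30 crops; tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieval_recall(train_feats, train_labels, test_feats, test_labels,
                     topk=(1, 5, 10)):
    """Recall@N for video retrieval: for every test video, find its N nearest
    training videos by cosine similarity and count a hit if any neighbor shares
    the query's class. Features are assumed to be one vector per video."""
    train = F.normalize(train_feats, dim=1)
    test = F.normalize(test_feats, dim=1)
    sims = test @ train.t()                               # [num_test, num_train]
    recalls = {}
    for n in topk:
        nn_idx = sims.topk(n, dim=1).indices              # indices of the N nearest neighbors
        hits = (train_labels[nn_idx] == test_labels.unsqueeze(1)).any(dim=1)
        recalls[f"R@{n}"] = hits.float().mean().item()
    return recalls
```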
We compare our results with the state of the art on ResNet3D-18. Our proposed SCE with \(16 \times 112^2\) frames is Top-1 on UCF101 with \(\mathbf {74.5\%}\), \(\mathbf {85.6\%}\) and \(\mathbf {90.5\%}\) for R@1, R@5 and R@10. Using \(8 \times 224^2\) frames slightly decreases the results, which remain state of the art. On HMDB51, SCE with \(8 \times 224^2\) frames outperforms the state of the art with \(\mathbf {40.1\%}\), \(\mathbf {63.3\%}\) and \(\mathbf {75.4\%}\) for R@1, R@5 and R@10. Using \(16 \times 112^2\) frames decreases the results, which remain competitive with the previous state-of-the-art approach [114] at \(-2.3\) p.p., \(+1.5\) p.p. and \(-1.4\) p.p. on R@1, R@5 and R@10.
We also provide results using the larger ResNet3D-50 architecture, which increases our performance on both benchmarks and outperforms the state of the art on all metrics, reaching \(\mathbf {83.9\%}\), \(\mathbf {92.2\%}\) and \(\mathbf {94.9\%}\) for R@1, R@5 and R@10 on UCF101, as well as \(\mathbf {45.9\%}\), \(\mathbf {69.9\%}\) and \(\mathbf {80.5\%}\) on HMDB51. Our soft contrastive learning approach makes our representation learn features that cluster similar instances, even for generalization.
Generalization to downstream tasks We follow the protocol introduced by Feichtenhofer et al. [4] to compare the generalization of our ResNet3D-50 backbone on Kinetics400, UCF101, AVA and SSv2 with \(\rho \)SimCLR, \(\rho \)SwAV, \(\rho \)BYOL, \(\rho \)MoCo and supervised learning in Table 20. To ensure a fair comparison, we provide the number of views used by each method and the number of frames per view for pretraining and evaluation.
For 2 views and 8 frames, SCE is on par with \(\rho \)MoCo with 3 views on Kinetics400, AVA and SSv2, but is worse than \(\rho \)BYOL, especially on AVA. On UCF101, results are better than \(\rho \)MoCo and on par with \(\rho \)BYOL. These results indicate that our approach is more effective than contrastive learning, as it reaches similar results to \(\rho \)MoCo while using one less view. Using 16 frames, SCE outperforms all approaches, including supervised training, on UCF101 and SSv2, but performs worse than \(\rho \)BYOL and supervised training on AVA. This study shows that SCE can generalize to various video downstream tasks, which is a criterion of a good learned representation.