1 Introduction
2 Related Work
3 Image Representations
3.1 Traditional Image Representations
3.2 Deep Learnable Image Representations
| | ANet (Krizhevsky et al. 2012) | CNet (Krizhevsky et al. 2012) | Vgg16 (Simonyan and Zisserman 2015) | ResN50 (He et al. 2016) |
|---|---|---|---|---|
| Top-5 error (%) | 19.6 | 19.7 | 9.9 | 7.7 |
| Top-1 error (%) | 42.6 | 42.6 | 28.5 | 24.6 |
| GFLOPs | 0.727 | 0.724 | 16 | 4 |
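For reference, the top-k error reported above counts a sample as misclassified when its ground-truth label is not among the k highest-scoring classes. A minimal numpy sketch of this metric (the `logits` and `labels` names are placeholders, not from the paper):

```python
import numpy as np

def top_k_error(logits, labels, k=5):
    """Fraction of samples whose true label is not among the k
    highest-scoring classes (k=1 gives the top-1 error)."""
    # indices of the k largest scores per row; order within the top-k is irrelevant
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()
```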
4 Properties of Representations
4.1 Equivariance
4.2 Covering and Equivalence
5 Analysis of Equivariance
- HOG, our representative traditional feature extractor, has a high degree of equivariance with similarity transformations (translation, rotation, flip, scale), up to limitations caused by sampling artifacts.
- Deep feature extractors such as ANet, Vgg16, and ResN50 are also highly equivariant up to the layers that still preserve sufficient spatial resolution, since those layers better represent geometry. This is consistent with the fact that such features can be used for geometry-oriented tasks, such as object detection in R-CNN and related methods.
- We also show that, for transformations such as left-right flipping that are present in the data or in data augmentation during training, equivariance in deep feature extractors reduces to invariance. This effect becomes more pronounced as depth increases.
- Finally, we show that simple reconstruction metrics such as the Euclidean distance between features are not necessarily predictive of classification performance; a task-oriented regression objective learns better equivariant maps in most cases (a minimal least-squares baseline is sketched after this list).
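The central object of this section is a learned map \(M_g\) with \(\phi(gx) \approx M_g \phi(x)\). As a minimal baseline (a sketch, not the paper's implementation), such a map can be fit by ridge-regularized least squares on feature pairs of original and transformed images; the plain-numpy code below, with hypothetical variable names, also implements the Euclidean reconstruction metric mentioned in the last point. The task-oriented alternative instead trains \(M_g\) through the network's classification loss, which this sketch does not cover.

```python
import numpy as np

def fit_equivariant_map(phi_x, phi_gx, lam=0.1):
    """Ridge-regression estimate of M_g such that phi(g x) ≈ M_g phi(x).

    phi_x, phi_gx: (n, d) matrices whose rows are features of images x and
    of their transformed versions g x.  Returns the (d, d) map M_g.
    """
    d = phi_x.shape[1]
    G = phi_x.T @ phi_x + lam * np.eye(d)        # regularized Gram matrix
    B = np.linalg.solve(G, phi_x.T @ phi_gx)     # argmin ||phi_gx - phi_x B||^2 + lam ||B||^2
    return B.T                                   # M_g, applied as phi_x @ M_g.T

def reconstruction_error(phi_x, phi_gx, M_g):
    """Mean Euclidean distance between phi(g x) and M_g phi(x)."""
    return np.linalg.norm(phi_gx - phi_x @ M_g.T, axis=1).mean()
```

With `M_g` set to the identity matrix, `reconstruction_error` measures raw invariance of the representation instead.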
5.1 Methods
5.1.1 Learning Equivariance
5.1.2 Regularizer
5.1.3 Loss and Optimization
5.1.4 Transformation Layer
5.2 Results on Traditional Representations
5.2.1 Sparse Regression
5.2.2 Structured Sparse Regression
| k | m | HOG \(3 \times 3\) | HOG \(5 \times 5\) | HOG \(7 \times 7\) | HOG \(9 \times 9\) |
|---|---|---|---|---|---|
| 5 | \(\infty\) | 1.67 | 12.21 | 82.49 | 281.18 |
| 5 | 1 | 0.97 | 2.06 | 3.47 | 5.91 |
| 5 | 3 | 1.23 | 3.90 | 7.81 | 13.04 |
| 5 | 5 | 1.83 | 7.46 | 17.96 | 30.93 |
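The \(m = \infty\) row corresponds to a dense, unconstrained regression, whose cost grows rapidly with the HOG grid size; the structured sparse variant restricts each output cell of \(M_g\) to an \(m \times m\) spatial neighborhood of input cells, which is what keeps the cost in the remaining rows nearly linear in the grid area. A sketch of such a support mask, assuming a centered square window (plain numpy; `h`, `w`, `c` denote hypothetical grid height, width, and channel count):

```python
import numpy as np

def neighborhood_mask(h, w, c, m):
    """Boolean support for a structured-sparse M_g on an h x w x c HOG grid:
    each output cell may depend only on input cells within an m x m window
    centered on it.  Returns an (h*w*c, h*w*c) mask."""
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ii.ravel(), jj.ravel()], axis=1)               # (h*w, 2) cell coords
    r = (m - 1) // 2                                               # window radius
    near = (np.abs(pos[:, None, :] - pos[None, :, :]) <= r).all(-1)
    # allow every channel pair whenever the two cells interact
    return np.kron(near, np.ones((c, c))).astype(bool)
```

Entries of \(M_g\) outside this support are fixed to zero, so only a small fraction of the parameters of the dense map remain free to estimate.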
5.2.3 Regression Quality
5.3 Results on Deep Representations
5.3.1 Regression Methods
5.3.2 Comparing Transformation Types
5.3.3 Qualitative Evaluation
5.3.4 Geometric Invariances
| Layer | Horiz. Flip Num (%) | Vert. Flip Num (%) | Sc. \(2^{-\frac{1}{2}}\) Num (%) | Rot. \(90^{\circ}\) Num (%) |
|---|---|---|---|---|
| C1 | 52 (54.17) | 53 (55.21) | 95 (98.96) | 42 (43.75) |
| C2 | 131 (51.17) | 45 (17.58) | 69 (26.95) | 27 (10.55) |
| C3 | 238 (61.98) | 132 (34.38) | 295 (76.82) | 120 (31.25) |
| C4 | 343 (89.32) | 124 (32.29) | 378 (98.44) | 101 (26.30) |
| C5 | 255 (99.61) | 47 (18.36) | 252 (98.44) | 56 (21.88) |
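Counts of this kind require a per-channel test of invariance. The paper's exact criterion is defined in the text; as a simple stand-in (an assumption of this sketch, not the paper's rule), one can mark a channel invariant when its responses on original and transformed images correlate strongly once the feature maps are spatially re-aligned:

```python
import numpy as np

def count_invariant_channels(feat_x, feat_gx, thresh=0.5):
    """Proxy count of invariant channels (a stand-in criterion, not the
    paper's).

    feat_x, feat_gx: (n, c, h, w) feature maps of x and g x, with the second
    map spatially re-aligned (e.g. flipped back for a horizontal flip).  A
    channel counts as invariant when its responses correlate above `thresh`
    on average over the n samples.
    """
    n, c = feat_x.shape[:2]
    a = feat_x.reshape(n, c, -1)
    b = feat_gx.reshape(n, c, -1)
    a = a - a.mean(-1, keepdims=True)
    b = b - b.mean(-1, keepdims=True)
    corr = (a * b).sum(-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    invariant = corr.mean(axis=0) > thresh     # average over samples, per channel
    return int(invariant.sum()), 100.0 * invariant.mean()
```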
5.4 Application to Structured-Output Regression
| \(\phi(x)\) | Bsln | HOG \(g\) | HOG \(M_g\) | C3 \(g\) | C3 \(M_g\) | C4 \(g\) | C4 \(M_g\) | C5 \(g\) | C5 \(M_g\) |
|---|---|---|---|---|---|---|---|---|---|
| Rot (\(^{\circ}\)) | 23.8 | 14.9 | 17.0 | 13.3 | 11.6 | 10.5 | 11.1 | 10.1 | 13.4 |
| Rot \(\circlearrowleft\) (\(^{\circ}\)) | 86.9 | 18.9 | 19.1 | 13.2 | 15.0 | 12.8 | 15.3 | 12.9 | 17.4 |
| Aff (–) | 0.35 | 0.25 | 0.25 | 0.25 | 0.28 | 0.24 | 0.26 | 0.24 | 0.26 |
| Time/TF (ms) | – | 18.2 | 0.8 | 59.4 | 6.9 | 65.0 | 7.0 | 70.1 | 5.7 |
| Speedup (–) | – | 1 | 21.9 | 1 | 8.6 | 1 | 9.3 | 1 | 12.3 |
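The Time/TF and Speedup rows compare two ways of obtaining features of a transformed image: recomputing \(\phi(gx)\) from scratch (the \(g\) columns) versus applying the learned map \(M_g\) to the cached \(\phi(x)\), which costs a single matrix product (the \(M_g\) columns). A minimal timing harness illustrating the comparison (hypothetical `phi` and `g` callables and `M_g` matrix; not the paper's benchmark code):

```python
import time

def compare_transform_cost(phi, g, M_g, x, n_trials=100):
    """Time phi(g(x)) (recompute) against M_g @ phi(x) (cached features)."""
    f = phi(x)                                   # features computed once
    t0 = time.perf_counter()
    for _ in range(n_trials):
        phi(g(x))                                # brute force: transform, re-extract
    t_recompute = (time.perf_counter() - t0) / n_trials
    t0 = time.perf_counter()
    for _ in range(n_trials):
        M_g @ f                                  # equivariant map on cached features
    t_map = (time.perf_counter() - t0) / n_trials
    return t_recompute, t_map, t_recompute / t_map
```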
6 Analysis of Coverage and Equivalence
- Different networks trained to perform the same task tend to learn representations that are approximately equivalent.
- Deeper and larger representations tend to cover well for shallower and smaller ones, but the converse is not always true. For example, the deeper layers of ANet cover for the shallower layers of the same network, Vgg16 layers cover well for ANet layers, and ResN50 layers cover well for Vgg16 layers; however, Vgg16 layers cannot cover for ResN50 layers.
- Coverage and equivalence tend to be better for layers whose output spatial resolutions match. In fact, a layer's resolution is a better indicator of compatibility than its depth.
- When the same network is trained on two different tasks, shallower layers tend to be equivalent, whereas deeper ones tend to be less so, as they become more task-specific (a minimal stitching sketch follows this list).
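Coverage is measured by stitching: a map \(E\) is trained so that \(E\,\phi'(x)\) can replace \(\phi(x)\) as input to the rest of the target network, and the stitched model is scored on the original task. The sketch below reduces this to a ridge regression on flattened features followed by a relative-accuracy score; the paper instead uses convolutional stitching layers trained with the task loss, and `head` here is a hypothetical classifier function returning class scores:

```python
import numpy as np

def fit_stitching_map(feat_src, feat_dst, lam=0.1):
    """Linear map E with phi_dst(x) ≈ E phi_src(x), fit by ridge regression.

    feat_src, feat_dst: (n, d_src) and (n, d_dst) feature matrices from the
    two representations on the same images.  Returns E of shape (d_dst, d_src).
    """
    d = feat_src.shape[1]
    G = feat_src.T @ feat_src + lam * np.eye(d)
    return np.linalg.solve(G, feat_src.T @ feat_dst).T

def coverage_score(head, feat_src, feat_dst, labels, lam=0.1):
    """Accuracy of the target network's head on stitched features,
    relative to its accuracy on the target's own features."""
    E = fit_stitching_map(feat_src, feat_dst, lam)
    acc_stitched = (head(feat_src @ E.T).argmax(axis=1) == labels).mean()
    acc_native = (head(feat_dst).argmax(axis=1) == labels).mean()
    return acc_stitched / acc_native
```

A score near 1 indicates that the source representation covers for the target one; a clearly lower score indicates that it does not.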
6.1 Methods
6.2 Results
6.2.1 Same Architecture, Different Layers
6.2.2 Same Architecture, Different Tasks
6.2.3 Different Architectures, Same Task