1 Introduction
2 Describing Textures with Attributes
2.1 The Describable Texture Dataset
2.2 Dataset Design and Collection
2.2.1 Selecting the Describable Attributes
2.2.2 Bootstrapping the Key Images
2.2.3 Sequential Joint Annotation
2.3 Benchmark Tasks
2.4 Attributes Versus Materials
2.5 Related Work
2.5.1 Recognition of Perceptual Properties
2.5.2 Recognition of Texture Instances and Material Categories
Dataset | Size | Condition | Content | (I)nstances / | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Images | Classes | Splits | Wild | Clutter | Controlled | Attributes | Materials | Objects | (C)ategories | |
Brodatz | 999 | 111 | – | X | X | I | ||||
CUReT | 5612 | 61 | 10 | X | X | I | ||||
UIUC | 1000 | 25 | 10 | X | X | I | ||||
UMD | 1000 | 25 | 10 | X | X | I | ||||
KTH | 810 | 11 | 10 | X | X | I | ||||
Outex | – | – | – | X | X | X | I | |||
Drexel | \(\sim \)40000 | 20 | – | X | X | I | ||||
ALOT | 25000 | 250 | 10 | X | X | I | ||||
FMD | 1000 | 10 | 14 | X | X | C | ||||
KTH-T2b | 4752 | 11 | X | X | C | |||||
DTD | 5640 | 47 | 10 | X | X | C | ||||
OS | 10422 | 22 | 1 | X | X | X (+A) | X | C |
3 Recognizing Textures in Clutter
3.1 Benchmark Tasks
4 Texture Representations
4.1 Local Image Descriptors
4.1.1 Hand-Crafted Local Descriptors
4.1.2 Learned Local Descriptors
4.2 Pooling Encoders
4.2.1 Orderless Pooling Encoders
4.2.2 Order-Sensitive Pooling Encoders
4.2.3 Post-processing
5 Plan of Experiments and Highlights
- Orderless pooling of SIFT features (e.g. FV-SIFT) performs better than specialized texture descriptors in many texture recognition problems; performance is further improved by switching from SIFT to CNN local descriptors (FV-CNN; Sect. 6.1.3).
- Orderless pooling of CNN descriptors using the Fisher Vector (FV-CNN) is often significantly superior to fully-connected pooling of the same descriptors (FC-CNN) in texture, scene, and object recognition (Sect. 6.1.4). This difference is more marked for deeper CNN architectures (Sect. 6.1.5) and can be partially explained by the ability of FV pooling to overfit less and to easily integrate information at multiple image scales (Sect. 6.1.6).
- FV-CNN descriptors can be compressed to the same dimensionality as FC-CNN descriptors while preserving accuracy (Sect. 6.1.7).
- In texture recognition in the wild, for both materials (FMD) and attributes (DTD), CNN-based descriptors substantially outperform existing methods. Depending on the dataset, FV pooling is slightly or substantially better than FC pooling of CNN descriptors (Sect. 6.2.1.4). When textures are extracted from a larger cluttered scene (instead of filling the whole image), the difference between FV and FC pooling increases (Sect. 6.2.1.5).
- In coarse object recognition (PASCAL VOC), fine-grained object recognition (CUB-200), scene recognition (MIT Indoor), and recognition of things & stuff (MSRC), the FV-CNN representation achieves results that are close to, and sometimes better than, the state of the art, while using a simple and fully generic pipeline (Sect. 6.2.2).
- FV-CNN appears to be particularly effective in domain transfer. Sect. 6.2.3 shows that FV pooling compensates for the domain gap caused by training a CNN on two very different domains, namely scene and object recognition.
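To make the notion of orderless Fisher Vector pooling in the highlights above more concrete, the following is a minimal NumPy sketch of the standard improved Fisher Vector encoding of a set of local descriptors under a diagonal-covariance Gaussian mixture model. It is an illustration under stated assumptions, not the pipeline used in the experiments: the function name `fisher_vector` and its arguments are hypothetical, the GMM parameters are assumed to have been fitted beforehand, and the local descriptors (SIFT or convolutional features) are assumed to be given as rows of a matrix.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Improved Fisher Vector of local descriptors X under a diagonal GMM.

    X:         (T, D) local descriptors (e.g. SIFT or CNN conv features)
    weights:   (K,)   mixture weights, summing to one
    means:     (K, D) component means
    variances: (K, D) diagonal component variances
    Returns a vector of length 2 * K * D.
    (All names here are hypothetical; the GMM is assumed pre-trained.)
    """
    T, D = X.shape
    sigma = np.sqrt(variances)
    # Whitened differences between every descriptor and every component.
    Z = (X[:, None, :] - means[None, :, :]) / sigma[None, :, :]  # (T, K, D)
    # Log of the weighted Gaussian densities, then soft assignments q.
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.sum(Z ** 2, axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :])
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(log_p)
    q /= q.sum(axis=1, keepdims=True)          # (T, K) posteriors
    # First- and second-order statistics, averaged over all descriptors.
    # The sum over t is what makes the encoding orderless.
    u = np.einsum('tk,tkd->kd', q, Z) / (T * np.sqrt(weights)[:, None])
    v = np.einsum('tk,tkd->kd', q, Z ** 2 - 1) / (T * np.sqrt(2 * weights)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])
    # Post-processing: signed square root followed by L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

Because the statistics are sums over descriptors, shuffling the rows of `X` (that is, discarding the spatial layout) leaves the output unchanged, and descriptors extracted at several image scales can simply be stacked into the same matrix before encoding.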
6 Experiments on Semantic Recognition
6.1 Local Image Descriptors and Encoders Evaluation
6.1.1 General Experimental Setup
6.1.2 Datasets and Evaluation Measures
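Several tables in this section report `mAP11` on VOC07. As a rough illustration, and assuming this denotes the 11-point interpolated average precision of the PASCAL VOC 2007 protocol averaged over classes, the per-class measure can be sketched as follows. The function name `ap11` is hypothetical, and ties between scores are broken by input order, which is a choice made here for simplicity.

```python
import numpy as np

def ap11(scores, labels):
    """11-point interpolated average precision for one class.

    scores: classifier scores, higher means more confident
    labels: 1/True for positive images, 0/False for negatives
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Rank images by decreasing score.
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / max(int(hits.sum()), 1)
    ap = 0.0
    # Average the interpolated precision at recall 0, 0.1, ..., 1.
    for r in np.linspace(0.0, 1.0, 11):
        reached = recall >= r - 1e-12
        ap += precision[reached].max() if reached.any() else 0.0
    return ap / 11.0
```

The mean over all classes of `ap11` would then give the mAP11 figure; the plain `mAP` rows in the tables use a different (non-interpolated or fully interpolated) definition that is not reproduced here.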
6.1.3 Local Image Descriptors and Kernels Comparison
Local descr. | Linear kernel | Hellinger kernel | add-\(\chi ^2\) kernel | exp-\(\chi ^2\) kernel |
---|---|---|---|---|
MR8 | 20.8 \(\pm \) 0.9 | 26.2 \(\pm \) 0.8 | 29.7 \(\pm \) 0.9 | 34.3 \(\pm \) 1.1 |
LM | 26.7 \(\pm \) 0.9 | 34.8 \(\pm \) 1.2 | 39.5 \(\pm \) 1.4 | 44.0 \(\pm \) 1.4 |
Patch\(_{3\,\times \,3}\) | 15.9 \(\pm \) 0.5 | 24.4 \(\pm \) 0.7 | 27.8 \(\pm \) 0.8 | 30.9 \(\pm \) 0.7 |
Patch\(_{7\,\times \,7}\) | 20.7 \(\pm \) 0.8 | 30.6 \(\pm \) 1.0 | 34.8 \(\pm \) 1.0 | 37.9 \(\pm \) 0.9 |
LBP\(^{u}\) | 8.5 \(\pm \) 0.4 | 9.3 \(\pm \) 0.5 | 12.5 \(\pm \) 0.4 | 19.4 \(\pm \) 0.7 |
LBP-VQ | 26.2 \(\pm \) 0.8 | 28.8 \(\pm \) 0.9 | 32.7 \(\pm \) 1.0 | 36.1 \(\pm \) 1.3 |
SIFT | 45.2 \(\pm \) 1.0 | 49.1 \(\pm \) 1.1 | 50.9 \(\pm \) 1.0 | 52.3 \(\pm \) 1.2 |
Conv VGG-M | 55.9 \(\pm \) 1.3 | 61.7 \(\pm \) 0.9 | 61.9 \(\pm \) 1.0 | 61.2 \(\pm \) 1.0 |
Conv VGG-VD | 64.1 \(\pm \) 1.3 | 68.8 \(\pm \) 1.3 | 69.0 \(\pm \) 0.9 | 68.8 \(\pm \) 0.9 |
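For reference, the four kernels compared in the table above can be sketched as below for a pair of histograms. This is a minimal illustration that assumes non-negative, L1-normalized inputs (as for bag-of-visual-words histograms); descriptors with negative entries, such as CNN features, would need signed variants that are not shown. The function names and the bandwidth parameter `lam` are hypothetical.

```python
import numpy as np

def linear_kernel(h, g):
    """Plain inner product."""
    return float(np.dot(h, g))

def hellinger_kernel(h, g):
    """Hellinger (Bhattacharyya) kernel: sum_i sqrt(h_i * g_i)."""
    return float(np.sum(np.sqrt(h * g)))

def additive_chi2_kernel(h, g):
    """Additive chi-squared kernel: sum_i 2 * h_i * g_i / (h_i + g_i)."""
    s = h + g
    mask = s > 0  # bins that are empty in both histograms contribute zero
    return float(np.sum(2.0 * h[mask] * g[mask] / s[mask]))

def exp_chi2_kernel(h, g, lam=1.0):
    """Exponential chi-squared kernel: exp(-lam * chi2(h, g)),
    with chi2(h, g) = 0.5 * sum_i (h_i - g_i)^2 / (h_i + g_i)."""
    s = h + g
    mask = s > 0
    dist = 0.5 * np.sum((h[mask] - g[mask]) ** 2 / s[mask])
    return float(np.exp(-lam * dist))
```

The first three kernels are additive over histogram bins, so they can be used with linear classifiers through explicit feature maps, whereas the exponential variant is not additive and is typically used with a kernel SVM.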
6.1.4 Pooling Encoders
Dataset | Meas. (%) | SIFT BoVW | SIFT LLC | SIFT VLAD | SIFT IFV | VGG-M BoVW | VGG-M LLC | VGG-M VLAD | VGG-M IFV | VGG-M FC |
---|---|---|---|---|---|---|---|---|---|---|
DTD | acc | 49.0 \(\pm \) 0.8 | 48.2 \(\pm \) 1.4 | 54.3 \(\pm \) 0.8 | 58.6 \(\pm \) 1.2 | 61.2 \(\pm \) 1.3 | 64.0 \(\pm \) 1.3 | 67.6 \(\pm \) 0.7 | 66.8 \(\pm \) 1.5 | 58.7 \(\pm \) 0.9 |
OS+R | acc | 30.0 | 30.8 | 32.5 | 39.8 | 41.3 | 45.3 | 49.7 | 52.5 | 41.3 |
KTH-T2b | acc | 57.6 \(\pm \) 1.5 | 56.8 \(\pm \) 2.0 | 64.3 \(\pm \) 1.3 | 70.2 \(\pm \) 1.6 | 73.6 \(\pm \) 2.8 | 74.0 \(\pm \) 3.3 | 72.2 \(\pm \) 4.7 | 73.3 \(\pm \) 4.8 | 71.0 \(\pm \) 2.3 |
FMD | acc | 50.5 \(\pm \) 1.7 | 48.4 \(\pm \) 2.2 | 54.0 \(\pm \) 1.3 | 59.7 \(\pm \) 1.6 | 67.9 \(\pm \) 2.2 | 71.7 \(\pm \) 2.1 | 74.2 \(\pm \) 2.0 | 73.5 \(\pm \) 2.0 | 70.3 \(\pm \) 1.8 |
VOC07 | mAP11 | 51.2 | 47.8 | 56.9 | 59.9 | 72.9 | 75.5 | 76.8 | 76.4 | 76.8 |
MIT Indoor | acc | 47.7 | 39.2 | 51.0 | 54.9 | 69.1 | 68.9 | 71.2 | 74.2 | 62.5 |
Dataset | Meas. (%) | SIFT FV | AlexNet FC | AlexNet FV | AlexNet FC\(+\)FV | VGG-M FC | VGG-M FV | VGG-M FC\(+\)FV | VGG-VD FC | VGG-VD FV | VGG-VD FC\(+\)FV | FV-SIFT FC\(+\)FV-VD | SoA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(a) | | | | | | | | | | | | | |
CUReT | acc | 99.0 \(\pm \) 0.2 | 94.4 \(\pm \) 0.4 | 98.5 \(\pm \) 0.2 | 99.0 \(\pm \) 0.2 | 94.2 \(\pm \) 0.3 | 98.7 \(\pm \) 0.2 | 99.1 \(\pm \) 0.2 | 94.5 \(\pm \) 0.4 | 99.0 \(\pm \) 0.2 | 99.2 \(\pm \) 0.2 | 99.7 \(\pm \) 0.1 | 99.8 \(\pm \) 0.1 Sifre and Mallat (2013) |
UMD | acc | 99.1 \(\pm \) 0.5 | 95.9 \(\pm \) 0.9 | 99.7 \(\pm \) 0.2 | 99.7 \(\pm \) 0.3 | 97.2 \(\pm \) 0.9 | 99.9 \(\pm \) 0.1 | 99.8 \(\pm \) 0.2 | 97.7 \(\pm \) 0.7 | 99.9 \(\pm \) 0.1 | 99.9 \(\pm \) 0.1 | 99.9 \(\pm \) 0.1 | 99.7 \(\pm \) 0.3 Sifre and Mallat (2013) |
UIUC | acc | 96.6 \(\pm \) 0.8 | 91.1 \(\pm \) 1.7 | 99.2 \(\pm \) 0.4 | 99.3 \(\pm \) 0.4 | 94.5 \(\pm \) 1.4 | 99.6 \(\pm \) 0.4 | 99.6 \(\pm \) 0.3 | 97.0 \(\pm \) 0.7 | 99.9 \(\pm \) 0.1 | 99.9 \(\pm \) 0.1 | 99.9 \(\pm \) 0.1 | 99.4 \(\pm \) 0.4 Sifre and Mallat (2013) |
KT | acc | 99.5 \(\pm \) 0.5 | 95.5 \(\pm \) 1.3 | 99.6 \(\pm \) 0.4 | 99.8 \(\pm \) 0.2 | 96.1 \(\pm \) 0.9 | 99.8 \(\pm \) 0.2 | 99.9 \(\pm \) 0.1 | 97.9 \(\pm \) 0.9 | 99.8 \(\pm \) 0.2 | 99.9 \(\pm \) 0.1 | 100 | 99.4 \(\pm \) 0.4 Sifre and Mallat (2013) |
ALOT | acc | 94.6 \(\pm \) 0.3 | 86.0 \(\pm \) 0.4 | 96.7 \(\pm \) 0.3 | 97.8 \(\pm \) 0.2 | 88.7 \(\pm \) 0.5 | 97.8 \(\pm \) 0.2 | 98.4 \(\pm \) 0.1 | 90.6 \(\pm \) 0.4 | 98.5 \(\pm \) 0.1 | 99.0 \(\pm \) 0.1 | 99.3 \(\pm \) 0.1 | 95.9 \(\pm \) 0.5 Sulc and Matas (2014) |
(b) | | | | | | | | | | | | | |
KTH-T2b | acc | 70.8 \(\pm \) 2.7 | 71.5 \(\pm \) 1.3 | 69.7 \(\pm \) 3.2 | 72.1 \(\pm \) 2.8 | 71 \(\pm \) 2.3 | 73.3 \(\pm \) 4.7 | 73.9 \(\pm \) 4.9 | 75.4 \(\pm \) 1.5 | 81.8 \(\pm \) 2.5 | 81.1 \(\pm \) 2.4 | 81.5 \(\pm \) 2.0 | 76.0 \(\pm \) 2.9 Sulc and Matas (2014) |
FMD | acc | 59.8 \(\pm \) 1.6 | 64.8 \(\pm \) 1.8 | 67.7 \(\pm \) 1.5 | 71.4 \(\pm \) 1.7 | 70.3 \(\pm \) 1.8 | 73.5 \(\pm \) 2.0 | 76.6 \(\pm \) 1.9 | 77.4 \(\pm \) 1.8 | 79.8 \(\pm \) 1.8 | 82.4 \(\pm \) 1.5 | 82.2 \(\pm \) 1.4 | 57.7 \(\pm \) 1.7 Sharan et al. (2013) |
OS\(+\)R | acc | 39.8 | 36.8 | 46.1 | 49.8 | 41.3 | 52.5 | 54.9 | 43.4 | 59.5 | 60.9 | 58.7 | – |
(c) | | | | | | | | | | | | | |
DTD | acc | 58.6 \(\pm \) 1.2 | 55.1 \(\pm \) 0.6 | 62.9 \(\pm \) 1.4 | 66.5 \(\pm \) 1.1 | 58.8 \(\pm \) 0.8 | 66.8 \(\pm \) 1.6 | 69.8 \(\pm \) 1.1 | 62.9 \(\pm \) 0.8 | 72.3 \(\pm \) 1.0 | 74.7 \(\pm \) 1.0 | 75.5 \(\pm \) 0.8 | – |
DTD | mAP | 61.3 \(\pm \) 1.1 | 57.7 \(\pm \) 0.9 | 66.5 \(\pm \) 1.4 | 70.5 \(\pm \) 1.2 | 62.1 \(\pm \) 0.9 | 70.8 \(\pm \) 1.2 | 74.2 \(\pm \) 1.1 | 67.0 \(\pm \) 1.1 | 76.7 \(\pm \) 0.8 | 79.1 \(\pm \) 0.8 | 80.4 \(\pm \) 0.9 | – |
DTD-J | mAP | 59.6 \(\pm \) 0.6 | 58.4 \(\pm \) 0.7 | 65.0 \(\pm \) 0.9 | 68.3 \(\pm \) 0.9 | 62.8 \(\pm \) 0.7 | 69.8 \(\pm \) 0.9 | 72.9 \(\pm \) 0.9 | 67.3 \(\pm \) 0.9 | 75.8 \(\pm \) 0.6 | 77.5 \(\pm \) 0.8 | 78.9 \(\pm \) 0.7 | – |
OSA\(+\)R | mAP | 56.5 | 53.9 | 62.1 | 64.6 | 54.3 | 65.2 | 67.9 | 49.7 | 67.2 | 67.9 | 68.2 | – |
(d) | | | | | | | | | | | | | |
MSRC\(+\)R | acc | 85.7 | 83.6 | 91.7 | 94.9 | 85.0 | 95.4 | 96.9 | 79.4 | 97.7 | 98.8 | 99.1 | – |
MSRC\(+\)R | msrc-acc | 92.0 | 84.1 | 95.0 | 97.3 | 84.0 | 97.6 | 98.1 | 82.0 | 99.2 | 99.6 | 99.5 | – |
VOC07 | mAP11 | 59.9 | 74.0 | 73.1 | 76.8 | 76.8 | 76.4 | 79.5 | 81.7 | 84.9 | 85.1 | 84.5 | 85.2 Wei et al. (2014) |
VOC07 | mAP | 60.2 | 76.0 | 75.0 | 79.0 | 79.2 | 78.7 | 82.3 | 84.6 | 88.6 | 88.5 | 87.9 | 85.2 Wei et al. (2014) |
MIT Ind. | acc | 54.9 | 58.6 | 69.7 | 71.6 | 62.5 | 74.2 | 74.4 | 67.6 | 81.0 | 80.3 | 80.0 | 70.8 Zhou et al. (2014) |
CUB | acc | 17.5 | 45.8 | 49 | 54.1 | 46.1 | 49.9 | 54.9 | 54.6 | 66.7 | 67.3 | 65.4 | 73.9\(^*\) Zhang et al. (2014) |
CUB\(+\)R | acc | 27.7 | 54.5 | 62.6 | 65.2 | 56.5 | 65.5 | 68.1 | 62.8 | 73.0 | 74.9 | 73.6 | 76.37 Zhang et al. (2014) |
6.1.5 CNN Descriptor Variants Comparison
conv3) and then the rate tapers off somewhat. The performance of the earlier levels in VGG-VD is much worse than that of the corresponding layers in VGG-M. In fact, VGG-VD only matches the performance of the deepest (fifth) layer in VGG-M at conv5_1, which has depth 13.
6.1.6 FV Pooling Versus FC Pooling
Dataset | Meas. (%) | VGG-M FC (SS) | VGG-M FC (MS) | VGG-M FV (SS) | VGG-M FV (MS) | VGG-VD FC (SS) | VGG-VD FC (MS) | VGG-VD FV (SS) | VGG-VD FV (MS) |
---|---|---|---|---|---|---|---|---|---|
KTH-T2b | acc | 71 \(\pm \) 2.3 | 68.9 \(\pm \) 3.9 | 69.0 \(\pm \) 2.8 | 73.3 \(\pm \) 4.7 | 75.4 \(\pm \) 1.5 | 75.1 \(\pm \) 3.8 | 74.5 \(\pm \) 4.4 | 81.8 \(\pm \) 2.5 |
FMD | acc | 70.3 \(\pm \) 1.8 | 69.3 \(\pm \) 1.8 | 71.6 \(\pm \) 2.4 | 73.5 \(\pm \) 2.0 | 77.4 \(\pm \) 1.8 | 78.1 \(\pm \) 1.7 | 79.4 \(\pm \) 2.5 | 79.8 \(\pm \) 1.8 |
DTD | acc | 58.8 \(\pm \) 0.8 | 59.9 \(\pm \) 1.1 | 62.8 \(\pm \) 1.5 | 66.8 \(\pm \) 1.6 | 62.9 \(\pm \) 0.8 | 65.3 \(\pm \) 1.5 | 69.2 \(\pm \) 0.8 | 72.3 \(\pm \) 1.0 |
VOC07 | mAP11 | 76.8 | 78 | 74.8 | 76.4 | 81.7 | 83.2 | 84.7 | 84.9 |
MIT Ind. | acc | 62.5 | 66.1 | 68.1 | 74.2 | 67.6 | 75.3 | 76.8 | 81.0 |
6.1.7 Dimensionality Reduction of the CNN Descriptors
6.1.8 Visualization of Descriptors
6.2 Evaluating Texture Representations on Different Domains
6.2.1 Texture Recognition
6.2.2 Object and Scene Recognition
6.2.3 Domain Transfer
CNN | FC-CNN accuracy (%) | FV-CNN accuracy (%) | FC+FV-CNN accuracy (%) |
---|---|---|---|
PLACES | 65.0 | 67.6 | 73.1 |
CAFFE | 58.6 | 69.7 | 71.6 |
VGG-M | 62.5 | 74.2 | 74.4 |
VGG-VD | 67.6 | 81.0 | 80.3 |
7 Experiments on Semantic Segmentation
7.1 Experimental Setup
7.2 Dense-CRF Post-processing
7.3 Analysis
Dataset | Measure (%) | VGG-M FC-CNN | VGG-M FV-CNN | VGG-M FC+FV-CNN | VGG-VD FC-CNN | VGG-VD FV-CNN | VGG-VD FC+FV-CNN | CRF | SoA |
---|---|---|---|---|---|---|---|---|---|
OS | pp-acc | 36.0 | 48.6 (46.9) | 49.8 | 38.5 | 55.5 (55.7) | 55.9 | 56.5 | – |
OSA | acc-osa (2) | 42.8 | 66.0 | 63.4 | 42.1 | 67.9 | 64.6 | 68.9 | – |
MSRC | acc-msrc (3) | 56.1 | 82.3 | 75.2 | 57.7 | 86.9 | 81.5 | 90.2 | 86.5 Ladicky et al. (2010) |
8 Applications of Describable Texture Attributes
8.1 Describable Attributes as Generic Texture Descriptors
DTD classifier method | KTH-T2b (Linear) | KTH-T2b (RBF) | FMD (Linear) | FMD (RBF) |
---|---|---|---|---|
FV-SIFT | 64.74 \(\pm \) 2.36 | 67.75 \(\pm \) 2.89 | 49.24 \(\pm \) 1.73 | 52.53 \(\pm \) 1.26 |
FV-CNN | 67.39 \(\pm \) 3.75 | 67.66 \(\pm \) 3.30 | 62.81 \(\pm \) 1.33 | 64.69 \(\pm \) 1.41 |
FV-CNN-VD | 74.59 \(\pm \) 2.45 | 74.71 \(\pm \) 1.96 | 70.81 \(\pm \) 1.39 | 73.09 \(\pm \) 1.35 |
FV-SIFT \(+\) FC-CNN | 73.98 \(\pm \) 1.24 | 74.53 \(\pm \) 1.14 | 64.20 \(\pm \) 1.65 | 67.13 \(\pm \) 1.95 |
FV-SIFT \(+\) FC-CNN-VD | 74.52 \(\pm \) 2.31 | 77.14 \(\pm \) 1.36 | 69.21 \(\pm \) 1.77 | 72.17 \(\pm \) 1.66 |
Previous best | 76.0 \(\pm \) 2.9 Sulc and Matas (2014) | | | |