1 Introduction
2 Related works
3 The proposed approach
3.1 Hierarchical deep feature extraction module
- SIFT-mask: let$$\begin{aligned} S = \left\{ \left( x^{(i)}, y^{(i)} \right) \right\} _{i=1}^{n} \end{aligned}$$be the set of SIFT key-point locations extracted from an image of size \(W_I \times H_I\). Each location on the \(W \times H\) spatial grid of the convolutional feature tensor corresponds to a local deep convolutional feature. Based on the property that convolutional layers preserve the spatial information of the input image [49], we select the subset of grid locations that correspond to the SIFT key-points; in this way, we discard features coming from the background and keep the foreground ones (a sketch of all three masks is given after this list). The SIFT-mask is$$\begin{aligned} M_{\mathrm{SIFT}} = \left\{ \left( x_{\mathrm{SIFT}}^{(i)}, y_{\mathrm{SIFT}}^{(i)} \right) \right\} \end{aligned}$$where$$\begin{aligned} x_{\mathrm{SIFT}}^{(i)} = \mathrm{round}\left( \frac{x^{(i)} W}{W_{I}} \right) \quad \hbox {and} \quad y_{\mathrm{SIFT}}^{(i)} = \mathrm{round}\left( \frac{y^{(i)} H}{H_{I}} \right) \end{aligned}$$
- Max-mask: we select the subset of local convolutional features with the highest activation values, with the goal of capturing the most prominent object structures in the input image. Specifically, for each of the K feature maps we select the location of its maximum activation value:$$\begin{aligned} M_{\mathrm{MAX}}&= \left\{ \left( x^{(k)}_{\mathrm{MAX}}, y^{(k)}_{\mathrm{MAX}} \right) \right\} \quad k=1,\ldots ,K\\ \left( x^{(k)}_{\mathrm{MAX}}, y^{(k)}_{\mathrm{MAX}} \right)&= \mathop {\hbox {arg\,max}}\limits _{(x,y)} F_{(x,y)}^{k} \end{aligned}$$
- SUM-mask: this mask is based on the idea that a local convolutional feature is more informative if it has high values on many feature maps, so that the sum of its values across channels is higher. In other words, if many channels are activated in the same image region, there is a high probability that an object of interest lies in that region. We define the SUM-mask as$$\begin{aligned} M_{\mathrm{SUM}} = \left\{ (x,y) \; \Big \vert \; \sum _{(x,y)} F \ge \alpha \right\} \end{aligned}$$(2)where$$\begin{aligned} \sum _{(x,y)} F = \sum _{k=1}^{K} F_{(x,y)}^{k} \end{aligned}$$and the threshold is$$\begin{aligned} \alpha = \hbox {median}\left( \sum F\right) \quad \hbox {or} \quad \alpha = \hbox {average}\left( \sum F\right) \end{aligned}$$In the evaluation section, we report results for both choices of \(\alpha \).
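To make the three masks concrete, the following is a minimal sketch of how they can be computed, assuming the convolutional feature tensor `F` is available as a NumPy array of shape \(K \times H \times W\) (e.g., the \( pool_5 \) output) and using OpenCV's SIFT detector; the function and variable names are illustrative and not the authors' implementation.

```python
import cv2
import numpy as np

def sift_mask(image_bgr, H, W):
    """Map SIFT key-point locations (image grid W_I x H_I) onto the
    W x H spatial grid of the convolutional feature tensor."""
    H_I, W_I = image_bgr.shape[:2]
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = cv2.SIFT_create().detect(gray, None)
    mask = set()
    for kp in keypoints:
        x, y = kp.pt                            # key-point location in image coordinates
        x_s = int(round(x * W / W_I))           # x_SIFT = round(x * W / W_I)
        y_s = int(round(y * H / H_I))           # y_SIFT = round(y * H / H_I)
        mask.add((min(x_s, W - 1), min(y_s, H - 1)))
    return mask

def max_mask(F):
    """One location per feature map: the arg-max of F^k over the spatial grid."""
    K, H, W = F.shape
    mask = set()
    for k in range(K):
        y, x = np.unravel_index(np.argmax(F[k]), (H, W))
        mask.add((x, y))
    return mask

def sum_mask(F, threshold="median"):
    """Locations whose channel-wise sum exceeds alpha (median or average of the sums)."""
    S = F.sum(axis=0)                           # sum over the K channels, shape (H, W)
    alpha = np.median(S) if threshold == "median" else np.mean(S)
    ys, xs = np.where(S >= alpha)
    return set(zip(xs.tolist(), ys.tolist()))
```

Each mask is a set of \((x, y)\) grid locations; the local features kept for aggregation are the K-dimensional vectors `F[:, y, x]` for every \((x, y)\) in the mask.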
3.2 Ontology alignment
Field | Description
---|---
‘Segmentation’ | List of vertices for the segmentation
‘Area’ | Image total area
‘Iscrowd’ | 0 if only one object is represented, 1 otherwise
‘Bbox’ | Bounding box coordinates
‘Category_id’ | Category identifier
‘Image_id’ | Image identifier
‘Id’ | Annotation identifier
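As an illustration, an annotation record carrying these fields, laid out as in COCO-style JSON, could look like the following sketch; all values are made up for the example.

```python
# One illustrative annotation record using the fields listed above
# (COCO-style layout; every value here is invented for the example).
annotation = {
    "segmentation": [[214.5, 41.2, 299.0, 52.8, 301.7, 183.4, 210.0, 179.9]],  # polygon vertices
    "area": 262144,                        # image total area, as described above
    "iscrowd": 0,                          # only one object is represented
    "bbox": [210.0, 41.2, 91.7, 142.2],    # bounding box coordinates [x, y, width, height]
    "category_id": 18,                     # category identifier
    "image_id": 397133,                    # image identifier
    "id": 82445,                           # annotation identifier
}
```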
- Intrinsic terminological techniques: for each word, we perform a stemming operation.
- Extrinsic terminological techniques: we use WordNet as a thesaurus to keep track of lexical variations of the same term.
- Structural techniques: we use the hyponym relation (a sketch of these three operations is given after this list).
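The following is a minimal sketch of the three alignment operations using NLTK's Porter stemmer and WordNet interface; the choice of NLTK is an assumption, as the paper does not name a specific toolkit.

```python
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn    # requires: nltk.download('wordnet')

stemmer = PorterStemmer()

# Intrinsic terminological technique: stemming reduces lexical variants
# of a label to a common root before comparison.
print(stemmer.stem("dogs"), stemmer.stem("dog"))       # 'dog' 'dog'

# Extrinsic terminological technique: WordNet as a thesaurus,
# collecting the synonyms (lemma names) of a term.
synonyms = {lemma for s in wn.synsets("dog", pos=wn.NOUN) for lemma in s.lemma_names()}

# Structural technique: the hyponym relation yields more specific concepts,
# usable to relate a specific class to a more general ontology node.
dog = wn.synset("dog.n.01")
hyponyms = [h.name() for h in dog.hyponyms()]          # e.g. 'corgi.n.01', 'puppy.n.01'
```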
4 Evaluation strategy
Dataset | N. total images | N. object categories | N. images per category | N. test images | N. training images
---|---|---|---|---|---
Corel-10 | 10,000 | 10 | 100 | 1000 | 9000
Caltech-101 | 8677 | 101 | 40–800 | 404 | 8273
Dataset | N. coarse categories | N. species categories | N. total images | N. test images | N. training images
---|---|---|---|---|---
Stanford-dog | 1 | 120 | 20,580 | 481 | 20,099
Oxford-dog | 2 | 25/12 | 10,000 | 148 | 9852
4.1 Feature extraction evaluation
Pooling | Caltech-101 | Corel-10 | Avg (general) | Stanford-dog | Oxford-pet | Avg (fine-grained)
---|---|---|---|---|---|---
Max | 0.621 | 0.888 | 0.7545 | 0.211 | 0.469 | 0.34
Average | 0.528 | 0.874 | 0.701 | 0.249 | 0.67 | 0.4595
Avg and max | 0.52 | 0.833 | 0.6765 | 0.359 | 0.353 | 0.356
Sum | 0.211 | 0.33 | 0.2705 | 0.299 | 0.2 | 0.2495
Pooling | Caltech-101 | Corel-10 | Avg (general) | Stanford-dog | Oxford-pet | Avg (fine-grained)
---|---|---|---|---|---|---
Max | 0.888 | 0.966 | 0.927 | 0.643 | 0.744 | 0.6935
Average | 0.844 | 0.97 | 0.907 | 0.67 | 0.764 | 0.717
Avg and max | 0.856 | 0.96 | 0.908 | 0.647 | 0.751 | 0.699
Sum | 0.332 | 0.522 | 0.427 | 0.21 | 0.212 | 0.211
- \( pool_5 \) layer feature extraction;
- max-pooling aggregation method (a sketch of the aggregation step is given after this list).
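As an illustration of the aggregation step, the following is a minimal sketch, not the authors' code, that turns the \( pool_5 \) tensor into a global descriptor by max, average, or sum pooling, optionally restricted to the grid locations selected by one of the masks defined in Sect. 3.1.

```python
import numpy as np

def aggregate(F, method="max", mask=None):
    """Aggregate a conv feature tensor F of shape (K, H, W) into a K-dim descriptor.

    If a mask (a set of (x, y) grid locations) is given, only the local
    features at those locations are aggregated.
    """
    K, H, W = F.shape
    if mask is not None:
        xs, ys = zip(*mask)
        feats = F[:, list(ys), list(xs)]      # shape (K, |mask|)
    else:
        feats = F.reshape(K, H * W)           # all H*W local features
    if method == "max":
        return feats.max(axis=1)              # max-pooling
    if method == "average":
        return feats.mean(axis=1)             # average-pooling
    if method == "sum":
        return feats.sum(axis=1)              # sum-pooling
    raise ValueError(f"unknown pooling method: {method}")

# Example: descriptor from max-pooling over the SUM-mask locations
# F = ...  pool_5 features of one image, shape (K, H, W)
# descriptor = aggregate(F, method="max", mask=sum_mask(F))
```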
Pooling | Dataset | MAX | SUM (Mean) | SUM (Median) | SIFT
---|---|---|---|---|---
Max | Stanford-dog | 0.656 | 0.682 | 0.68 | 0.21
Max | Oxford-pet | 0.469 | 0.475 | 0.474 | 0.27
Average | Stanford-dog | 0.678 | 0.714 | 0.712 | 0.134
Average | Oxford-pet | 0.462 | 0.49 | 0.488 | 0.28
Max and average | Stanford-dog | 0.658 | 0.688 | 0.685 | 0.103
Max and average | Oxford-pet | 0.47 | 0.48 | 0.478 | 0.04
Sum | Stanford-dog | 0.678 | 0.714 | 0.712 | 0.28
Sum | Oxford-pet | 0.462 | 0.494 | 0.488 | 0.11
Pooling | Dataset | MAX | SUM (Mean) | SUM (Median) | SIFT
---|---|---|---|---|---
Max | Stanford-dog | 0.85 | 0.9 | 0.925 | 0.17
Max | Oxford-pet | 0.844 | 0.81 | 0.837 | 0.2
Average | Stanford-dog | 0.864 | 0.851 | 0.837 | 0.12
Average | Oxford-pet | 0.925 | 0.925 | 0.95 | 0.1
Max and average | Stanford-dog | 0.85 | 0.81 | 0.844 | 0.028
Max and average | Oxford-pet | 0.85 | 0.9 | 0.95 | 0.05
Sum | Stanford-dog | 0.925 | 0.925 | 0.95 | 0.05
Sum | Oxford-pet | 0.864 | 0.851 | 0.837 | 0.002
4.2 Ontology population evaluation
SynsID | n00007846 | n00015388 | n00017222 | n00019128 | n00021939 | n09287968 | Avg
---|---|---|---|---|---|---|---
Accuracy | 0.92 | 0.999 | 0.865 | 0.917 | 0.82 | 0.56 | 0.846
SynsID | n00015388 | n00523513 | n12992868 | Avg
---|---|---|---|---
Accuracy | 0.641 | 0.998 | 0.858 | 0.832
Synset | SynsID | N. nodes
---|---|---
Sport | n00523513 | 18
Fungus | n12992868 | 24
Animal | n00015388 | 28
Plant | n00017222 | 7
Artifact | n00021939 | 187
Natural object | n00019128 | 7
Geological formation | n09287968 | 3
Person | n00007846 | 2