1 Introduction
2 Related work
2.1 Hubness and its reduction
2.2 Previous comparisons of hubness reduction
3 Hubness reduction methods
3.1 Measuring hubness
3.1.1 k-occurrence
3.1.2 Hubness
3.2 Methods based on repairing asymmetric relations
3.2.1 Local scaling and the non-iterative contextual dissimilarity measure
3.2.2 Global scaling: mutual proximity
3.2.3 Shared nearest neighbors and simhub
3.3 Methods based on spatial centrality reduction and density gradient flattening
3.3.1 Centering and localized centering
3.3.2 Global and local dissimilarity measures
3.4 Hubness-resistant dissimilarity measures
3.4.1 Choosing \(\ell ^{p}\) norms and the \(m_p\)-dissimilarity measure
3.5 Time and space complexity
Input data | Time complexity | Time (s) \(B_{1000}\) | Time (s) \(B_{10000}\) | Parameters | |
---|---|---|---|---|---|
Eucl | Vectors |
\(\mathcal {O}(n^2 m)\)
| 0.1 | 26 | – |
cos | Vectors |
\(\mathcal {O}(n^2 m)\)
| 0.3 | 48 | – |
MP | Distances |
\(\mathcal {O}(n^3)\)
| 1.6 | 1382 | – |
MP\(^\mathrm{GaussI}\) | Distances |
\(\mathcal {O}(n^2)\)
| 0.2 | 8 | – |
LS | Distances |
\(\mathcal {O}(n^2)\)
|
\(<\, 0.1\)
| 4 |
\(k=10\)
|
NICDM | Distances |
\(\mathcal {O}(n^2)\)
|
\(<\, 0.1\)
| 3 |
\(k=10\)
|
SNN | Distances |
\(\mathcal {O}(n^3)\)
| 0.7 | 754 |
\(k=10\)
|
simhub\(^\mathrm{IN}\) | Distances |
\(\mathcal {O}(n^3)\)
| 2.5 | 2510 |
\(s=10\)
|
CENT | Vectors |
\(\mathcal {O}(n^2 m)\)
| 0.1 | 33 | – |
LCENT | Vectors |
\(\mathcal {O}(n^2 m)\)
| 0.5 | 71 |
\(\kappa =10, \gamma =1.5\)
|
DSG | Vectors |
\(\mathcal {O}(n^2 m)\)
| 0.2 | 46 | – |
DSL | Vectors |
\(\mathcal {O}(n^2 m)\)
| 0.3 | 51 |
\(k=10\)
|
\(\ell ^p\) norm | Vectors |
\(\mathcal {O}(n^2 m)\)
| 29.8 | 30018 |
\(p=1.5\)
|
\(m_p\)-dissim | Vectors |
\(\mathcal {O}(n^2 m)\)
| 75.8 | 63143 | \(p=1.5\), |
\(n_\text {bins}=200\)
|
3.6 OFAI Hub-Toolbox
4 Evaluation
4.1 Evaluation scheme
Parameter 1 | Range | Parameter 2 | Range | |
---|---|---|---|---|
MP | – | – | – | – |
MP\(^\mathrm{GaussI}\) | – | – | – | – |
LS | k | [1, 50] | – | – |
NICDM | k | [1, 50] | – | – |
SNN | k | [1, 50] | – | – |
simhub\(^\mathrm{IN}\) | s | [1, n] | – | – |
CENT | – | – | – | – |
LCENT |
\(\kappa \)
| [5, 100] |
\(\gamma \)
| ]0., 5.] |
DSG | – | – | – | – |
DSL |
\(\kappa \)
| [1, n] | – | – |
\(\ell ^p\) norm | p | ]0., 5.] | – | – |
\(m_p\)-dissim | p | ]0., 5.] | nbins | [10, \(\min (\frac{n}{2}, 200)\)] |
\({{\mathrm{kNN}}}\)
| k | [1, 10], 15, 20, 25, 30, 40, 50, 100 | Weights | {Uniform, distance} |
# | Source | Name | Cls. |
n
|
m
|
\(m_{mle}\)
|
d
|
\(C^{{{\mathrm{kNN}}}}\)
|
\(S^{k=10}\)
|
---|---|---|---|---|---|---|---|---|---|
1 | UCI | Opportunity activity recognition | 5 | \(^\mathrm{*}\)2000 | 238 | 2 |
\(\ell ^2\)
| 0.8995 | − 0.1156 |
2 | LibSVM | Fourclass (sc) | 2 | 862 | 2 | 2 |
\(\ell ^2\)
| 1.0000 | 0.1528 |
3 | LibSVM | Liver-disorders (sc) | 2 | 345 | 6 | 4 |
\(\ell ^2\)
| 0.6260 | 0.1795 |
4 | OpenML | Spectrometer | 2 | 531 | 101 | 5 |
\(\ell ^2\)
| 0.9661 | 0.1903 |
5 | UCI | Mice protein expression | 8 | 1080 | 77 | 2 | cos | 0.9787 | 0.1961 |
6 | LibSVM | Australian | 2 | 690 | 14 | 2 |
\(\ell ^2\)
| 0.6928 | 0.2011 |
7 | UCI | Chronic kidney disease | 2 | 400 | 24 | 2 | cos | 0.7400 | 0.2745 |
8 | UCI | Parkinson speech | 2 | 1208 | 26 | 3 |
\(\ell ^2\)
| 0.6755 | 0.3932 |
9 | UCI | Arcene | 2 | 100 | 10000 | 8 |
\(\ell ^2\)
| 0.7500 | 0.4428 |
10 | LibSVM | Breast-cancer (sc) | 2 | 683 | 10 | 1 |
\(\ell ^2\)
| 0.9693 | 0.5686 |
11 | LibSVM | Heart (sc) | 2 | 270 | 13 | 3 |
\(\ell ^2\)
| 0.8222 | 0.5734 |
12 | LibSVM | Colon-cancer | 2 | 62 | 2000 | 10 |
\(\ell ^2\)
| 0.7742 | 0.5950 |
13 | LibSVM | Diabetes (sc) | 2 | 768 | 8 | 5 | cos | 0.7656 | 0.5950 |
14 | UCI | mfeat-karhunen | 10 | 2000 | 64 | 7 |
\(\ell ^2\)
| 0.9755 | 0.6898 |
15 | KR | Ovarian-61902 | 2 | 253 | 15154 | 8 |
\(\ell ^2\)
| 0.9447 | 0.7603 |
16 | OpenML | semeion | 10 | 1593 | 256 | 12 |
\(\ell ^2\)
| 0.9165 | 0.8012 |
17 | UCI | mfeat-factors | 10 | 2000 | 216 | 6 |
\(\ell ^2\)
| 0.9570 | 0.8193 |
18 | LibSVM | Duke (train) | 2 | 38 | 7129 | 11 |
\(\ell ^2\)
| 0.7105 | 0.8275 |
19 | CP | c224a web | 14 | 224 | 1244 | 21 | cos | 0.9286 | 0.8345 |
20 | LibSVM | German numbers (sc) | 2 | 1000 | 24 | 5 |
\(\ell ^2\)
| 0.7320 | 0.8835 |
21 | UCI | mfeat-pixels | 10 | 2000 | 240 | 9 |
\(\ell ^2\)
| 0.9785 | 0.9642 |
22 | KR | amlall | 2 | 72 | 7129 | 11 |
\(\ell ^2\)
| 0.9167 | 1.1655 |
23 | LibSVM | Sonar (sc) | 2 | 208 | 60 | 5 |
\(\ell ^2\)
| 0.8462 | 1.1798 |
24 | UCI | Student alcohol consumption | 5 | 1044 | 56 | 4 |
\(\ell ^2\)
| 0.4157 | 1.2105 |
25 | LibSVM | Splice (sc) | 2 | 1000 | 60 | 7 | cos | 0.7720 | 1.2288 |
26 | KR | Lungcancer | 2 | 181 | 12533 | 10 |
\(\ell ^2\)
| 1.0000 | 1.2483 |
27 | UCI | p53 mutants | 2 | 1430 | 5408 | 6 | cos | 0.9294 | 1.2574 |
28 | Corel | corel1000 | 10 | 1000 | 192 | 7 |
\(\ell ^2\)
| 0.6820 | 1.4179 |
29 | UCI | Arrhythmia | 13 | 452 | 279 | 10 |
\(\ell ^2\)
| 0.5951 | 1.4994 |
30 | LibSVM | rcv1.multiclass | 42 | \(^\mathrm{*}\)2000 | 47236 | 10 | cos | 0.7550 | 1.5359 |
31 | UCI | Reuters-transcribed | 10 | 201 | 2730 | 23 |
\(\ell ^2\)
| 0.5672 | 1.6147 |
32 | LibSVM | Ionosphere (sc) | 2 | 351 | 34 | 5 |
\(\ell ^2\)
| 0.8917 | 1.6763 |
33 | OpenML | OVA uterus | 2 | 1545 | 10936 | 13 |
\(\ell ^2\)
| 0.9346 | 1.7759 |
34 | OpenML | Lymphoma | 11 | 96 | 4026 | 9 |
\(\ell ^2\)
| 0.8646 | 1.8785 |
35 | UCI | Farm ads | 2 | 4143 | 54877 | 1 | cos | 0.8938 | 1.9327 |
36 | UCI | Gisette | 2 | 6000 | 5000 | 51 | cos | 0.9783 | 1.9667 |
37 | OpenML | AP breast ovary | 2 | 542 | 10936 | 14 |
\(\ell ^2\)
| 0.9004 | 1.9812 |
38 | OpenML | Hepatitis C | 3 | 283 | 54621 | 25 |
\(\ell ^2\)
| 0.8975 | 2.1221 |
39 | UCI | Dorothea | 2 | 800 | 100000 | 253 | cos | 0.9375 | 2.3578 |
40 | UCI | CNAE-9 | 9 | 1080 | 856 | 5 | cos | 0.8713 | 2.5492 |
41 | UCI | Dexter | 2 | 300 | 20000 | 34 |
\(\ell ^2\)
| 0.8667 | 3.3307 |
42 | MLDATA | DMOZ | 5 | 1329 | 10630 | 5 | cos | 0.4981 | 3.6351 |
43 | UCI | Amazon commerce reviews | 50 | 1500 | 10000 | 11 | cos | 0.3573 | 4.1013 |
44 | LibSVM | Protein | 3 | \(^\mathrm{*}\)2000 | 357 | 34 | cos | 0.5845 | 4.2757 |
45 | PaBo | Movie-reviews | 2 | 2000 | 10382 | 44 |
\(\ell ^2\)
| 0.7935 | 4.3452 |
46 | UCI | Mini-newsgroups | 20 | 2000 | 8811 | 16 |
\(\ell ^2\)
| 0.8330 | 4.3705 |
47 | LibSVM | Sector | 104 | \(^\mathrm{*}\)2000 | 55197 | 11 | cos | 0.7015 | 5.5795 |
48 | OpenML | Wap.wc | 20 | 1560 | 8460 | 12 | cos | 0.5045 | 9.3380 |
49 | CP | c1ka-twitter | 17 | 969 | 49820 | 44 | cos | 0.3633 | 10.7119 |
50 | LibSVM | dna | 3 | 2000 | 180 | 5 |
\(\ell ^2\)
| 0.8790 | 15.5188 |
4.2 Data sets
-
Already used in a previous study [49]: arcene, amlall, gisette, mfeat-factors, mfeat-karhunen, mfeat-pixels, heart, sonar, dexter, mini-newsgroups, dorothea, lungcancer, reuters-transcribed, ovarian 61902, australian, diabetes, german numbers, liver-disorders, breast-cancer, duke (train), colon-cancer, fourclass, ionosphere, splice, c1ka-twitter, c224a-web, corel1000, movie-reviews. Please note that ballroom and ismir2004 were omitted, because they use symmetrized Kullback–Leibler divergence, which is non-trivial to combine with some of the methods evaluated here.
-
UCI Machine Learning Repository [38]: Parkinson Speech Dataset with Multiple Types of Sound Recordings [46], Amazon Commerce reviews [55], p53 Mutants [13], CNAE-9 [12], Student Alcohol Consumption [42], Arrhythmia [24], Farm Ads [41], Mice Protein Expression [28], Opportunity Activity Recognition [11], Chronic Kidney Disease [53]