1 Introduction
2 Background and Related Work
Notation | Description |
---|---|
n
| Number of observations |
p
| Number of variables |
q
| Number of clusters |
X
| \(\{x_{ij}\} , i = 1, 2, \ldots , n ,\, j = 1, 2, \ldots , p\), n × p data matrix of p variables measured on n independent observations |
\({\bar{X}}\)
| q × p matrix of cluster means |
\({\bar{x}}\)
| Centroid of data matrix X |
\(n_{k}\)
| Number of objects in cluster \(C_{k}\) |
\(c_{k}\)
| Centroid of cluster \(C_{k}\) |
\(x_{i}\)
| p-dimensional vector of observations of the ith object in cluster \(C_{k}\) |
\(\parallel {x}\parallel\)
|
\((x^{T}x)^{1/2}\)
|
\(W_{q}=\sum _{k=1}^{q} \sum _{i\in C_{k}}(x_{i}-c_{k})(x_{i}-c_{k})^{T}\)
| Within-group dispersion matrix for data clustered into q clusters |
\(B_{q} =\sum _{k=1}^{q}n_{k}(c_{k}-{\bar{x}})(c_{k}-{\bar{x}})^{T}\)
| Between-group dispersion matrix for data clustered into q clusters |
\(T=\sum _{i=1}^{n}(x_{i}-{\bar{x}})(x_{i}-{\bar{x}})^{T}\)
| Total dispersion matrix of the data |
2.1 Calinski and Harabasz (CH) Index
2.2 Krzanowski and Lai (KL) Index
-
\({\mathrm{DIFF}}_{q} = {(q-1)^{2/p} {\mathrm{trace}}(W_{q-1})-q^{2/p}{\mathrm{trace}}(W_{q})}\).
2.3 Silhouette Index
-
\(S(i)=\frac{b(i)-a(i)}{max\{a(i);b(i)\}} ,\)
-
\(a(i)=\frac{\sum _{j\in \{C_{r} \smallsetminus i\} }d_{ij}}{n_{r}-1}\) is the average dissimilarity of the ith object to all other objects of cluster \(C_{r},\)
-
\(b(i)=min_{_{s\ne r}}\{d_{iC_{s}}\} ,\)
-
\(d_{iC_{s}}=\frac{\sum _{j\in C_{s}}d_{ij}}{n_{s}}\) is the average dissimilarity of the ith object to all objects of cluster \(C_{s}\)
2.4 Gap Index
-
\(s_{q} = sd_{q}\sqrt{1+1/B} ,\)
-
\(sd_{q}\) is the standard deviation of \(\{log W_{qb}\}\), \(b = 1, \ldots , B,\)
-
\(sd_{q} =\sqrt{\frac{1}{B}\sum _{b=1}^{B}(log W_{qb}-{\bar{l}})^{2}},\)
-
\({\bar{l}}=\frac{1}{B}\sum _{b=1}^{B}log W_{qb}.\)
3 Proposed Method
3.1 Data Depth
3.2 DeD Method
4 Experimental Results
4.1 Synthetic Datasets
Dataset | CH index | KL index | Silhouette index | GAP index | DeD |
---|---|---|---|---|---|
A1 | 3 | 3 | 5 | 2 | 3 |
A2 | 3 | 3 | 3 | 2 | 2 |
A3 | 2 | 10 | 2 | 2 | 3 |
A4 | 2 | 14 | 2 | 2 | 3 |
A5 | 2 | 15 | 2 | 2 | 2 |
Overall | 2/5 | 2/5 | 1/5 | 0/5 | 3/5 |
Dataset | CH index | KL index | Silhouette index | GAP index | DeD |
---|---|---|---|---|---|
B1 | 2 | 2 | 2 | 2 | 3 |
B2 | 2 | 2 | 2 | 2 | 2 |
B3 | 2 | 2 | 2 | 2 | 3 |
B4 | 2 | 2 | 2 | 2 | 3 |
B5 | 2 | 2 | 2 | 2 | 2 |
Overall | 0/5 | 0/5 | 0/5 | 0/5 | 3/5 |
4.2 Real-World Datasets
Dataset | No. of instances | No. of attributes | No. of clusters |
---|---|---|---|
Face images (FI) | 100 | 90 | 10 |
Iris (IR) | 150 | 5 | 3 |
Wine (WI) | 178 | 13 | 3 |
Seed (SE) | 210 | 8 | 3 |
Flame (FL) | 240 | 3 | 2 |
Pathbased (PA) | 300 | 3 | 3 |
Spiral (SP) | 312 | 3 | 3 |
Stampout (ST) | 340 | 10 | 2 |
Jain (JA) | 373 | 3 | 3 |
R15 (R15) | 600 | 3 | 15 |
Breast Cancer (BC) | 699 | 10 | 2 |
Pima (PI) | 768 | 9 | 2 |
Aggregation (AG) | 788 | 3 | 7 |
Pen (PE) | 809 | 17 | 2 |
Dim032 (D32) | 1024 | 33 | 16 |
Shapes (SH) | 5000 | 3 | 4 |
LandArea (LA) | 10,546 | 29 | 6 |
Shuttle (SL) | 24,917 | 10 | 2 |
Dataset | k (known) | CH index | KL index | Silhouette index | GAP index | DeD |
---|---|---|---|---|---|---|
\(k_{\rm est}\)
|
\(k_{\rm est}\)
|
\(k_{\rm est}\)
|
\(k_{\rm est}\)
|
\(k_{\rm est}\)
| ||
FI | 10 | 10 | 8 | 12 | 2 | 8 |
IR | 3 | 3 | 18 | 2 | 2 | 3 |
WI | 3 | 4 | 11 | 2 | 2 | 2 |
SE | 3 | 3 | 3 | 2 | 2 | 3 |
FL | 2 | 8 | 18 | 4 | 2 | 2 |
PA | 3 | 17 | 19 | 3 | 2 | 2 |
SP | 3 | 17 | 3 | 20 | 2 | 2 |
ST | 2 | 2 | 7 | 2 | 2 | 2 |
JA | 2 | 10 | 10 | 2 | 2 | 2 |
R15 | 15 | 19 | 19 | 16 | 2 | 15 |
BC | 2 | 2 | 2 | 2 | 2 | 3 |
PI | 2 | 3 | 11 | 2 | 2 | 2 |
AG | 7 | 20 | 16 | 4 | 2 | 7 |
PE | 2 | 3 | 4 | 3 | 2 | 2 |
D32 | 16 | 17 | 15 | 15 | 2 | 11 |
SH | 4 | 9 | 4 | 4 | 2 | 4 |
LA | 6 | 2 | 11 | 2 | 2 | 10 |
SL | 2 | Err. | Err. | Err. | 2 | 2 |
Overall | 5/18 (28%) | 4/18 (22%) | 6/18 (33%) | 7/18 (39%) | 11/18 (61%) |
4.3 Relative Error
Dataset | CH index | KL index | Silhouette index | GAP index | DeD |
---|---|---|---|---|---|
FI | 0.00 | 0.20 | 0.20 | 0.80 | 0.20 |
IR | 0.00 | 5.00 | 0.33 | 0.33 | 0.00 |
WI | 0.33 | 2.67 | 0.33 | 0.33 | 0.33 |
SE | 0.00 | 0.00 | 0.33 | 0.33 | 0.00 |
FL | 3.00 | 8.00 | 1.00 | 0.00 | 0.00 |
PA | 4.67 | 5.33 | 0.00 | 0.33 | 0.33 |
SP | 4.67 | 0.00 | 5.67 | 0.33 | 0.33 |
ST | 0.00 | 2.50 | 0.00 | 0.00 | 0.00 |
JA | 4.00 | 4.00 | 0.00 | 0.00 | 0.00 |
R15 | 0.27 | 0.27 | 0.07 | 0.87 | 0.00 |
BC | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 |
PI | 0.50 | 4.50 | 0.00 | 0.00 | 0.00 |
AG | 1.86 | 1.29 | 0.43 | 0.71 | 0.00 |
PE | 0.50 | 1.00 | 0.50 | 0.00 | 0.00 |
D32 | 0.06 | 0.06 | 0.06 | 0.88 | 0.31 |
SH | 1.25 | 0.00 | 0.00 | 0.50 | 0.00 |
LA | 0.67 | 0.83 | 0.67 | 0.67 | 0.66 |
SL | Err. | Err. | Err. | 0.00 | 0.00 |
Average | 1.21 | 1.98 | 0.53 | 0.34 | 0.15 |
4.4 Adjusted Rand Index (ARI)
-
\(n_{ij}=|S_{i} \cap S_{j}|,\)
-
\(a_{i}=\sum _{j=1}^{k}|S_{i} \cap S_{j}|\)
-
\(b_{i}=\sum _{i=1}^{k}|S_{i} \cap S_{j}|\).
Dataset | CH index | KL index | Silhouette index | GAP index | DeD |
---|---|---|---|---|---|
FI | 0.81 | 0.78 | 0.90 | 0.12 | 0.78 |
IR | 0.73 | 0.23 | 0.54 | 0.54 | 0.73 |
WI | 0.13 | 0.20 | 0.07 | 0.07 | 0.07 |
SE | 0.71 | 0.71 | 0.47 | 0.47 | 0.72 |
FL | 0.21 | 0.09 | 0.43 | 0.45 | 0.45 |
PA | 0.22 | 0.20 | 0.46 | 0.40 | 0.40 |
SP | 0.11 | − 0.01 | 0.11 | 0.00 | 0.00 |
ST | 0.10 | 0.07 | 0.10 | 0.10 | 0.10 |
JA | 0.13 | 0.13 | 0.32 | 0.32 | 0.32 |
R15 | 0.92 | 0.92 | 0.88 | 0.12 | 0.90 |
BC | 0.84 | 0.84 | 0.84 | 0.84 | 0.79 |
PI | 0.05 | 0.04 | 0.07 | 0.07 | 0.07 |
AG | 0.33 | 0.42 | 0.76 | 0.35 | 0.74 |
PE | 0.06 | 0.02 | 0.06 | − 0.01 | 0.34 |
D32 | 0.86 | 0.82 | 0.82 | 0.10 | 0.70 |
SH | 0.65 | 0.91 | 0.91 | 0.49 | 1.00 |
LA | 0.01 | 0.07 | 0.01 | 0.01 | 0.06 |
SL | Err. | Err. | Err. | 0.00 | 0.20 |
Average | 0.38 | 0.36 | 0.43 | 0.27 | 0.47 |
4.5 Computation Times
Dataset | CH index | KL index | Silhouette index | GAP index | DeD |
---|---|---|---|---|---|
FI | 0.83 | 1.21 | 0.65 | 3.64 | 0.60 |
IR | 0.22 | 0.25 | 0.60 | 0.86 | 0.28 |
WI | 0.26 | 0.32 | 0.72 | 1.16 | 0.25 |
SE | 0.21 | 0.27 | 0.87 | 0.95 | 0.27 |
FL | 0.19 | 0.22 | 0.95 | 0.81 | 0.25 |
PA | 0.22 | 0.26 | 1.24 | 0.82 | 0.24 |
SP | 0.23 | 0.26 | 1.29 | 0.84 | 0.29 |
ST | 0.35 | 0.42 | 1.48 | 1.21 | 0.29 |
JA | 0.24 | 0.28 | 1.53 | 0.88 | 0.29 |
R15 | 0.45 | 0.50 | 2.87 | 0.93 | 0.29 |
BC | 1.62 | 1.74 | 4.73 | 2.52 | 0.32 |
PI | 0.91 | 0.97 | 4.54 | 1.86 | 0.31 |
AG | 0.68 | 0.76 | 4.24 | 1.02 | 0.28 |
PE | 1.10 | 1.27 | 5.00 | 2.98 | 0.33 |
D32 | 2.03 | 2.22 | 7.36 | 6.54 | 0.48 |
SH | 30.71 | 30.85 | 116.20 | 3.61 | 0.58 |
LA | 198 | 192 | 610 | 66 | 2.28 |
SL | Err. | Err. | Err. | 86 | 2.14 |