1 Introduction
- has a trivial implementation (under 20 lines of code in Python),
- learns rapidly,
- is well suited for unbalanced problems,
- constructs nonlinear hypotheses,
- scales very well (better than SVM, LS-SVM and ELM),
- has a few hyperparameters which are easy to optimize.
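The claim of an implementation in under 20 lines of Python can be illustrated with a rough sketch (not the authors' code): random hidden projection as in ELM, followed by a closed-form Fisher-discriminant-style solve on the two class covariances. Plain shrinkage regularization stands in for the Ledoit-Wolf estimator, and a single midpoint threshold stands in for the full entropy-based decision rule; `train_eem`, `predict` and all parameter choices are illustrative assumptions.

```python
import numpy as np

def train_eem(X_pos, X_neg, h=100, seed=0):
    # random hidden layer, as in ELM: fixed random weights and biases
    rng = np.random.RandomState(seed)
    d = X_pos.shape[1]
    W, b = rng.normal(size=(d, h)), rng.normal(size=h)
    phi = lambda X: 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid activations
    Hp, Hn = phi(X_pos), phi(X_neg)
    mp, mn = Hp.mean(axis=0), Hn.mean(axis=0)
    # shrinkage-regularized class covariances (stand-in for Ledoit-Wolf)
    cov = lambda H: np.cov(H, rowvar=False) + 1e-2 * np.eye(h)
    # closed-form solution for the linear operator, Fisher-discriminant style
    beta = np.linalg.solve(cov(Hp) + cov(Hn), mp - mn)
    threshold = 0.5 * (beta @ mp + beta @ mn)  # simplistic one-threshold rule
    return phi, beta, threshold

def predict(model, X):
    phi, beta, threshold = model
    return np.where(phi(X) @ beta >= threshold, 1, -1)
```

On two well-separated Gaussian blobs this sketch reaches near-perfect accuracy, which is all it is meant to show; the paper's actual model adds the entropy-based criterion and proper covariance estimation.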
2 Preliminaries
2.1 Extreme learning machines
2.1.1 Optimization problem: extreme learning machine
2.2 Support vector machines and least squares support vector machines
2.2.1 Optimization problem: support vector machine
2.2.2 Optimization problem: kernel support vector machine
2.2.3 Optimization problem: least squares support vector machine
3 Extreme entropy machines
- regression based (as in neural networks or ELM),
- probabilistic (as in Naive Bayes),
- geometric (as in SVM),
- information theoretic (entropy models).
- true datasets are discrete, so we do not know the actual densities f and g,
- statistical density estimators require rather large sample sizes and are computationally expensive.
Dataset | 1 | 10 | 100 | 200 | 500
---|---|---|---|---|---
australian | 0.928 | \(-\)0.022 | 0.295 | 0.161 | 0.235
breast-cancer | 0.628 | 0.809 | 0.812 | 0.858 | 0.788
diabetes | \(-\)0.983 | \(-\)0.976 | \(-\)0.941 | \(-\)0.982 | \(-\)0.952
german.numer | 0.916 | 0.979 | 0.877 | 0.873 | 0.839
heart | 0.964 | 0.829 | 0.931 | 0.91 | 0.893
ionosphere | 0.999 | 0.988 | 0.98 | 0.978 | 0.984
liver disorders | 0.232 | 0.308 | 0.363 | 0.33 | 0.312
sonar | \(-\)0.31 | \(-\)0.542 | \(-\)0.41 | \(-\)0.407 | \(-\)0.381
splice | \(-\)0.284 | \(-\)0.036 | \(-\)0.165 | \(-\)0.118 | \(-\)0.101
abalone7 | 1.0 | 0.999 | 0.999 | 0.999 | 0.998
arythmia | 1.0 | 1.0 | 0.999 | 1.0 | 1.0
balance | 1.0 | 0.998 | 0.998 | 0.999 | 0.998
car evaluation | 1.0 | 0.998 | 0.998 | 0.997 | 0.997
ecoli | 0.964 | 0.994 | 0.995 | 0.998 | 0.995
libras move | 1.0 | 0.999 | 0.999 | 1.0 | 1.0
oil spill | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
sick euthyroid | 1.0 | 0.999 | 1.0 | 1.0 | 1.0
solar flare | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
spectrometer | 1.0 | 1.0 | 0.999 | 0.999 | 0.999
forest cover | 0.988 | 0.997 | 0.997 | 0.992 | 0.988
isolet | 0.784 | 1.0 | 0.997 | 0.997 | 0.999
mammography | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
protein homology | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
webpages | 1.0 | 1.0 | 1.0 | 0.999 | 0.999
3.1 Optimization problem: extreme entropy machine
- For the Extreme Entropy Machine (EEM), we use the random projection technique, exactly the same as the one used in the ELM. In other words, given some generalized activation function \(\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} ,b) : \mathcal {X}\times \mathcal {X}\times \mathbb {R} \rightarrow \mathbb {R}\) and a constant h denoting the number of hidden neurons:
  $$\begin{aligned} \varphi : \mathcal {X}\ni {\mathbf{x}} \rightarrow [\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _1,b_1),\dots ,\mathrm {G}({\mathbf{x}} ,{\mathbf{w}} _h,b_h)]^\mathrm{T} \in \mathbb {R}^h, \end{aligned}$$
  where \({\mathbf{w}} _i\) are random vectors and \(b_i\) are random biases.
- For the Extreme Entropy Kernel Machine (EEKM), we use the randomized kernel approximation technique [9], which spans our Hilbert space on a randomly selected subset of training vectors. In other words, given a valid kernel \(\mathrm {K}(\cdot ,\cdot ) : \mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}_+\) and the size of the kernel space base h:
  $$\begin{aligned} \varphi _\mathrm {K}: \mathcal {X}\ni {\mathbf{x}} \rightarrow (\mathrm {K}({\mathbf{x}} ,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2})^\mathrm{T} \in \mathbb {R}^h, \end{aligned}$$
  where \(\mathbf{X}^{[h]}\) is an h-element random subset of \(\mathbf{X}\). It is easy to verify that such a low-rank approximation truly behaves as a kernel, in the sense that for \(\varphi _\mathrm {K}({\mathbf{x}} _i), \varphi _\mathrm {K}({\mathbf{x}} _j) \in \mathbb {R}^{h}\) and the true kernel projection \(\phi _\mathrm {K}\) satisfying \(\mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j)=\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j)\), we have
  $$\begin{aligned}&\varphi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\varphi _\mathrm {K}({\mathbf{x}} _j) \\&\quad = ((\mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2})^\mathrm{T})^\mathrm{T} \\&\qquad \times ( \mathrm {K}({\mathbf{x}} _j,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} )^\mathrm{T} \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \\&\qquad \times ( \mathrm {K}({\mathbf{x}} _j,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} )^\mathrm{T} \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \\&\qquad \times \mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2} \mathrm {K}^\mathrm{T}({\mathbf{x}} _j,\mathbf{X}^{[h]}) \\&\quad = \mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1} \mathrm {K}(\mathbf{X}^{[h]},{\mathbf{x}} _j), \end{aligned}$$
  and further
  $$\begin{aligned}&\mathrm {K}({\mathbf{x}} _i,\mathbf{X}^{[h]})\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1} \mathrm {K}(\mathbf{X}^{[h]},{\mathbf{x}} _j) \\&\quad =\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}) \\&\qquad \times (\phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}))^{-1}\\&\qquad \times \phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) \\&\quad =\phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}(\mathbf{X}^{[h]}) \phi _\mathrm {K}(\mathbf{X}^{[h]})^{-1}\\&\qquad \times (\phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T})^{-1} \phi _\mathrm {K}(\mathbf{X}^{[h]})^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j) \\&\quad = \phi _\mathrm {K}({\mathbf{x}} _i)^\mathrm{T}\phi _\mathrm {K}({\mathbf{x}} _j)\\&\quad = \mathrm {K}({\mathbf{x}} _i,{\mathbf{x}} _j). \end{aligned}$$
  Thus, for the whole samples' set, we have
  $$\begin{aligned} \varphi _\mathrm {K}(\mathbf{X})^\mathrm{T} \varphi _\mathrm {K}(\mathbf{X}) = \mathrm {K}(\mathbf{X},\mathbf{X}), \end{aligned}$$
  which is a complete Gram matrix.
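As a numerical sanity check on this derivation, the low-rank feature map can be implemented directly; the sketch below (function names are ours, not the paper's) computes \(\mathrm {K}(\mathbf{X}^{[h]},\mathbf{X}^{[h]})^{-1/2}\) via an eigendecomposition. When \(\mathbf{X}^{[h]}\) is taken to be the whole sample, the reconstruction recovers the complete Gram matrix exactly, as the argument above predicts.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def nystrom_features(X, X_h, kernel):
    # phi_K(x) = (K(x, X^[h]) K(X^[h], X^[h])^{-1/2})^T; rows are samples
    K_hh = kernel(X_h, X_h)
    vals, vecs = np.linalg.eigh(K_hh)  # K_hh is symmetric PSD
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    return kernel(X, X_h) @ inv_sqrt

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
# with X^[h] = X, the product of the feature matrix with its transpose
# reproduces the complete Gram matrix K(X, X)
Phi = nystrom_features(X, X, rbf_kernel)
```

For a proper h-element subset the product only approximates the Gram matrix, which is exactly the trade-off EEKM exploits.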
4 Theory: density estimation in the kernel case
5 Theory: learning capabilities
6 Practical considerations
- feature projection function \(\varphi\),
- linear operator \(\varvec{\beta }\),
- classification rule \(\mathrm {F}\).
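These three components compose into the final classifier, applied as the decision rule on the projected, linearly transformed input. A toy sketch of this composition (the concrete components below are placeholders, not the paper's):

```python
import numpy as np

def make_classifier(phi, beta, F):
    # the model is fully specified by the triple (phi, beta, F):
    # project the input, apply the linear operator, then the decision rule
    return lambda X: F(phi(X) @ beta)

# placeholder components: identity projection, a fixed linear operator,
# and a single-threshold sign rule
clf = make_classifier(lambda X: X, np.array([1.0, -1.0]), np.sign)
labels = clf(np.array([[3.0, 1.0], [0.0, 2.0]]))  # -> [1., -1.]
```

Storing only these three components is what yields the small resulting model complexity listed in the comparison table.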
 | ELM | SVM | LS-SVM | EE(K)M
---|---|---|---|---
Optimization method | Linear regression | Quadratic programming | Linear system | Fisher discriminant
Nonlinearity | Random projection | Kernel | Kernel | Random (kernel) projection
Closed-form | Yes | No | Yes | Yes
Balanced | No\(^{\mathrm{a}}\) | No\(^{\mathrm{a}}\) | No\(^{\mathrm{a}}\) | Yes
Regression | Yes | No\(^{\mathrm{a}}\) | Yes | No
Criterion | Mean squared error | Hinge loss | Mean squared error | Entropy optimization
Learning theory | Huang et al. [11] | Vapnik et al. [4] | Suykens et al. [23] | This paper
No. of thresholds | 1 | 1 | 1 | 1 or 2
Problem type | Regression | Classification | Regression | Classification
Model learning | Discriminative | Discriminative | Discriminative | Generative
Direct probability estimates | No | No | No | Yes
Training complexity | \(\mathcal {O}(Nh^2)\) | \(\mathcal {O}(N^3)\) | \(\mathcal {O}(N^{2.34})\) | \(\mathcal {O}(Nh^2)\)
Resulting model complexity | \(hd\) | \(|SV|d\), \(|SV|\ll N\) | \(Nd+1\) | \(hd+4\)
Memory requirements | \(\mathcal {O}(Nd)\) | \(\mathcal {O}(Nd)\) | \(\mathcal {O}(N^2)\) | \(\mathcal {O}(Nd)\)
Source of regularization | Moore–Penrose pseudoinverse | Margin maximization | Quadratic loss penalty term | Ledoit–Wolf estimator
Hyperparameters | \(h\), \(\mathrm {G}\) | \(C\), \(\mathrm {K}\) | \(C\), \(\mathrm {K}\) | \(h\), \(\mathrm {G}\) or \(h\), \(\mathrm {K}\)
Number of classes | Any | 2 | 2 | 2
7 Evaluation
- sigmoid (sig): \(\tfrac{1}{1+\exp (-\langle {\mathbf{w}} ,{\mathbf{x}} \rangle + b)}\),
- normalized sigmoid (nsig): \(\tfrac{1}{1+\exp (-\langle {\mathbf{w}} ,{\mathbf{x}} \rangle /d + b)}\),
- radial basis function (rbf): \(\exp (-b \Vert {\mathbf{w}} - {\mathbf{x}} \Vert ^2 )\).
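For concreteness, the three activations can be written down directly; this sketch assumes the bias placement exactly as in the formulas above, with `d` the input dimension:

```python
import numpy as np

def sig(w, x, b):
    # sigmoid: 1 / (1 + exp(-<w, x> + b))
    return 1.0 / (1.0 + np.exp(-np.dot(w, x) + b))

def nsig(w, x, b):
    # normalized sigmoid: the inner product is scaled by the dimension d
    d = len(x)
    return 1.0 / (1.0 + np.exp(-np.dot(w, x) / d + b))

def rbf(w, x, b):
    # radial basis function centered at the random vector w
    return np.exp(-b * np.linalg.norm(w - x) ** 2)
```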
Dataset | \(d\) | \(|\mathbf{X}^-|\) | \(|\mathbf{X}^+|\)
---|---|---|---
australian | 14 | 383 | 307
breast cancer | 9 | 444 | 239
diabetes | 8 | 268 | 500
german numer | 24 | 700 | 300
heart | 13 | 150 | 120
liver disorders | 6 | 145 | 200
sonar | 60 | 111 | 97
splice | 60 | 483 | 517
abalone7 | 10 | 3786 | 391
arythmia | 261 | 427 | 25
car evaluation | 21 | 1594 | 134
ecoli | 7 | 301 | 35
libras move | 90 | 336 | 24
oil spill | 48 | 896 | 41
sick euthyroid | 42 | 2870 | 293
solar flare | 32 | 1321 | 68
spectrometer | 93 | 486 | 45
forest cover | 54 | 571519 | 9493
isolet | 617 | 7197 | 600
mammography | 6 | 10923 | 260
protein homology | 74 | 144455 | 1296
webpages | 300 | 33799 | 981
7.1 Basic UCI datasets
Dataset | WELM\(_{\mathrm {sig}}\) | EEM\(_{\mathrm {sig}}\) | WELM\(_{\mathrm {nsig}}\) | EEM\(_{\mathrm {nsig}}\) | WELM\(_{\mathrm {rbf}}\) | EEM\(_{\mathrm {rbf}}\) | LS-SVM\(_{\mathrm {rbf}}\) | EEKM\(_{\mathrm {rbf}}\) | SVM\(_{\mathrm {rbf}}\)
---|---|---|---|---|---|---|---|---|---
australian | 86.3 \(\pm\,4.5\) | **87.0 \(\pm\,4.0\)** | 85.9 \(\pm\,4.4\) | 86.5 \(\pm\,3.2\) | 85.8 \(\pm\,4.9\) | 86.9 \(\pm\,4.4\) | 86.9 \(\pm\,4.1\) | 86.8 \(\pm\,3.8\) | 86.8 \(\pm\,3.7\)
breast-cancer | 96.9 \(\pm\,1.7\) | 97.3 \(\pm\,1.2\) | 97.6 \(\pm\,1.5\) | 97.4 \(\pm\,1.2\) | 96.6 \(\pm\,1.8\) | 97.3 \(\pm\,1.1\) | 97.6 \(\pm\,1.3\) | **97.8 \(\pm\,1.1\)** | 96.8 \(\pm\,1.7\)
diabetes | 74.2 \(\pm\,4.6\) | 74.5 \(\pm\,4.6\) | 74.1 \(\pm\,5.5\) | 74.9 \(\pm\,5.0\) | 73.2 \(\pm\,5.6\) | 74.9 \(\pm\,5.9\) | 75.5 \(\pm\,5.6\) | **75.7 \(\pm\,5.6\)** | 74.8 \(\pm\,3.5\)
german | 68.8 \(\pm\,6.9\) | 71.3 \(\pm\,4.1\) | 70.7 \(\pm\,6.1\) | 72.4 \(\pm\,5.4\) | 71.1 \(\pm\,6.1\) | 72.2 \(\pm\,5.7\) | 73.2 \(\pm\,4.5\) | 72.9 \(\pm\,5.3\) | **73.4 \(\pm\,5.4\)**
heart | 78.8 \(\pm\,6.3\) | 82.5 \(\pm\,7.4\) | 78.1 \(\pm\,7.0\) | 83.7 \(\pm\,7.2\) | 80.2 \(\pm\,8.9\) | 81.9 \(\pm\,6.9\) | 83.7 \(\pm\,8.5\) | 83.6 \(\pm\,7.5\) | **84.6 \(\pm\,7.0\)**
ionosphere | 71.5 \(\pm\,9.5\) | 77.0 \(\pm\,12.8\) | 82.7 \(\pm\,7.8\) | 84.6 \(\pm\,9.1\) | 85.6 \(\pm\,8.4\) | 90.8 \(\pm\,5.2\) | 91.2 \(\pm\,5.5\) | 93.4 \(\pm\,4.3\) | **94.7 \(\pm\,3.9\)**
liver disorders | 68.1 \(\pm\,8.0\) | 68.6 \(\pm\,8.9\) | 66.3 \(\pm\,8.2\) | 62.1 \(\pm\,8.1\) | 67.2 \(\pm\,5.9\) | 71.4 \(\pm\,7.0\) | 71.1 \(\pm\,8.3\) | 70.2 \(\pm\,6.9\) | **72.3 \(\pm\,6.2\)**
sonar | 66.7 \(\pm\,10.1\) | 70.1 \(\pm\,11.5\) | 80.2 \(\pm\,7.4\) | 78.3 \(\pm\,11.2\) | 83.2 \(\pm\,6.9\) | 82.8 \(\pm\,5.2\) | 86.5 \(\pm\,5.4\) | **87.0 \(\pm\,7.5\)** | 83.0 \(\pm\,7.1\)
splice | 64.7 \(\pm\,2.8\) | 49.4 \(\pm\,5.5\) | 81.8 \(\pm\,3.2\) | 80.9 \(\pm\,2.7\) | 75.5 \(\pm\,3.9\) | 82.2 \(\pm\,3.5\) | **89.9 \(\pm\,3.0\)** | 88.0 \(\pm\,4.0\) | 88.0 \(\pm\,2.2\)
abalone7 | 79.7 \(\pm\,2.3\) | 79.8 \(\pm\,3.5\) | 80.0 \(\pm\,2.8\) | 76.1 \(\pm\,3.7\) | 80.1 \(\pm\,3.2\) | 79.7 \(\pm\,3.6\) | **80.2 \(\pm\,3.4\)** | 79.9 \(\pm\,3.4\) | 79.7 \(\pm\,2.7\)
arythmia | 28.3 \(\pm\,35.4\) | 40.3 \(\pm\,20.9\) | 64.2 \(\pm\,24.6\) | **85.6 \(\pm\,10.3\)** | 66.9 \(\pm\,25.8\) | 79.4 \(\pm\,12.5\) | 84.4 \(\pm\,10.0\) | 85.2 \(\pm\,10.6\) | 80.9 \(\pm\,11.8\)
car evaluation | 99.1 \(\pm\,0.3\) | 98.9 \(\pm\,0.4\) | 99.0 \(\pm\,0.3\) | 97.9 \(\pm\,0.6\) | 99.0 \(\pm\,0.3\) | 98.5 \(\pm\,0.3\) | 99.5 \(\pm\,0.2\) | 99.2 \(\pm\,0.3\) | **100.0 \(\pm\,0.0\)**
ecoli | 86.9 \(\pm\,6.5\) | 88.3 \(\pm\,7.1\) | 86.9 \(\pm\,6.8\) | 88.6 \(\pm\,6.9\) | 86.4 \(\pm\,7.0\) | 88.8 \(\pm\,7.2\) | 89.2 \(\pm\,6.3\) | **89.4 \(\pm\,6.9\)** | 88.5 \(\pm\,6.2\)
libras move | 65.5 \(\pm\,10.7\) | 19.3 \(\pm\,8.1\) | 82.5 \(\pm\,12.0\) | 93.0 \(\pm\,11.8\) | 89.6 \(\pm\,11.9\) | 93.9 \(\pm\,11.9\) | 96.5 \(\pm\,8.6\) | **96.6 \(\pm\,8.7\)** | 91.6 \(\pm\,11.9\)
oil spill | 86.0 \(\pm\,6.9\) | **88.8 \(\pm\,6.5\)** | 83.8 \(\pm\,7.6\) | 84.7 \(\pm\,8.7\) | 85.8 \(\pm\,9.3\) | 88.1 \(\pm\,6.1\) | 86.7 \(\pm\,8.4\) | 87.2 \(\pm\,4.9\) | 85.7 \(\pm\,11.4\)
sick euthyroid | 88.1 \(\pm\,1.7\) | 87.9 \(\pm\,2.4\) | 88.5 \(\pm\,2.1\) | 81.7 \(\pm\,2.7\) | 89.1 \(\pm\,1.9\) | 88.2 \(\pm\,2.4\) | 89.5 \(\pm\,1.7\) | 89.3 \(\pm\,1.9\) | **90.9 \(\pm\,2.0\)**
solar flare | 60.4 \(\pm\,16.8\) | 63.7 \(\pm\,12.9\) | 61.3 \(\pm\,10.8\) | 67.4 \(\pm\,9.0\) | 60.3 \(\pm\,14.8\) | 68.9 \(\pm\,9.3\) | 67.3 \(\pm\,8.8\) | 67.3 \(\pm\,9.0\) | **70.9 \(\pm\,8.5\)**
spectrometer | 82.9 \(\pm\,13.0\) | 87.3 \(\pm\,7.8\) | 88.0 \(\pm\,10.8\) | 90.2 \(\pm\,8.6\) | 86.6 \(\pm\,8.2\) | 93.0 \(\pm\,14.6\) | 94.6 \(\pm\,8.4\) | 93.5 \(\pm\,14.7\) | **95.4 \(\pm\,5.1\)**
forest cover | 90.8 \(\pm\,0.3\) | 90.5 \(\pm\,0.3\) | 90.7 \(\pm\,0.3\) | 85.1 \(\pm\,0.4\) | 90.9 \(\pm\,0.3\) | 87.1 \(\pm\,0.0\) | – | **91.8 \(\pm\,0.3\)** | –
isolet | 0.0 \(\pm\,0.0\) | 0.0 \(\pm\,0.0\) | 96.3 \(\pm\,0.7\) | 95.6 \(\pm\,1.1\) | 93.0 \(\pm\,0.9\) | 91.4 \(\pm\,1.0\) | **98.0 \(\pm\,0.7\)** | 97.4 \(\pm\,0.6\) | 97.6 \(\pm\,0.6\)
mammography | 90.4 \(\pm\,2.8\) | 89.0 \(\pm\,3.2\) | 90.7 \(\pm\,3.3\) | 87.2 \(\pm\,3.0\) | 89.9 \(\pm\,3.8\) | 89.5 \(\pm\,3.1\) | **91.0 \(\pm\,3.1\)** | 89.5 \(\pm\,3.1\) | 89.8 \(\pm\,3.8\)
protein homology | 95.3 \(\pm\,0.8\) | 94.9 \(\pm\,0.8\) | 95.1 \(\pm\,0.9\) | 94.2 \(\pm\,1.3\) | 95.0 \(\pm\,1.0\) | 95.1 \(\pm\,1.1\) | – | **95.7 \(\pm\,0.9\)** | –
webpages | 72.0 \(\pm\,0.0\) | 73.1 \(\pm\,2.0\) | 93.0 \(\pm\,1.8\) | **93.1 \(\pm\,1.7\)** | 86.7 \(\pm\,0.0\) | 84.4 \(\pm\,1.6\) | – | **93.1 \(\pm\,1.7\)** | **93.1 \(\pm\,1.7\)**
7.2 Highly unbalanced datasets
Dataset | WELM\(_{\mathrm {sig}}\) | EEM\(_{\mathrm {sig}}\) | WELM\(_{\mathrm {nsig}}\) | EEM\(_{\mathrm {nsig}}\) | WELM\(_{\mathrm {rbf}}\) | EEM\(_{\mathrm {rbf}}\) | LS-SVM\(_{\mathrm {rbf}}\) | EEKM\(_{\mathrm {rbf}}\) | SVM\(_{\mathrm {rbf}}\)
---|---|---|---|---|---|---|---|---|---
abalone7 | 1.9 | 1.2 | 2.5 | 1.6 | 1.8 | **1.2** | 20.8 | 1.9 | 4.7
arythmia | 0.2 | 0.7 | 0.3 | 0.9 | 0.3 | 0.7 | **0.1** | 0.3 | 0.1
car evaluation | 1.3 | 0.9 | 1.5 | 1.0 | 1.1 | 0.9 | 2.0 | 1.4 | **0.1**
ecoli | 0.2 | 0.8 | 0.2 | 0.8 | 0.1 | 0.7 | **0.0** | 0.1 | 0.2
libras move | 0.2 | 0.9 | 0.2 | 0.8 | 0.1 | 0.7 | 0.0 | 0.1 | **0.0**
oil spill | 0.7 | 0.8 | 0.6 | 0.8 | 0.6 | 0.8 | 0.4 | 0.9 | **0.1**
sick euthyroid | 1.5 | 1.1 | 1.4 | **1.1** | 1.5 | 1.1 | 9.6 | 1.7 | 21.0
solar flare | **0.7** | 0.8 | 0.7 | 0.8 | 0.8 | 0.8 | 1.1 | 1.3 | 16.1
spectrometer | 0.2 | 0.7 | 0.3 | 0.7 | 0.2 | 0.7 | 0.1 | 0.3 | **0.0**
forest cover | 110.7 | 104.6 | 144.9 | 45.6 | 111.3 | **38.2** | \({>}600\) | 107.4 | \({>}600\)
isolet | 9.7 | 4.5 | 4.9 | 3.0 | 3.4 | **2.1** | 126.9 | 3.2 | 53.5
mammography | 4.0 | **2.2** | 6.1 | 3.0 | 4.0 | 2.2 | 327.3 | 3.3 | 9.5
protein homology | 27.6 | **21.6** | 86.3 | 27.9 | 62.5 | 22.0 | \({>}600\) | 30.7 | \({>}600\)
webpages | 16.0 | **6.2** | 14.5 | 8.5 | 7.1 | 6.4 | \({>}600\) | 9.0 | 217.0
7.3 Extremely unbalanced datasets
7.4 Entropy-based hyperparameters optimization
(a) \({D}_\mathrm {CS}(\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^+, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^+ \varvec{\beta }),\mathcal {N}(\varvec{\beta }^\mathrm{T} {\mathbf{m}}^-, \varvec{\beta }^\mathrm{T} {\mathbf{\Sigma }}^- \varvec{\beta }))\)

Dataset | WELM\(_{\mathrm {sig}}\) | EEM\(_{\mathrm {sig}}\) | WELM\(_{\mathrm {nsig}}\) | EEM\(_{\mathrm {nsig}}\) | WELM\(_{\mathrm {rbf}}\) | EEM\(_{\mathrm {rbf}}\) | LS-SVM\(_{\mathrm {rbf}}\) | EEKM\(_{\mathrm {rbf}}\) | SVM\(_{\mathrm {rbf}}\)
---|---|---|---|---|---|---|---|---|---
australian | 51.2 \(\pm\,7.5\) | 86.3 \(\pm\,4.8\) | 50.3 \(\pm\,6.4\) | **86.5 \(\pm\,3.2\)** | 50.3 \(\pm\,8.5\) | 86.2 \(\pm\,5.3\) | 58.5 \(\pm\,7.9\) | 85.2 \(\pm\,5.6\) | 85.7 \(\pm\,4.7\)
breast-cancer | 83.0 \(\pm\,4.3\) | 97.0 \(\pm\,1.6\) | 72.0 \(\pm\,6.6\) | 97.1 \(\pm\,1.9\) | 77.3 \(\pm\,5.3\) | 97.3 \(\pm\,1.1\) | 79.2 \(\pm\,7.7\) | 96.9 \(\pm\,1.4\) | **97.5 \(\pm\,1.2\)**
diabetes | 52.3 \(\pm\,4.7\) | 74.4 \(\pm\,4.0\) | 51.7 \(\pm\,4.0\) | **74.7 \(\pm\,5.2\)** | 52.1 \(\pm\,3.7\) | 73.5 \(\pm\,5.9\) | 60.1 \(\pm\,4.2\) | 72.2 \(\pm\,5.4\) | 73.2 \(\pm\,5.9\)
german | 57.1 \(\pm\,4.0\) | 69.3 \(\pm\,5.0\) | 51.7 \(\pm\,3.0\) | **72.4 \(\pm\,5.4\)** | 52.8 \(\pm\,6.3\) | 70.9 \(\pm\,6.9\) | 55.0 \(\pm\,4.3\) | 67.8 \(\pm\,5.7\) | 60.5 \(\pm\,4.5\)
heart | 68.6 \(\pm\,5.8\) | 79.4 \(\pm\,6.9\) | 65.6 \(\pm\,5.9\) | **82.9 \(\pm\,7.4\)** | 60.3 \(\pm\,9.4\) | 77.4 \(\pm\,7.2\) | 66.2 \(\pm\,4.2\) | 77.7 \(\pm\,7.0\) | 76.5 \(\pm\,6.6\)
ionosphere | 62.7 \(\pm\,10.6\) | 77.0 \(\pm\,12.8\) | 68.5 \(\pm\,5.1\) | 84.6 \(\pm\,9.1\) | 69.5 \(\pm\,9.6\) | 90.8 \(\pm\,5.2\) | 72.8 \(\pm\,6.1\) | 93.4 \(\pm\,4.2\) | **94.7 \(\pm\,3.9\)**
liver disorders | 53.2 \(\pm\,7.0\) | 68.5 \(\pm\,6.7\) | 52.2 \(\pm\,11.8\) | 62.1 \(\pm\,8.1\) | 53.9 \(\pm\,8.0\) | **71.4 \(\pm\,7.0\)** | 62.9 \(\pm\,7.8\) | 69.6 \(\pm\,8.2\) | 66.9 \(\pm\,8.0\)
sonar | 66.3 \(\pm\,6.1\) | 66.1 \(\pm\,15.0\) | 80.2 \(\pm\,7.4\) | 76.9 \(\pm\,5.2\) | 83.2 \(\pm\,6.9\) | 82.8 \(\pm\,5.2\) | 85.9 \(\pm\,4.9\) | **87.7 \(\pm\,6.1\)** | 86.6 \(\pm\,3.3\)
splice | 51.8 \(\pm\,4.3\) | 49.4 \(\pm\,5.5\) | 64.9 \(\pm\,3.1\) | 80.2 \(\pm\,2.6\) | 60.8 \(\pm\,3.5\) | 82.2 \(\pm\,3.5\) | **89.7 \(\pm\,3.3\)** | 88.0 \(\pm\,4.0\) | 89.5 \(\pm\,2.9\)

(b) \({D}_\mathrm {CS}([\![ \varvec{\beta }^\mathrm{T} {\mathbf{h}}^+ ]\!],[\![\varvec{\beta }^\mathrm{T} {\mathbf{h}}^- ]\!])\)

Dataset | WELM\(_{\mathrm {sig}}\) | EEM\(_{\mathrm {sig}}\) | WELM\(_{\mathrm {nsig}}\) | EEM\(_{\mathrm {nsig}}\) | WELM\(_{\mathrm {rbf}}\) | EEM\(_{\mathrm {rbf}}\) | LS-SVM\(_{\mathrm {rbf}}\) | EEKM\(_{\mathrm {rbf}}\) | SVM\(_{\mathrm {rbf}}\)
---|---|---|---|---|---|---|---|---|---
australian | 51.2 \(\pm\,7.5\) | 86.3 \(\pm\,4.8\) | 50.3 \(\pm\,6.4\) | **86.5 \(\pm\,3.2\)** | 50.3 \(\pm\,8.5\) | 86.2 \(\pm\,5.3\) | 58.5 \(\pm\,7.9\) | 85.2 \(\pm\,5.6\) | 84.2 \(\pm\,4.1\)
breast-cancer | 83.0 \(\pm\,4.3\) | 97.0 \(\pm\,1.6\) | 72.0 \(\pm\,6.6\) | **97.4 \(\pm\,1.2\)** | 77.3 \(\pm\,5.3\) | 97.3 \(\pm\,1.1\) | 79.3 \(\pm\,7.1\) | 96.9 \(\pm\,1.4\) | 96.3 \(\pm\,2.4\)
diabetes | 52.3 \(\pm\,4.7\) | 74.4 \(\pm\,4.0\) | 51.7 \(\pm\,4.0\) | **74.7 \(\pm\,5.2\)** | 52.1 \(\pm\,3.7\) | 73.5 \(\pm\,5.9\) | 60.1 \(\pm\,4.2\) | 72.2 \(\pm\,5.4\) | 71.9 \(\pm\,5.4\)
german | 57.1 \(\pm\,4.0\) | 69.3 \(\pm\,5.0\) | 51.7 \(\pm\,3.0\) | **71.7 \(\pm\,5.9\)** | 52.8 \(\pm\,6.3\) | 70.9 \(\pm\,6.9\) | 54.4 \(\pm\,5.7\) | 67.8 \(\pm\,5.7\) | 59.5 \(\pm\,4.2\)
heart | 60.0 \(\pm\,9.2\) | 79.4 \(\pm\,6.9\) | 65.6 \(\pm\,5.9\) | **82.9 \(\pm\,7.4\)** | 52.6 \(\pm\,9.0\) | 77.4 \(\pm\,7.2\) | 61.9 \(\pm\,5.8\) | 77.7 \(\pm\,7.0\) | 76.3 \(\pm\,7.7\)
ionosphere | 62.4 \(\pm\,8.1\) | 77.0 \(\pm\,12.8\) | 68.5 \(\pm\,5.1\) | 84.6 \(\pm\,9.1\) | 67.6 \(\pm\,9.8\) | 90.8 \(\pm\,5.2\) | 67.0 \(\pm\,10.7\) | **93.4 \(\pm\,4.2\)** | 92.3 \(\pm\,4.6\)
liver disorders | 50.9 \(\pm\,11.5\) | 68.5 \(\pm\,6.7\) | 50.4 \(\pm\,9.2\) | 62.1 \(\pm\,8.1\) | 53.9 \(\pm\,8.0\) | **71.4 \(\pm\,7.0\)** | 62.9 \(\pm\,7.8\) | 69.6 \(\pm\,8.2\) | 66.9 \(\pm\,8.0\)
sonar | 66.3 \(\pm\,6.1\) | 66.1 \(\pm\,15.0\) | 80.2 \(\pm\,7.4\) | 76.9 \(\pm\,5.2\) | 62.9 \(\pm\,9.4\) | 82.8 \(\pm\,5.2\) | 83.6 \(\pm\,4.5\) | **87.7 \(\pm\,6.1\)** | 86.6 \(\pm\,3.3\)
splice | 51.8 \(\pm\,4.3\) | 33.1 \(\pm\,6.5\) | 64.9 \(\pm\,3.1\) | 80.2 \(\pm\,2.6\) | 60.8 \(\pm\,3.5\) | 82.2 \(\pm\,3.5\) | 85.4 \(\pm\,4.1\) | 88.0 \(\pm\,4.0\) | **89.5 \(\pm\,2.9\)**
7.5 EEM stability
8 Conclusions
- information theoretic background based on differential and Renyi's quadratic entropies,
- closed-form solution of the optimization problem,
- generative training, leading to direct probability estimates,
- small number of hyperparameters,
- good classification results,
- rapid training that scales well to hundreds of thousands of examples and beyond,
- theoretical and practical similarities to large margin classifiers and the Fisher Discriminant.
- Can one construct a closed-form entropy-based classifier with distribution families other than Gaussians? It remains an open problem whether this is possible even for a convex combination of two Gaussians.
- Is there a theoretical justification for the stability of extreme learning techniques? In particular, can one show whether performing a random projection is equivalent to some prior on the decision function space, as in the case of kernels?
- Is it possible to further improve the achieved results by performing unsupervised entropy-based optimization in the hidden layer? For Gaussian nodes one could use GMM clustering techniques, but is there an efficient way of selecting nodes with different activation functions, such as ReLU?