1 Introduction
2 SUBiNN
2.1 Base-learners: kNN
2.2 Meta-learner: Lasso regression
2.3 Final model and predictions for new data
2.4 Empirical application
3 Pilot study: the choice of k
3.1 Setup
`tune` from the ‘e1071’ package (Meyer et al. 2019) was used. The range of possible k values was set from 1 to 20.

3.2 Results and conclusion
Data set | k | Accuracy | b-learners | Time
---|---|---|---|---
(a) Diabetes | 5 | 0.754 | 11.521 | 6.985
(a) Diabetes | 10 | 0.759 | 9.197 | 7.384
(a) Diabetes | \(\sqrt{n}\) | 0.761 | 6.195 | 9.172
(a) Diabetes | opt | 0.761 | 7.975 | 630.025
(b) Dystrophy | 5 | 0.892 | 6.914 | 3.248
(b) Dystrophy | 10 | 0.895 | 5.386 | 3.392
(b) Dystrophy | \(\sqrt{n}\) | 0.885 | 5.552 | 3.572
(b) Dystrophy | opt | 0.889 | 6.331 | 159.266
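
To make the trade-off in the pilot table concrete, the following Python sketch (not the authors' R code) recomputes, from the reported values only, how much more time the tuned setting (opt) takes than the \(\sqrt{n}\) rule and how much the accuracy changes. The dictionary layout and the function name are illustrative assumptions. The table does not state the time unit, so only ratios are computed.

```python
# Values copied from the pilot table: (accuracy, time) for two choices of k.
PILOT = {
    "Diabetes": {"sqrt_n": (0.761, 9.172), "opt": (0.761, 630.025)},
    "Dystrophy": {"sqrt_n": (0.885, 3.572), "opt": (0.889, 159.266)},
}


def compare_opt_to_sqrt_n(results):
    """Return {dataset: (time ratio opt / sqrt_n, accuracy of opt minus sqrt_n)}."""
    summary = {}
    for name, rows in results.items():
        acc_rule, time_rule = rows["sqrt_n"]
        acc_opt, time_opt = rows["opt"]
        summary[name] = (time_opt / time_rule, round(acc_opt - acc_rule, 3))
    return summary


if __name__ == "__main__":
    for name, (ratio, gain) in compare_opt_to_sqrt_n(PILOT).items():
        print(f"{name}: time ratio {ratio:.1f}, accuracy change {gain:+.3f}")
```

On these numbers the tuned k takes well over ten times as long as the \(\sqrt{n}\) rule for an accuracy change of at most 0.004, which is consistent with the use of \(k = \sqrt{n}\) in the simulation studies.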
4 Simulation studies
4.1 Data generation for main simulation experiments
4.2 Software implementations
`mvrnorm` from the package ‘MASS’ (Venables and Ripley 2002). Any added non-informative features are drawn from a standard normal distribution, using base R’s `rnorm` function.

`knn` from the package ‘class’ (Venables and Ripley 2002). BkNN is implemented by fitting 1001 of these kNN models, where the input data is sampled with replacement. The final prediction is a majority vote of the 1001 outcomes. For these two models we used \(k = \sqrt{n}\).

`rknn` function from the ‘rknn’ package (Li 2015). The parameters of importance are again k, r, the number of models fitted (taken again to be 1001), and mtry, the number of random features drawn at each fitting. Following Gul et al. (2016), we used a subset size of one third of the number of features, with a minimum of 2. For MFS we fit 1001 kNN models, each using a random draw of the features with replacement (again one third of the total number of features, with at least 2), and take the majority vote of the 1001 results.

`best.randomForest` from the package ‘e1071’ (Meyer et al. 2019), which also uses the function `randomForest` from the package ‘randomForest’ (Liaw and Wiener 2002). The automatic `best.randomForest` function does parameter selection without range specification, using 10-fold cross-validation, which is implemented with the argument `tunecontrol`, where `sampling = "cross"` and `cross = 10`.

`ksvm` function from the package ‘kernlab’ (Karatzoglou et al. 2004), where the kernel is set to `rbfdot` and `kpar` is set to `"automatic"` to allow for automatic optimal parameter selection.

`predict.esknnClass` function.

`cv.glmnet`. The `predict` function and the `glmnet` object are used to generate predictions for the test samples.

4.3 Results
4.3.1 Experiment 1: adding noise
Features | Statistic | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN | SUBiNN
---|---|---|---|---|---|---|---|---|---
20 | Mean | 0.039 | 0.039 | 0.043 | 0.043 | 0.044 | 0.039 | 0.055 | 0.108
20 | Std | 0.006 | 0.006 | 0.007 | 0.007 | 0.006 | 0.006 | 0.008 | 0.009
20 + 50 | Mean | 0.050 | 0.049 | 0.048 | 0.047 | 0.048 | 0.051 | 0.058 | 0.108
20 + 50 | Std | 0.008 | 0.008 | 0.007 | 0.007 | 0.007 | 0.008 | 0.009 | 0.009
20 + 100 | Mean | 0.056 | 0.055 | 0.050 | 0.050 | 0.049 | 0.055 | 0.064 | 0.107
20 + 100 | Std | 0.008 | 0.008 | 0.008 | 0.008 | 0.006 | 0.007 | 0.009 | 0.009
20 + 200 | Mean | 0.067 | 0.064 | 0.054 | 0.054 | 0.051 | 0.058 | 0.074 | 0.106
20 + 200 | Std | 0.008 | 0.008 | 0.007 | 0.008 | 0.006 | 0.008 | 0.009 | 0.008
20 + 500 | Mean | 0.097 | 0.091 | 0.064 | 0.063 | 0.054 | 0.067 | 0.100 | 0.107
20 + 500 | Std | 0.013 | 0.012 | 0.010 | 0.010 | 0.007 | 0.009 | 0.010 | 0.007

Features | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN | SUBiNN |
---|---|---|---|---|---|---|---|---|
20 | 0.0 | 35.6 | 20.8 | 21.6 | 57.3 | 3.7 | 137.2 | 40.2 |
20 + 50 | 0.1 | 103.8 | 36.9 | 37.6 | 154.6 | 8.0 | 210.5 | 409.7 |
20 + 100 | 0.2 | 166.1 | 63.2 | 63.0 | 254.8 | 12.5 | 307.6 | 1164.4 |
20 + 200 | 0.3 | 304.0 | 108.6 | 108.2 | 469.7 | 22.5 | 530.2 | 3947.4 |
20 + 500 | 2.5 | 2245.2 | 228.4 | 227.7 | 1106.4 | 58.2 | 1073.7 | 21367.2 |
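
The growth of SUBiNN's running time in the table above can be related to the number of base-learners it has to fit. The Python sketch below (not the authors' R code) assumes one base-learner per single feature and per unordered feature pair, that is \(p(p+1)/2\) for p features, which matches the benchmark table's B-learners count for purely numerical data (for example 20 features and 210 base-learners for Two norm). It also assumes that "20 + 50" denotes 70 features in total. The function names are illustrative.

```python
def n_base_learners(p):
    """Single features plus unordered feature pairs: p + p*(p-1)/2 = p*(p+1)/2."""
    return p * (p + 1) // 2


# (total number of features, SUBiNN running time) copied from the timing table.
SUBINN_TIMES = [(20, 40.2), (70, 409.7), (120, 1164.4), (220, 3947.4), (520, 21367.2)]


def time_per_learner(rows):
    """Return [(p, number of base-learners, time divided by that number)]."""
    return [(p, n_base_learners(p), t / n_base_learners(p)) for p, t in rows]


if __name__ == "__main__":
    for p, m, per in time_per_learner(SUBINN_TIMES):
        print(f"p={p:3d}  base-learners={m:6d}  time per base-learner={per:.3f}")
```

Under these assumptions the time per base-learner stays between roughly 0.16 and 0.19 across all five settings, so the steep increase in SUBiNN's running time tracks the quadratic growth in the number of base-learners.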
4.3.2 Experiment 2: varying covariance structure
w | Statistic | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN | SUBiNN
---|---|---|---|---|---|---|---|---|---
3 | Mean | 0.226 | 0.227 | 0.200 | 0.200 | 0.049 | 0.091 | 0.181 | 0.137
3 | Std | 0.014 | 0.014 | 0.013 | 0.012 | 0.007 | 0.009 | 0.011 | 0.008
5 | Mean | 0.291 | 0.292 | 0.228 | 0.228 | 0.025 | 0.091 | 0.211 | 0.096
5 | Std | 0.013 | 0.013 | 0.010 | 0.011 | 0.005 | 0.009 | 0.009 | 0.008
10 | Mean | 0.332 | 0.335 | 0.198 | 0.198 | 0.005 | 0.079 | 0.166 | 0.035
10 | Std | 0.012 | 0.012 | 0.008 | 0.008 | 0.002 | 0.008 | 0.009 | 0.005
15 | Mean | 0.341 | 0.343 | 0.158 | 0.158 | 0.001 | 0.072 | 0.118 | 0.016
15 | Std | 0.010 | 0.010 | 0.008 | 0.008 | 0.001 | 0.007 | 0.008 | 0.003
20 | Mean | 0.340 | 0.343 | 0.128 | 0.128 | 0.001 | 0.067 | 0.085 | 0.007
20 | Std | 0.010 | 0.010 | 0.008 | 0.008 | 0.001 | 0.007 | 0.007 | 0.003
4.3.3 Experiment 3: correlation and noise
Features | Statistic | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN | SUBiNN
---|---|---|---|---|---|---|---|---|---
20 | Mean | 0.463 | 0.465 | 0.055 | 0.055 | 0.001 | 0.003 | 0.063 | 0.027
20 | Std | 0.007 | 0.007 | 0.009 | 0.009 | 0.001 | 0.002 | 0.009 | 0.006
20 + 50 | Mean | 0.435 | 0.439 | 0.181 | 0.181 | 0.005 | 0.144 | 0.176 | 0.027
20 + 50 | Std | 0.031 | 0.031 | 0.031 | 0.031 | 0.003 | 0.011 | 0.018 | 0.006
20 + 100 | Mean | 0.373 | 0.376 | 0.190 | 0.190 | 0.016 | 0.161 | 0.191 | 0.027
20 + 100 | Std | 0.039 | 0.041 | 0.028 | 0.028 | 0.007 | 0.012 | 0.018 | 0.006
20 + 200 | Mean | 0.324 | 0.323 | 0.187 | 0.186 | 0.062 | 0.174 | 0.202 | 0.027
20 + 200 | Std | 0.046 | 0.048 | 0.032 | 0.032 | 0.020 | 0.014 | 0.020 | 0.007
20 + 500 | Mean | 0.285 | 0.281 | 0.182 | 0.182 | 0.120 | 0.186 | 0.218 | 0.027
20 + 500 | Std | 0.036 | 0.037 | 0.021 | 0.022 | 0.011 | 0.011 | 0.017 | 0.006

w | Statistic | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN | SUBiNN
---|---|---|---|---|---|---|---|---|---
3 | Mean | 0.372 | 0.374 | 0.156 | 0.155 | 0.049 | 0.140 | 0.166 | 0.049
3 | Std | 0.044 | 0.047 | 0.023 | 0.023 | 0.012 | 0.011 | 0.017 | 0.007
5 | Mean | 0.310 | 0.310 | 0.132 | 0.132 | 0.081 | 0.135 | 0.154 | 0.064
5 | Std | 0.037 | 0.039 | 0.015 | 0.015 | 0.006 | 0.010 | 0.013 | 0.007
10 | Mean | 0.207 | 0.203 | 0.106 | 0.107 | 0.087 | 0.120 | 0.130 | 0.082
10 | Std | 0.033 | 0.033 | 0.010 | 0.009 | 0.007 | 0.010 | 0.013 | 0.007
15 | Mean | 0.143 | 0.139 | 0.094 | 0.094 | 0.081 | 0.107 | 0.109 | 0.091
15 | Std | 0.023 | 0.022 | 0.008 | 0.008 | 0.007 | 0.008 | 0.011 | 0.007
20 | Mean | 0.117 | 0.114 | 0.086 | 0.087 | 0.076 | 0.100 | 0.100 | 0.097
20 | Std | 0.019 | 0.019 | 0.009 | 0.009 | 0.007 | 0.009 | 0.011 | 0.008
4.3.4 Experiment 4: varying correlation structure
4.4 Conclusion
5 Benchmark data study
5.1 Data
Name | Sample size | Features | Num/Cat/Nom | B-learners |
---|---|---|---|---|
Haberman (haber) | 306 | 3 | 3/0/0 | 6 |
Mammography (mammo) | 830 | 4 | 1/1/2 | 66 |
Transfusion (transf) | 748 | 4 | 4/0/0 | 10 |
Phoneme (phone) | 1000 | 5 | 5/0/0 | 15 |
Liver disorders\(^{\mathrm{a}}\) (bupa) | 345 | 5 | 5/0/0 | 15 |
Appendicitis (appen) | 106 | 7 | 7/0/0 | 28 |
Dystrophy (dystr) | 194 | 7 | 6/0/1 | 28 |
Pima Indians diabetes (diabe) | 768 | 8 | 8/0/0 | 36 |
Biopsy (biops) | 683 | 9 | 9/0/0 | 45 |
SAHeart (heart) | 462 | 9 | 8/1/0 | 45 |
Indian liver (Indian) | 579 | 10 | 9/0/1 | 55 |
Solar flare\(^{\mathrm{b}}\) (solar) | 323 | 10 | 7/0/3 | 231 |
Credit approval\(^{\mathrm{c}}\) (credit) | 653 | 15 | 7/0/8 | 820 |
House votes (house) | 232 | 16 | 0/16/0 | 136 |
Hepatitis (hepat) | 80 | 19 | 6/12/1 | 190 |
Two norm (twono) | 1000 | 20 | 20/0/0 | 210 |
Cylinder bands\(^{\mathrm{d}}\) (bands) | 365 | 24 | 18/0/5 | 300 |
German credit\(^{\mathrm{e}}\) (german) | 1000 | 24 | 24/0/0 | 300 |
Breast cancer (wpbc) | 194 | 33 | 33/0/0 | 561 |
Sonar (sonar) | 208 | 60 | 60/0/0 | 1830 |
Glaucoma (glauc) | 153 | 66 | 66/0/0 | 2211 |
Musk (musk) | 476 | 166 | 166/0/0 | 13861 |
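
For the purely numerical data sets in the table above, the B-learners column equals \(p(p+1)/2\) for p features, as one would expect if one base-learner is fitted per single feature and per feature pair. For data sets with nominal features the count is larger, and one reading, stated here as an assumption, is that such features are expanded into several columns (for instance by dummy coding) before the base-learners are formed. The Python sketch below (not the authors' R code, with hypothetical names) inverts the formula to recover the implied number of columns.

```python
import math


def encoded_columns(n_learners):
    """Solve p*(p+1)/2 = n_learners for p; return None if no integer p fits."""
    p = (math.isqrt(8 * n_learners + 1) - 1) // 2
    return p if p * (p + 1) // 2 == n_learners else None


# (abbreviation, listed features, B-learners) for a few rows of the table above.
ROWS = [("haber", 3, 6), ("mammo", 4, 66), ("solar", 10, 231),
        ("credit", 15, 820), ("musk", 166, 13861)]

if __name__ == "__main__":
    for name, features, learners in ROWS:
        print(f"{name}: {features} listed features, "
              f"{encoded_columns(learners)} implied columns")
```

For haber and musk the implied number of columns equals the listed number of features, whereas for mammo, solar and credit it is 11, 21 and 40, which exceeds the 4, 10 and 15 listed features.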
5.2 Models
5.3 Results
Data set | Statistic | kNN | BkNN | RkNN | MFS | RF | SVM | ESkNN | SUBiNN | B-learners
---|---|---|---|---|---|---|---|---|---|---
Mammo | Mean | 0.212 | 0.209 | 0.196 | 0.196 | 0.193 | 0.204 | 0.198 | 0.195 | 5.934
Mammo | Std | 0.005 | 0.005 | 0.003 | 0.003 | 0.003 | 0.005 | 0.005 | 0.003 | 
Phone | Mean | 0.202 | 0.204 | 0.202 | 0.202 | 0.143 | 0.185 | 0.211 | 0.177 | 5.908
Phone | Std | 0.005 | 0.004 | 0.004 | 0.005 | 0.005 | 0.005 | 0.007 | 0.004 | 
Dystr | Mean | 0.149 | 0.147 | 0.141 | 0.141 | 0.109 | 0.110 | 0.129 | 0.115 | 5.568
Dystr | Std | 0.009 | 0.008 | 0.009 | 0.009 | 0.008 | 0.008 | 0.013 | 0.007 | 
Diabe | Mean | 0.253 | 0.249 | 0.261 | 0.261 | 0.235 | 0.240 | 0.247 | 0.241 | 5.957
Diabe | Std | 0.006 | 0.006 | 0.005 | 0.005 | 0.005 | 0.006 | 0.006 | 0.006 | 
Credi | Mean | 0.205 | 0.200 | 0.177 | 0.177 | 0.122 | 0.148 | 0.136 | 0.136 | 9.011
Credi | Std | 0.006 | 0.005 | 0.005 | 0.004 | 0.004 | 0.005 | 0.006 | 0.000 | 
Hepat | Mean | 0.149 | 0.137 | 0.148 | 0.149 | 0.112 | 0.139 | 0.417 | 0.131 | 7.090
Hepat | Std | 0.018 | 0.018 | 0.011 | 0.010 | 0.015 | 0.017 | 0.093 | 0.019 | 
Germa | Mean | 0.272 | 0.274 | 0.296 | 0.296 | 0.235 | 0.242 | 0.271 | 0.264 | 7.052
Germa | Std | 0.004 | 0.004 | 0.002 | 0.002 | 0.005 | 0.005 | 0.008 | 0.005 | 
Glauc | Mean | 0.188 | 0.178 | 0.177 | 0.177 | 0.095 | 0.150 | 0.182 | 0.068 | 5.587
Glauc | Std | 0.014 | 0.012 | 0.011 | 0.011 | 0.007 | 0.008 | 0.019 | 0.007 | 
Haber | Mean | 0.255 | 0.252 | 0.269 | 0.269 | 0.279 | 0.268 | 0.259 | 0.258 | 3.574
Haber | Std | 0.007 | 0.008 | 0.008 | 0.008 | 0.009 | 0.008 | 0.011 | 0.009 | 
Trans | Mean | 0.208 | 0.206 | 0.231 | 0.231 | 0.245 | 0.210 | 0.234 | 0.232 | 4.902
Trans | Std | 0.005 | 0.005 | 0.004 | 0.004 | 0.006 | 0.004 | 0.006 | 0.005 | 
Bupa | Mean | 0.210 | 0.212 | 0.232 | 0.232 | 0.207 | 0.196 | 0.226 | 0.221 | 5.495
Bupa | Std | 0.006 | 0.005 | 0.005 | 0.005 | 0.009 | 0.006 | 0.011 | 0.009 | 
Biops | Mean | 0.030 | 0.031 | 0.029 | 0.029 | 0.027 | 0.040 | 0.034 | 0.031 | 14.508
Biops | Std | 0.002 | 0.001 | 0.000 | 0.000 | 0.002 | 0.002 | 0.003 | 0.002 | 
Heart | Mean | 0.288 | 0.288 | 0.305 | 0.306 | 0.313 | 0.287 | 0.302 | 0.295 | 6.333
Heart | Std | 0.009 | 0.007 | 0.007 | 0.007 | 0.010 | 0.008 | 0.013 | 0.009 | 
India | Mean | 0.306 | 0.306 | 0.281 | 0.281 | 0.294 | 0.291 | 0.290 | 0.291 | 8.300
India | Std | 0.008 | 0.008 | 0.003 | 0.003 | 0.010 | 0.005 | 0.010 | 0.007 | 
House | Mean | 0.223 | 0.219 | 0.187 | 0.187 | 0.221 | 0.203 | 0.225 | 0.201 | 4.885
House | Std | 0.016 | 0.015 | 0.004 | 0.004 | 0.013 | 0.009 | 0.034 | 0.014 | 
Sonar | Mean | 0.276 | 0.279 | 0.259 | 0.258 | 0.161 | 0.174 | 0.227 | 0.235 | 12.873
Sonar | Std | 0.012 | 0.013 | 0.011 | 0.012 | 0.013 | 0.013 | 0.019 | 0.016 | 
Musk | Mean | 0.221 | 0.219 | 0.188 | 0.187 | 0.101 | 0.106 | 0.177 | 0.207 | 18.288
Musk | Std | 0.008 | 0.007 | 0.008 | 0.007 | 0.007 | 0.007 | 0.013 | 0.011 | 
Appen | Mean | 0.126 | 0.126 | 0.137 | 0.136 | 0.132 | 0.134 | 0.137 | 0.139 | 4.417
Appen | Std | 0.011 | 0.008 | 0.006 | 0.006 | 0.007 | 0.009 | 0.016 | 0.013 | 
Solar | Mean | 0.173 | 0.172 | 0.174 | 0.174 | 0.176 | 0.171 | 0.174 | 0.175 | 8.225
Solar | Std | 0.002 | 0.002 | 0.001 | 0.001 | 0.004 | 0.003 | 0.003 | 0.002 | 
Twono | Mean | 0.026 | 0.026 | 0.027 | 0.027 | 0.033 | 0.025 | 0.044 | 0.111 | 23.024
Twono | Std | 0.002 | 0.002 | 0.002 | 0.002 | 0.003 | 0.002 | 0.005 | 0.007 | 
Bands | Mean | 0.320 | 0.316 | 0.319 | 0.319 | 0.234 | 0.290 | 0.339 | 0.366 | 11.093
Bands | Std | 0.012 | 0.009 | 0.008 | 0.008 | 0.013 | 0.009 | 0.018 | 0.011 | 
Wpbc | Mean | 0.241 | 0.245 | 0.239 | 0.239 | 0.208 | 0.232 | 0.233 | 0.246 | 9.157
Wpbc | Std | 0.008 | 0.008 | 0.004 | 0.004 | 0.010 | 0.008 | 0.014 | 0.012 | 