Abstract
The \(k\)-Nearest Neighbour classifier is widely used and popular due to its inherent simplicity and its avoidance of model assumptions. Although the approach has been shown to yield near-optimal classification performance in the limit of infinitely many samples, selecting the most decisive data points can improve classification accuracy considerably in real settings with a limited number of samples. At the same time, selecting a subset of representative training samples reduces the required amount of storage and computational resources. We devised a new approach that selects a representative training subset on the basis of an evolutionary optimization procedure. This method chooses those training samples that have a strong influence on the correct prediction of other training samples, in particular those that have uncertain labels. The performance of the algorithm is evaluated on different data sets. Additionally, we provide graphical examples of the selection procedure.
References
Angiulli F (2005) Fast condensed nearest neighbor rule. In: De Raedt L, Wrobel S (eds) ICML, ACM pp 25–32
Bhattacharya B, Kaller D (1998) Reference set thinning for the k-nearest neighbor decision rule. In: Proceedings of the 14th international conference on pattern recognition, vol 1. ICPR ’98. IEEE Computer Society, pp 238–242
Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Min Knowl Discov 6(2):153–172
Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
Dasarathy BV (1994) Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design. IEEE Trans Syst Man Cybern 24(3):511–517
Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, Berlin
Fix E, Hodges J (1951) Discriminatory analysis: nonparametric discrimination: consistency properties. Technical report project 21–49-004, Report number 4, USAF School of Aviation Medicine, Randolf Field, Texas
Frank A, Asuncion A (2010) UCI machine learning repository http://archive.ics.uci.edu/ml
Gates G (1972) The reduced nearest neighbor rule. IEEE Trans Inf Theory 18(3):431–433
Gil-Pita R, Yao X (2007) Using a genetic algorithm for editing k-nearest neighbor classifiers. In: Yin H, Tiño P, Corchado E, Byrne W, Yao X (eds) IDEAL (Lecture Notes in Computer Science), vol 4881. Springer pp 1141–1150
Hart P (1968) The condensed nearest neighbor rule. IEEE Trans Inf Theory 14:515–516
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor
Horton P, Nakai K (1996) A probabilistic classification system for predicting the cellular localization sites of proteins. In: States D, Agarwal P, Gaasterland T, Hunter L, Smith R (eds) Proceedings of the fourth international conference on intelligent systems for molecular Biology. AAAI Press pp 109–115
Kohonen T (1988) Learning vector quantization. Neural Netw 1:303
Kohonen T (1995) Learning vector quantization. In: Arbib M (ed) The handbook of brain theory and neural networks. MIT Press, Cambridge, pp 537–540
Kuncheva L (1995) Editing for the k-nearest neighbors rule by a genetic algorithm. Pattern Recognit Lett 16(8):809–814
Kuncheva L (1997) Fitness functions in editing k-nn reference set by genetic algorithms. Pattern Recognit 30(6):1041–1049
Kuncheva L, Jain L (1999) Nearest neighbor classifier: Simultaneous editing and feature selection. Pattern Recognit Lett 20(11–13):1149–1156
Langford J (2005) Tutorial on practical prediction theory for classification. J Mach Learn Res 6:273–306
Li D, Simske S (2011) Training set compression by incremental clustering. J Pattern Recognit Res 6(1):56–63
Littlestone N, Warmuth M (1986) Relating data compression and learnability. Unpublished manuscript
Schwenker F, Kestler HA (2002) Analysis of support vectors helps to identify borderline patients in classification studies. Comput Cardiol 29:305–308
Sigillito VG, Wing SP, Hutton LV, Baker KB (1989) Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Tech Dig 10(3):262–266
Street WN, Wolberg WH, Mangasarian OL (1993) Nuclear feature extraction for breast tumor diagnosis. In: Acharya R, Goldgof D (eds) International symposium on electronic imaging: science and technology, vol 1905. SPIE Conference Series, pp 861–870
Webb A (2002) Statistical pattern recognition, 2nd edn. Wiley, London
Wilson D (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 2(3):408–421
Acknowledgments
This work is supported by the Graduate School of Mathematical Analysis of Evolution, Information and Complexity at the University of Ulm (CM, HAK) and by the German federal ministry of education and research (BMBF) within the framework of the program of medical genome research (PaCa-Net; project ID PKB-01GS08) and GerontoSys (Forschungskern SyStaR). AM was supported by the Ministry of Education and Science of the Russian Federation under the government contract 13.G25.31.0017 with the joint-stock company “Academician M.F. Reshetnev Information Satellite Systems”.
Ludwig Lausser and Christoph Müssel: contributed equally.
Appendix
The accuracy term \(A\) given in Eq. 10 is inspired by the idea of sample compression bounds (Littlestone and Warmuth 1986). These bounds can be used to give an upper limit to the true classification error probability \(\mathcal{E }\) of a data-dependent classifier \(c\). A classifier is called data-dependent if it can be reconstructed directly from a certain subset of training samples \(\mathcal{R } \subseteq \mathcal{T }\) (called a compression set in this context), i.e. \(c_\mathcal{R } = c_\mathcal{T }\). This is clearly true for \(k\)-NN-like classifiers following Eq. 3, as they can be reconstructed from their reference set. A sample compression bound estimates the true error probability \(\mathcal{E }\) from the empirical error rate E calculated on the remaining samples \(\mathcal T {\setminus }\mathcal{R }\).
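The data-dependence property can be made concrete with a minimal sketch (hypothetical toy data, not the paper's Eq. 3 verbatim): a \(k\)-NN classifier is fully specified by its reference set, so a classifier built from the compression set \(\mathcal{R}\) alone makes the same predictions as one carrying the full training set, provided the dropped samples are not among the nearest neighbours.

```python
from collections import Counter
import math

def knn_predict(reference, x, k=3):
    """Majority vote among the k reference samples closest to x.

    `reference` is a list of (features, label) pairs; the classifier
    is fully determined by this set, i.e. it is data-dependent and
    can be reconstructed from the reference set alone.
    """
    nearest = sorted(reference, key=lambda s: math.dist(s[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy training set T and a hypothetical compression set R subset of T:
T = [((0.0, 0.0), 'a'), ((0.1, 0.2), 'a'), ((1.0, 1.0), 'b'),
     ((0.9, 1.1), 'b'), ((0.5, 0.4), 'a')]
R = T[:4]

# Predictions from R match those of the full set for this query point:
print(knn_predict(R, (0.2, 0.1), k=3))  # -> 'a'
print(knn_predict(T, (0.2, 0.1), k=3))  # -> 'a'
```

The empirical error rate \(E\) of such a classifier is then evaluated on the held-out samples \(\mathcal{T}\setminus\mathcal{R}\), which the classifier never "stores".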
Theorem 1
(e.g. Langford 2005) For a random sample \(\mathcal{T}\) of iid examples drawn from an arbitrary, but fixed distribution \(\mathcal{D}\) and for all \(\delta \in \left(0,1\right]\),
$$\mathbb{P}_{\mathcal{T}\sim\mathcal{D}}\left(\forall \mathcal{R}\subseteq\mathcal{T}:\; \mathcal{E} \le \mathrm{{\overline{Bin}}}\left(E,\, |\mathcal{T}\setminus\mathcal{R}|,\, \frac{\delta}{|\mathcal{T}|\binom{|\mathcal{T}|}{|\mathcal{R}|}}\right)\right) \ge 1-\delta.$$
Proof
Theorem 1 is a direct application of the sample compression bound given in Langford (2005). \(\square \)
The term \(\mathrm{{\overline{Bin}}}\) in Theorem 1 denotes the binomial tail inversion, i.e. the largest true error probability \(p\) under which the observed number of empirical errors is still plausible at confidence level \(\delta\): \(\mathrm{{\overline{Bin}}}(E, m, \delta) = \max \left\{ p : \mathrm{Bin}(E, m, p) \ge \delta \right\}\), where \(\mathrm{Bin}(E, m, p)\) is the probability of observing at most \(E \cdot m\) errors in \(m\) Bernoulli trials with error probability \(p\).
Figure 3 shows an application of the sample compression bound to a training set of size \(|\mathcal{T }|=100\). The bound is shown for fixed empirical error rates of 0.01, 0.05 and 0.1. The bound suggests that of all \(k\)-NN classifiers achieving the same empirical error, the classifier with the smallest reference set is most reliable.
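The binomial tail inversion has no closed form, but since the binomial CDF is monotone in \(p\) it can be inverted by bisection. The following sketch (the placement of \(|\mathcal{R}|\) in the confidence term follows the sample compression bound in Langford 2005; the reference-set sizes are illustrative) evaluates the bound for a training set of size 100:

```python
import math

def binom_cdf(m, k, p):
    """P(X <= k) for X ~ Binomial(m, p)."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i)
               for i in range(k + 1))

def bin_inv(m, k, delta, tol=1e-9):
    """Binomial tail inversion: largest p with P(X <= k) >= delta,
    found by bisection (binom_cdf is decreasing in p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(m, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

n_train, delta = 100, 0.05
for emp_err in (0.01, 0.05, 0.1):
    for r in (5, 20, 50):                     # reference-set size |R|
        m = n_train - r                       # held-out samples
        k = round(emp_err * m)                # observed errors
        conf = delta / (n_train * math.comb(n_train, r))
        print(f"|R|={r:2d}  E={emp_err:.2f}  "
              f"bound={bin_inv(m, k, conf):.3f}")
```

The printed bounds grow with \(|\mathcal{R}|\) at fixed empirical error, matching the conclusion drawn from Fig. 3: smaller reference sets yield tighter guarantees.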
Lausser, L., Müssel, C., Melkozerov, A. et al. Identifying predictive hubs to condense the training set of \(k\)-nearest neighbour classifiers. Comput Stat 29, 81–95 (2014). https://doi.org/10.1007/s00180-012-0379-0