New rank methods for reducing the size of the training set using the nearest neighbor rule
Highlights
► New rank methods to select the best prototypes from a training set are proposed. ► They compute each prototype's probability of helping to classify a new sample correctly. ► A relevance factor from 0 to 1 is used to select the best candidates. ► The results show that the final error rate is maintained while the size of the training set is reduced.
Introduction
In classification problems, a statistical knowledge of the conditional density functions of each class is rarely available, so application of the optimal Bayes classification methods is not possible. The nearest neighbor (NN) rule and its extension (k-nearest neighbors) have been the most widely used non-parametric classifiers in practice.
The advantage of the NN rule lies in the fact that it combines conceptual simplicity with an asymptotic error rate that is conveniently bounded in terms of the optimal Bayes error (Cover and Hart, 1967). However, the NN rule also presents some problems when the number of prototypes is large, since it needs to store all the examples in memory and is sensitive to noisy instances. Many researchers have studied how to reduce the training set while retaining the classification ability obtained when the whole training set is used (Wilson and Martinez, 2000, Wilson, 1972, Hart, 1968).
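For readers unfamiliar with the rule, the 1-NN classifier can be sketched in a few lines. This is a minimal illustrative sketch, not the implementation used in the paper:

```python
import math

def nn_classify(train, labels, x):
    """1-NN rule: return the label of the training prototype closest to x."""
    best = min(range(len(train)), key=lambda i: math.dist(train[i], x))
    return labels[best]
```

The memory and sensitivity problems mentioned above follow directly from this sketch: every prototype in `train` must be stored, and a single mislabeled prototype near `x` decides the outcome.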
There are two different ways to deal with the instance reduction problem (Li et al., 2005):
- Prototype generation, which creates a new prototype set (Chang, 1974).
- Prototype selection, which consists in selecting a particular subset of prototypes from the original training set:
  - Condensing or reducing algorithms give a minimal subset of prototypes that leads to the same performance as when the whole training set is used.
  - Editing algorithms eliminate atypical prototypes from the original set and remove overlapping among classes.
For condensing algorithms, the key problem is deciding which examples should be retained; the various condensing algorithms differ in how the typicality of training examples is evaluated. Condensing algorithms emphasize minimizing the size of the training set and keeping it consistent, but noisy examples may often be selected for the prototype set and harm the classification accuracy. For editing algorithms, the main challenge is identifying the 'bad' training examples that harm the accuracy.
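The editing idea can be made concrete with Wilson's edited nearest neighbor rule (Wilson, 1972): a prototype is discarded when its k nearest neighbors disagree with its label. This is a simplified sketch of that classical scheme, not one of the algorithms proposed in this paper:

```python
import math
from collections import Counter

def wilson_editing(train, labels, k=3):
    """Wilson's ENN editing: keep only prototypes whose k nearest
    neighbours (among the other prototypes) agree with their label."""
    keep = []
    for i, x in enumerate(train):
        others = [j for j in range(len(train)) if j != i]
        others.sort(key=lambda j: math.dist(train[j], x))
        vote = Counter(labels[j] for j in others[:k]).most_common(1)[0][0]
        if vote == labels[i]:
            keep.append(i)
    return keep
```

A noisy prototype sitting inside the region of another class is outvoted by its neighbors and removed, which is exactly the 'bad example' identification described above.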
In this paper, new rank methods are proposed to reduce the size of the training set. The ranking is based on estimating the probability that an instance participates in a correct classification under the nearest neighbor rule, so with this methodology the user can control the size of the resulting training set.
In the second section, some classical methodologies to reduce the training set are explained. In the third section, our new methodologies are introduced with their detailed algorithms. In the fourth section, the results obtained when applying different algorithms to reduce the size of the training set, and their associated error rates, are shown on some widely used data collections. Finally, some conclusions and future lines of work are presented.
Section snippets
Condensed Nearest Neighbor Rule
The Condensed Nearest Neighbor Rule (CNN) (Hart, 1968) was one of the first techniques to reduce the size of the training set. This algorithm gives a subset S of the training set T such that every member of T is closer to a member of S of the same class than to a member of a different class.
The algorithm selects a sample randomly from T. If this sample is incorrectly classified using S, it is added to S. The process is repeated until all samples belonging to T have been selected and are correctly classified using S.
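The CNN procedure just described can be sketched as follows. For simplicity this sketch seeds S with the first sample and scans T in a fixed order rather than randomly, so it is an illustrative variant of Hart's algorithm, not a faithful reproduction:

```python
import math

def cnn_condense(train, labels):
    """Hart's CNN (simplified): grow a subset S, adding a sample whenever
    the current S misclassifies it; repeat passes over T until a full pass
    adds nothing."""
    S = [0]  # seed with the first sample (Hart picks randomly)
    changed = True
    while changed:
        changed = False
        for i, x in enumerate(train):
            if i in S:
                continue
            nearest = min(S, key=lambda j: math.dist(train[j], x))
            if labels[nearest] != labels[i]:
                S.append(i)
                changed = True
    return S
```

On two well-separated clusters, one prototype per class survives, which illustrates why CNN tends to retain points near class boundaries and discard interior ones.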
Our methodology
In order to illustrate the main idea of our proposed methodology, a binary classification problem and a distribution of training samples, T, are considered. In the figures shown in this section, the points of class 1 are indicated by circles and the points of class 2 by rectangles. The main idea is that the prototypes in the training set vote for the rest of the prototypes that help them to classify correctly, and the method estimates a probability for each prototype that shows its importance in a correct classification.
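The voting idea can be sketched as follows. This is a deliberately simplified stand-in (each sample votes for its nearest same-class neighbor, i.e. a prototype that would classify it correctly); the paper's actual methods use different heuristics to decide which prototype receives the vote, and the relevance factor `rho` here plays the role of the 0-to-1 factor mentioned in the highlights:

```python
import math

def rank_by_votes(train, labels, rho=0.5):
    """Simplified voting rank: each sample votes for its nearest neighbour
    of the same class; vote counts are normalised into a per-prototype
    score and the top rho fraction of prototypes is kept."""
    n = len(train)
    votes = [0] * n
    for i, x in enumerate(train):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            continue
        votes[min(same, key=lambda j: math.dist(train[j], x))] += 1
    total = sum(votes) or 1
    scores = [v / total for v in votes]            # estimated importance
    order = sorted(range(n), key=lambda i: -scores[i])
    return order[: max(1, round(rho * n))]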
Results
The experiments were performed using two well known isolated handwritten character databases and the UCI Machine Learning Repository (Asuncion and Newman, 2007). The first is a database of uppercase characters (the NIST Special Database 3 of the National Institute of Standards and Technology) and the second contains digits (the US Post Service digit recognition corpus, USPS). In both cases, the classification task was performed using contour descriptions with Freeman codes (Freeman, 1961) and
Conclusions and future work
In this paper, new algorithms to reduce the size of the training set for use in a classification task have been presented. These algorithms give a different estimate of the probability that a new example may be classified correctly by a training set example. The results obtained for accuracy are in most cases better than those obtained using the classical algorithms compared. Note the good performance shown by 1-FN, 2-FN and 1-NE algorithms with respect to the classification accuracy and the
References (14)
- Asuncion, A., Newman, D., 2007. UCI machine learning repository. URL...
- Finding prototypes for nearest neighbour classifiers. IEEE Trans. Comput. (1974)
- Nearest neighbor pattern classification. IEEE Trans. Inform. Theory (1967)
- Nearest neighbour editing and condensing tools-synergy exploitation. Pattern Anal. Appl. (2000)
- Considerations about sample-size sensitivity of a family of edited nearest-neighbor rules. IEEE Trans. Syst. Man Cybernet. Part B (1999)
- On the encoding of arbitrary geometric configurations. IRE Trans. Electron. Comput. (1961)
- The reduced nearest neighbor rule. IEEE Trans. Inform. Theory (1972)