New rank methods for reducing the size of the training set using the nearest neighbor rule

https://doi.org/10.1016/j.patrec.2011.07.019

Abstract

Some new rank methods to select the best prototypes from a training set are proposed in this paper, with the aim of setting its size through an external parameter while maintaining the classification accuracy. Traditional methods that filter the training set in a classification task, such as editing or condensing, apply rules to the set in order to remove outliers or to keep the prototypes that help in the classification. In our approach, new voting methods are proposed to compute, for each prototype, the probability that it helps to classify a new sample correctly. This probability is the key to sorting the training set: a relevance factor between 0 and 1 is used to select, for each class, the best candidates whose accumulated probabilities remain below that parameter. This approach makes it possible to select the number of prototypes needed to maintain or even increase the classification accuracy. The results obtained on several high-dimensional databases show that these methods maintain the final error rate while reducing the size of the training set.

Highlights

► New rank methods to select the best prototypes from a training set are proposed. ► They compute the prototype probability and help to classify a new sample correctly. ► A relevance factor from 0 to 1 is used to select the best candidates. ► The results show that the final error rate is maintained while the size of the training set is reduced.

Introduction

In classification problems, a statistical knowledge of the conditional density functions of each class is rarely available, so application of the optimal Bayes classification methods is not possible. The nearest neighbor (NN) rule and its extension (k-nearest neighbors) have been the most widely used non-parametric classifiers in practice.
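For reference, the NN rule itself admits a very small implementation. The sketch below is not taken from the paper; the Euclidean distance and the array layout are assumptions made purely for illustration.

    import numpy as np

    def nn_classify(x, prototypes, labels):
        """Classify x with the 1-NN rule: return the label of the
        closest prototype (Euclidean distance assumed here)."""
        distances = np.linalg.norm(prototypes - x, axis=1)
        return labels[int(np.argmin(distances))]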

The advantage of the NN rule lies in the fact that it combines its conceptual simplicity with an asymptotic error rate that is conveniently bounded in terms of the optimal Bayes error (Cover and Hart, 1967). However, the NN rule also presents some problems when the number of prototypes is large, since it needs to store all the examples in a memory and is sensitive to noisy instances. Many researchers have studied how to reduce the training set and obtain the same classification ability as when the whole training set is used (Wilson and Martinez, 2000, Wilson, 1972, Hart, 1968).

There are two different ways to deal with the instance reduction problem (Li et al., 2005):

  • New prototype generation, which creates a new prototype set (Chang, 1974).

  • Prototype selection, which consists of selecting a particular subset of prototypes from the original training set:

    • Condensing (or reducing) algorithms give the minimal subset of prototypes that leads to the same performance as when the whole training set is used.

    • Editing algorithms eliminate atypical prototypes from the original set and remove overlap among classes.

For condensing algorithms, the key problem is deciding which examples should be retained; condensing algorithms differ in how the typicality of training examples is evaluated. They place more emphasis on minimizing the size of the training set and on its consistency, but noisy examples may be selected for the prototype set and harm the classification accuracy. For editing algorithms, identifying the ‘bad’ training examples that harm the accuracy is the most important challenge.

In this paper, some new rank methods are proposed in order to reduce the size of the training set. The rank is based on an estimate of the probability that an instance participates in a correct classification under the nearest neighbor rule, so with this methodology the user can control the size of the resulting training set.

In the second section, some classical methodologies to reduce the training set are explained. In the third section, our new methodologies are introduced together with their detailed algorithms. In the fourth section, the results obtained when applying different algorithms to reduce the size of the training set, and their associated error rates on some widely used data collections, are shown. Finally, some conclusions and future lines of work are presented.

Section snippets

Condensed Nearest Neighbor Rule

The Condensed Nearest Neighbor Rule (CNN) (Hart, 1968) was one of the first techniques to reduce the size of the training set. This algorithm gives a subset S of the training set T such that every member of T is closer to a member of S of the same class than to a member of a different class.

The algorithm selects a sample randomly from T. If this sample is incorrectly classified using S, it is added to S. The process is repeated until all samples belonging to T have been selected and are correctly classified.
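A minimal sketch of Hart's CNN procedure as described above follows; the random scan order, the Euclidean distance and the NumPy data layout are implementation assumptions rather than details prescribed by the original algorithm.

    import random
    import numpy as np

    def cnn_condense(T, labels, seed=0):
        """Hart's Condensed NN rule (sketch): build a subset S of T such
        that every sample in T is correctly classified by its nearest
        neighbour in S."""
        rng = random.Random(seed)
        order = list(range(len(T)))
        rng.shuffle(order)                        # samples are taken randomly from T
        S = [order[0]]                            # start with a single arbitrary sample
        changed = True
        while changed:                            # repeat until a full pass adds nothing
            changed = False
            for i in order:
                d = np.linalg.norm(T[S] - T[i], axis=1)
                nearest = S[int(np.argmin(d))]
                if labels[nearest] != labels[i]:  # misclassified by S: add the sample
                    S.append(i)
                    changed = True
        return S                                  # indices of the condensed subset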

Our methodology

In order to illustrate the main idea of our proposed methodology, a binary classification problem with a distribution of training samples, T, is considered. In the figures shown in this section, the points of class 1 are indicated by circles and the points of class 2 by rectangles. The main idea is that the prototypes in the training set vote for the other prototypes that help them to be classified correctly, and the method estimates a probability for each prototype that reflects its importance in a correct classification.
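To make the rank-and-select idea concrete, the schematic sketch below uses a simple stand-in vote: each prototype votes for the nearest prototype of its own class. This voting rule is an assumption made here for illustration only, not one of the voting heuristics developed in the paper. The per-class normalisation of the votes and the selection of candidates while their accumulated probability stays below the relevance factor follow the description given in the abstract.

    import numpy as np

    def rank_and_select(T, labels, relevance=0.5):
        """Schematic rank method: accumulate votes for 'helpful' prototypes,
        turn them into per-class probabilities, and keep the best-ranked
        prototypes of each class while their accumulated probability stays
        below the relevance factor."""
        n = len(T)
        votes = np.zeros(n)
        for i in range(n):
            same = [j for j in range(n) if j != i and labels[j] == labels[i]]
            if not same:
                continue
            d = np.linalg.norm(T[same] - T[i], axis=1)
            votes[same[int(np.argmin(d))]] += 1   # stand-in vote for a helper prototype

        selected = []
        for c in set(labels):
            idx = [i for i in range(n) if labels[i] == c]
            total = votes[idx].sum()
            if total == 0:
                continue
            probs = votes[idx] / total            # per-class probability estimate
            acc = 0.0
            for k in np.argsort(-probs):          # best candidates first
                if acc >= relevance:
                    break
                selected.append(idx[int(k)])
                acc += probs[int(k)]
        return selected                           # indices of the reduced training set

With relevance = 1.0 the whole per-class ranking would be kept, while smaller values give progressively more aggressive reductions of the training set.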

Results

The experiments were performed using two well-known isolated handwritten character databases and the UCI Machine Learning Repository (Asuncion and Newman, 2007). The first is a database of uppercase characters (the NIST Special Database 3 of the National Institute of Standards and Technology) and the second contains digits (the US Postal Service digit recognition corpus, USPS). In both cases, the classification task was performed using contour descriptions with Freeman codes (Freeman, 1961) and

Conclusions and future work

In this paper, new algorithms to reduce the size of the training set for use in a classification task have been presented. These algorithms give different estimates of the probability that a new example may be classified correctly by a training set example. The results obtained for accuracy are in most cases better than those obtained with the classical algorithms compared. Note the good performance shown by the 1-FN, 2-FN and 1-NE algorithms with respect to the classification accuracy and the size of the resulting training set.

References (14)

  • Asuncion, A., Newman, D., 2007. UCI Machine Learning Repository. URL...
  • Chang, C., 1974. Finding prototypes for nearest neighbour classifiers. IEEE Trans. Comput.
  • Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory.
  • Dasarathy, B.V., et al., 2000. Nearest neighbour editing and condensing tools-synergy exploitation. Pattern Anal. Appl.
  • Ferri, F.J., et al., 1999. Considerations about sample-size sensitivity of a family of edited nearest-neighbor rules. IEEE Trans. Syst. Man Cybernet. Part B.
  • Freeman, H., 1961. On the encoding of arbitrary geometric configurations. IRE Trans. Electron. Comput.
  • Gates, G., 1972. The reduced nearest neighbor rule. IEEE Trans. Inform. Theory IT-.

Cited by (35)

  • Extensions to rank-based prototype selection in k-Nearest Neighbour classification

    2019, Applied Soft Computing Journal
    Citation excerpt:

    each prototype of the training set votes for the other elements which lead to its correct classification. The rank methods considered in this paper focus on the two voting heuristics proposed so far, to our best knowledge: Farthest Neighbour (FN) and Nearest to Enemy (NE) [15]. Both strategies are based on the idea that a prototype can only vote for another element, and the point is to decide which prototype the vote is given to.

  • Weighted Reward-Punishment Editing

    2016, Pattern Recognition Letters
    Citation excerpt:

    In [5] a novel clustering approach called supervised clustering is introduced that is based on the idea of replacing a dataset with a set of cluster prototypes that is determined by a supervised clustering procedure. Two prototype selection methods based on ranking are proposed in [26] and [27]. The first prototype method is based on new voting methods that compute a prototype probability that aids in the correct classification of a new sample.
