Hubness-aware kNN classification of high-dimensional data in presence of label noise
Introduction
Designing effective and robust supervised learning algorithms for classification in the presence of label noise is an important practical issue, as obtaining reliable data labels is often expensive or simply infeasible due to data size in large-scale systems [18].
Classification noise can be random, feature-dependent or adversarial. Label flip probabilities can be either uniform and symmetric or depend on particular classes and class pairs. The simple random classification noise (RCN) model was first introduced in [2] as a model of how non-adversarial noise might affect the data. Given a training set T of labeled examples and a noise rate η, T_η denotes the distribution corresponding to T corrupted with random classification noise at rate η. A draw from T_η is equivalent to a uniformly random draw from T in which the label y of the selected pair (x, y) is flipped with probability η.
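As a concrete illustration of the RCN model, a minimal sketch (the function name and defaults are illustrative, not from the paper):

```python
import random

def corrupt_labels(labels, eta, classes, rng=None):
    """Random classification noise (RCN): flip each label with
    probability eta to a uniformly chosen different class."""
    rng = rng or random.Random(0)
    return [rng.choice([c for c in classes if c != y])
            if rng.random() < eta else y
            for y in labels]
```

For binary labels this reduces to flipping the label with probability η; at η = 1/2 the labels carry no information, which is why RCN is usually studied for η < 1/2.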
The issue of unreliable and noisy labels can be approached in two ways: by trying to identify and correct or eliminate suspect data points, or by incorporating noise into the learning model. Neither approach is trivial, as it is not always easy to distinguish mislabeled examples from exceptions to general rules, i.e. atypical data points. When an instance lies far from its class interior and in the proximity of instances from other classes, it can sometimes be mistaken for a mislabeled point [43]. Yet atypical points sometimes hold valuable discriminative information, as they might help in defining proper class boundaries for classification. Additionally, many filtering approaches assume that all the data is available at the filtering stage and is not prohibitively large [66].
Instead of filtering or explicit noise source modeling, it is also possible to design learning techniques that exhibit implicit robustness to high rates of label noise. In this paper, we demonstrate that the recently proposed hubness-aware k-nearest neighbor classification methods [38], [53], [52], [49] can be used for robust classification of intrinsically high-dimensional data in the presence of label noise. This robustness is a consequence of the fact that, unlike in most kNN approaches, neighbor instances do not vote by their labels at classification time. Instead, their votes are determined by their past occurrences in the training data.
Hubness [39] is a ubiquitous property of intrinsically high-dimensional data. With increasing dimensionality, the degree distribution of the kNN graph becomes increasingly skewed and hubs emerge as central and influential points among the data. The kNN graph itself assumes a scale-free-like topology. This has multiple consequences for similarity-based learning methods, and for k-nearest neighbor methods in particular. It also changes how random labeling noise affects the learning process: errors in hub point labels can induce severe misclassification, while errors in orphan or regular points have little influence on the classification accuracy of kNN methods.
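The emergence of hubs can be observed directly by measuring the skewness of the k-occurrence distribution on synthetic data. The following sketch (pure Python; sample sizes and dimensionalities are illustrative) contrasts low- and high-dimensional Gaussian samples:

```python
import math
import random
from statistics import mean, pstdev

def k_occurrences(points, k=5):
    """Count how often each point appears among the k nearest
    neighbors of the other points (the N_k degree distribution)."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

def skewness(xs):
    """Standardized third moment of a sample."""
    m, s = mean(xs), pstdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)

rng = random.Random(1)
low  = [[rng.gauss(0, 1) for _ in range(3)]   for _ in range(200)]
high = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(200)]
# Skewness of N_k grows with intrinsic dimensionality: hubs emerge.
print(skewness(k_occurrences(low)), skewness(k_occurrences(high)))
```

In the high-dimensional sample the occurrence distribution is markedly right-skewed: a few points accumulate many occurrences while most become anti-hubs.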
In this paper, we introduce the concept of hubness-proportional random label noise as an adversarial noise model where the most influential points in the data are most likely to be corrupted. The probability of a label flip is set to be proportional to the neighbor occurrence frequency of the data point. Hubness-proportional random label noise models how a potentially successful malicious attempt can compromise the most relevant and influential neighbor points in order to disrupt kNN-based retrieval, recommendation or prediction systems.
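One way to instantiate this noise model (a sketch; the sampling scheme and names are illustrative) is to pick the points to corrupt by weighted sampling without replacement, with weights equal to the k-occurrence counts:

```python
import random

def hubness_proportional_flips(labels, occ_counts, n_flips, classes, rng=None):
    """Flip n_flips labels, choosing victims with probability
    proportional to their k-occurrence counts, so that hubs are the
    most likely targets.  Points that never occur as neighbors
    (count 0) are never selected, so n_flips must not exceed the
    number of points with nonzero counts."""
    rng = rng or random.Random(0)
    noisy = list(labels)
    total = float(sum(occ_counts))
    chosen = set()
    while len(chosen) < n_flips:
        # rejection sampling: a redraw happens if the point was
        # already picked, which preserves proportionality among
        # the remaining candidates
        r = rng.random() * total
        acc = 0.0
        for i, c in enumerate(occ_counts):
            acc += c
            if acc > r:
                chosen.add(i)
                break
    for i in chosen:
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy
```

Under this scheme a hub occurring a hundred times as a neighbor is a hundred times more likely to be mislabeled than a point occurring once, which concentrates the damage where kNN classification is most sensitive.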
To our knowledge, this paper is the first detailed study dedicated to examining the influence of hubness on kNN classification with uncertain data labels.
The paper is organized as follows: Section 2 summarizes the related work and the existing approaches for dealing with label noise. Consequences of hubness in intrinsically high-dimensional data are discussed in Section 3. Neighbor occurrence models for hubness-aware kNN classification of high-dimensional data are described in detail in Section 4, followed by examples that demonstrate their potential for learning under label noise. Section 5 introduces the concept of hubness-proportional random label noise and gives practical examples of the susceptibility of kNN methods to label noise under high data hubness. The data used in the experiments is described in Section 6, followed by experimental results and summaries in Section 7. In Section 8, the main contributions of the paper are summarized and several directions for future work are proposed.
Related work
Label noise often occurs in large-scale problems where labeling is crowdsourced to a large number of non-experts instead of having domain experts carefully label each data instance, for instance via Amazon's Mechanical Turk. In such cases, it has been shown to be beneficial to obtain multiple labels for each data point or for carefully selected subsets of data points [24], [42]. Evaluating the labeling accuracy of individual experts and non-experts can also be used in order to improve label
Hubness in intrinsically high-dimensional data
Hubness is a consequence of high intrinsic data dimensionality related to the degree distribution of the kNN graph [39]. Hub points arise as centers of influence, as they occur very frequently as nearest neighbors. In fact, the entire neighbor occurrence frequency distribution becomes skewed and most points become anti-hubs or orphans, i.e. they occur rarely or never as neighbors to other points. Hubs often exhibit a detrimental influence by inducing many label mismatches in kNN sets and they
Neighbor occurrence models
Section 4.1 describes the basic ideas behind the neighbor occurrence models in hubness-aware classifiers and Section 4.2 gives examples of how the use of neighbor occurrence models might improve classification performance under label noise.
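A minimal sketch of such a neighbor occurrence model, in the spirit of the fuzzy hubness-aware voting schemes cited above (function names, data layout and Laplace smoothing are illustrative, not the paper's exact formulation):

```python
from collections import defaultdict

def occurrence_profiles(train_knn, labels, n_classes):
    """N_{k,c}: for each training point, count how often it occurs
    among the k nearest neighbors of training points of class c."""
    prof = defaultdict(lambda: [0] * n_classes)
    for i, nbrs in enumerate(train_knn):
        for j in nbrs:
            prof[j][labels[i]] += 1
    return prof

def occurrence_vote(test_knn, prof, n_classes, lam=1.0):
    """Each neighbor votes with its smoothed class-conditional
    occurrence profile rather than with its (possibly noisy) label."""
    scores = [0.0] * n_classes
    for j in test_knn:
        counts = prof[j]
        total = sum(counts) + lam * n_classes
        for c in range(n_classes):
            scores[c] += (counts[c] + lam) / total
    return scores.index(max(scores))
```

The key point for noise robustness: a mislabeled hub still casts its votes according to where it actually occurs as a neighbor, so a single flipped hub label does not directly propagate to every kNN set it participates in.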
Hubness-proportional random label noise
In adversarial classification tasks like intrusion detection or spam filtering, malicious adversaries may manipulate data labels in order to affect the classification outcome. As hubs are the centers of influence in kNN classification, we postulate that most damage could potentially be done to kNN-based learning systems by targeting hub points specifically with label noise.
Uniform random label noise is, therefore, insufficient to properly estimate the pessimistic scenarios of potentially
Data
The experiments were performed on several data domains. The benchmark consists of quantized image representations, high-dimensional Gaussian mixtures, UCI datasets, as well as UCR time series datasets. Images and Gaussian data exhibited substantial hubness, while the selected UCI and UCR datasets exhibited low-to-medium hubness on average.
Image data used in the experiments consists of several high-hubness
Experiments
All experiments were run as 10-times 10-fold cross-validation. The corrected resampled t-test was used for statistical significance comparisons [8].
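The corrected resampled t-test inflates the variance term to account for the overlap between training sets across cross-validation runs; a sketch (variable names illustrative):

```python
import math

def corrected_resampled_t(diffs, n_train, n_test):
    """t-statistic for per-fold accuracy differences between two
    classifiers over repeated CV.  The usual 1/k variance factor is
    inflated by n_test/n_train (the Nadeau-Bengio correction),
    because overlapping training folds make the differences
    correlated and the naive t-test overconfident."""
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt((1.0 / k + n_test / n_train) * var)
```

For 10-times 10-fold cross-validation, k = 100 and n_test/n_train = 1/9, and the statistic is compared against a t-distribution with k - 1 degrees of freedom.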
Most experiments were performed in three main experimental setups: learning and classification with correct data labels; learning and classification under uniform random label noise; and learning and classification under hubness-proportional random label noise. The influence of varying the noise levels is discussed in the
Conclusions and future work
High data dimensionality poses significant challenges for k-nearest neighbor classification. We have examined the influence of hubness as an aspect of the curse of dimensionality [5] on the problem of kNN classification with label noise. The emergence of hubs induces a change in the distribution of influence that affects the susceptibility of k-nearest neighbor methods to mislabeled training examples. Mislabeled hub points can potentially induce severe misclassification.
In order to evaluate the
Acknowledgments
This work was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. The research was partially performed within the framework of the Hungarian Scientific Research Fund, OTKA Grant no. 111710.
References (66)
- et al., Learning kernel logistic regression in the presence of class label noise, Pattern Recognit. (2014)
- et al., Estimating mutual information for feature selection in the presence of label noise, Comput. Stat. Data Anal. (2014)
- et al., Using dynamic time warping for online temporal fusion in multisensor systems, Inf. Fusion: Special Iss. Distrib. Sensor Netw. (2008)
- Neighbor-weighted k-nearest neighbor for unbalanced text corpus, Expert Syst. Appl. (2005)
- et al., Class imbalance and the curse of minority hubs, Knowl. Based Syst. (2013)
- et al., Improving nearest neighbor rule with a simple adaptive distance measure, Pattern Recognit. Lett. (2007)
- et al., Dynamic time warping constraint learning for large margin nearest neighbor classification, Inf. Sci. (2011)
- T.A. Almeida, J.M.G. Hidalgo, A. Yamakami, Contributions to the study of sms spam filtering: new collection and...
- et al., Learning from noisy examples, Mach. Learn. (1988)
- J. Aucouturier, Ten experiments on the modelling of polyphonic timbre (Doctoral dissertation), University of Paris,...
- Improving timbre similarity: how high is the sky?, J. Negative Results Speech Audio Sci.
- Adaptive Control Processes: A Guided Tour
- Identifying mislabeled training data, J. Artif. Intell. Res.
- Investigating the Impact of Hubness on SVM Classifiers
- Identifying mislabeled training data with the aid of unlabeled data, Appl. Intell.
- Repeated labeling using multiple noisy labelers, Data Min. Knowl. Dis.
- A boosting approach to remove class label noise, Int. J. Hybrid Intell. Syst.
- A fuzzy k-nearest-neighbor algorithm, IEEE Trans. Syst. Man Cybern.
- Detecting noisy instances with the rule-based classification model, Intell. Data Anal.
Nenad Tomašev obtained his PhD degree from the Artificial Intelligence Laboratory at Jožef Stefan Institute in Ljubljana. He graduated with honors in 2008 from the Department of Mathematics and Informatics at the University of Novi Sad. His research focus is in the area of Machine Learning and Data Mining, as well as Stochastic Optimization and Artificial Life. For his PhD, he worked on exploring the role of hub-points in high-dimensional data analysis, including clustering, classification, anomaly detection, information retrieval, feature selection, metric learning and re-ranking. He received the Best Research Paper Runner-up Award at PAKDD 2011 for his work on hubness-based clustering and the Best Paper Award at MLDM 2011 for his work on fuzzy hubness-aware classification. He authored two software packages for hubness-aware machine learning, Hub Miner and Image Hub Explorer. He has actively participated as a teaching assistant in Petnica Science Center and Višnjan Summer School. He was previously employed as a software engineering intern at Google and DMS Group.
Krisztian Buza (born 1984) obtained his Diploma in Computer Science (Informatics Engineer) in 2007 from the Faculty of Electrical Engineering and Informatics of the Budapest University of Technology and Economics. Between 2007 and 2011 he was a research assistant and PhD student at the University of Hildesheim, where he obtained his PhD in 2011. Since 2011, he has given several courses related to data mining at the Budapest University of Technology and Economics. In the last 5 years, he co-authored more than 20 publications and participated in several research projects in cooperation with industrial partners such as Rolls Royce, Morgan Stanley and Capgemini. His work on time series classification was honored by the Best Paper Award of IEEE's renowned conference on Computational Science and Engineering (2010) and his joint work with Gabor Nagy was nominated for the Best Paper Award of the 12th Industrial Conference on Data Mining (Berlin, 2012). His main research interests are data mining and machine learning with special focus on hybrid models, time-series classification and their applications.