Hubness-aware kNN classification of high-dimensional data in presence of label noise
Introduction
Designing effective and robust supervised learning algorithms for classification in the presence of label noise is an important practical issue, as obtaining reliable data labels is often expensive or simply infeasible due to data size in large-scale systems [18].
Classification noise can be random, feature-dependent or adversarial. Label flip probabilities can be either uniform and symmetric or depend on particular classes and class pairs. The simple random classification noise (RCN) model was first introduced in [2] as a model of how non-adversarial noise might affect the data. Given a training set T of labeled examples and a noise rate η, T_η denotes the distribution corresponding to T corrupted with random classification noise at rate η. A draw from T_η is equivalent to a uniformly random draw from T in which the label y of the selected pair (x, y) is flipped with probability η.
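As a concrete illustration of the RCN model, a minimal sketch (the function name and defaults are illustrative, not from the paper):

```python
import random

def corrupt_labels(labels, eta, classes, rng=None):
    """Random classification noise (RCN): flip each label with
    probability eta to a uniformly chosen different class."""
    rng = rng or random.Random(0)
    return [rng.choice([c for c in classes if c != y])
            if rng.random() < eta else y
            for y in labels]
```

For binary labels this reduces to flipping the label with probability η; at η = 1/2 the labels carry no information, which is why RCN is usually studied for η < 1/2.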
The issue of unreliable and noisy labels can be approached in two ways: by trying to identify and correct or eliminate suspect data points, or by incorporating noise into the learning model. Neither approach is trivial, as it is not always easy to distinguish mislabeled examples from exceptions to general rules, i.e. atypical data points. When an instance lies far from its class interior and in the proximity of instances from other classes, it can sometimes be mistaken for a mislabeled point [43]. Yet atypical points sometimes hold valuable discriminative information, as they might help in defining proper class boundaries for classification. Additionally, many filtering approaches assume that all the data is available at the filtering stage and is not prohibitively large [66].
Instead of filtering or explicit noise source modeling, it is also possible to design learning techniques that exhibit implicit robustness to high rates of label noise. In this paper, we demonstrate that the recently proposed hubness-aware k-nearest neighbor classification methods [38], [53], [52], [49] can be used for robust classification of intrinsically high-dimensional data in the presence of label noise. This robustness is a consequence of the fact that, unlike in most kNN approaches, neighbor instances do not vote by their labels at classification time. Instead, their votes are determined by their past occurrences in the training data.
Hubness [39] is a ubiquitous property of intrinsically high-dimensional data. With increasing dimensionality, the degree distribution of the kNN graph becomes increasingly skewed and hubs emerge as central and influential points among the data. The kNN graph itself assumes a scale-free-like topology. This has multiple consequences for similarity-based learning methods, and for k-nearest neighbor methods in particular. It also changes how random labeling noise affects the learning process: errors in hub point labels can induce severe misclassification, while errors in orphan or regular points have little influence on the classification accuracy of kNN methods.
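The emergence of hubs can be observed directly by measuring the skewness of the k-occurrence distribution on synthetic data. The following sketch (pure Python; sample sizes and dimensionalities are illustrative) contrasts low- and high-dimensional Gaussian samples:

```python
import math
import random
from statistics import mean, pstdev

def k_occurrences(points, k=5):
    """Count how often each point appears among the k nearest
    neighbors of the other points (the N_k degree distribution)."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

def skewness(xs):
    """Standardized third moment of a sample."""
    m, s = mean(xs), pstdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)

rng = random.Random(1)
low  = [[rng.gauss(0, 1) for _ in range(3)]   for _ in range(200)]
high = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(200)]
# Skewness of N_k grows with intrinsic dimensionality: hubs emerge.
print(skewness(k_occurrences(low)), skewness(k_occurrences(high)))
```

In the high-dimensional sample the occurrence distribution is markedly right-skewed: a few points accumulate many occurrences while most become anti-hubs.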
In this paper, we introduce the concept of hubness-proportional random label noise as an adversarial noise model where the most influential points in the data are most likely to be corrupted. The probability of a label flip is set to be proportional to the neighbor occurrence frequency of the data point. Hubness-proportional random label noise models how a potentially successful malicious attempt can compromise the most relevant and influential neighbor points in order to disrupt kNN-based retrieval, recommendation or prediction systems.
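One way to instantiate this noise model (a sketch; the sampling scheme and names are illustrative) is to pick the points to corrupt by weighted sampling without replacement, with weights equal to the k-occurrence counts:

```python
import random

def hubness_proportional_flips(labels, occ_counts, n_flips, classes, rng=None):
    """Flip n_flips labels, choosing victims with probability
    proportional to their k-occurrence counts, so that hubs are the
    most likely targets.  Points that never occur as neighbors
    (count 0) are never selected, so n_flips must not exceed the
    number of points with nonzero counts."""
    rng = rng or random.Random(0)
    noisy = list(labels)
    total = float(sum(occ_counts))
    chosen = set()
    while len(chosen) < n_flips:
        # rejection sampling: a redraw happens if the point was
        # already picked, which preserves proportionality among
        # the remaining candidates
        r = rng.random() * total
        acc = 0.0
        for i, c in enumerate(occ_counts):
            acc += c
            if acc > r:
                chosen.add(i)
                break
    for i in chosen:
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy
```

Under this scheme a hub occurring a hundred times as a neighbor is a hundred times more likely to be mislabeled than a point occurring once, which concentrates the damage where kNN classification is most sensitive.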
To our knowledge, this paper is the first detailed study dedicated to examining the influence of hubness on kNN classification with uncertain data labels.
The paper is organized as follows: Section 2 summarizes the related work and the existing approaches for dealing with label noise. Consequences of hubness in intrinsically high-dimensional data are discussed in Section 3. Neighbor occurrence models for hubness-aware kNN classification of high-dimensional data are described in detail in Section 4, followed by examples that demonstrate their potential for learning under label noise. Section 5 introduces the concept of hubness-proportional random label noise and gives practical examples of the susceptibility of kNN methods to label noise under high data hubness. The data used in the experiments is described in Section 6, followed by experimental results and summaries in Section 7. In Section 8, the main contributions of the paper are summarized and several directions for future work are proposed.
Related work
Label noise often occurs in large-scale problems where labeling is crowdsourced to a large number of non-experts instead of having domain experts carefully label each data instance, for instance via Amazon's Mechanical Turk. In such cases, it has been shown to be beneficial to obtain multiple labels for each data point or for carefully selected subsets of data points [24], [42]. Evaluating the labeling accuracy of individual experts and non-experts can also be used in order to improve label
Hubness in intrinsically high-dimensional data
Hubness is a consequence of high intrinsic data dimensionality related to the degree distribution of the kNN graph [39]. Hub points arise as centers of influence, as they occur very frequently as nearest neighbors. In fact, the entire neighbor occurrence frequency distribution becomes skewed and most points become anti-hubs or orphans, i.e. they occur rarely or never as neighbors to other points. Hubs often exhibit a detrimental influence by inducing many label mismatches in kNN sets and they
Neighbor occurrence models
Section 4.1 describes the basic ideas behind the neighbor occurrence models in hubness-aware classifiers and Section 4.2 gives examples of how the use of neighbor occurrence models might improve classification performance under label noise.
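A minimal sketch of such a neighbor occurrence model, in the spirit of the fuzzy hubness-aware voting schemes cited above (function names, data layout and Laplace smoothing are illustrative, not the paper's exact formulation):

```python
from collections import defaultdict

def occurrence_profiles(train_knn, labels, n_classes):
    """N_{k,c}: for each training point, count how often it occurs
    among the k nearest neighbors of training points of class c."""
    prof = defaultdict(lambda: [0] * n_classes)
    for i, nbrs in enumerate(train_knn):
        for j in nbrs:
            prof[j][labels[i]] += 1
    return prof

def occurrence_vote(test_knn, prof, n_classes, lam=1.0):
    """Each neighbor votes with its smoothed class-conditional
    occurrence profile rather than with its (possibly noisy) label."""
    scores = [0.0] * n_classes
    for j in test_knn:
        counts = prof[j]
        total = sum(counts) + lam * n_classes
        for c in range(n_classes):
            scores[c] += (counts[c] + lam) / total
    return scores.index(max(scores))
```

The key point for noise robustness: a mislabeled hub still casts its votes according to where it actually occurs as a neighbor, so a single flipped hub label does not directly propagate to every kNN set it participates in.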
Hubness-proportional random label noise
In adversarial classification tasks like intrusion detection or spam filtering, malicious adversaries may manipulate data labels in order to affect the classification outcome. As hubs are the centers of influence in kNN classification, we postulate that most damage could potentially be done to kNN-based learning systems by targeting hub points specifically with label noise.
Uniform random label noise is, therefore, insufficient to properly estimate the pessimistic scenarios of potentially
Data
The experiments were performed on several data domains. The benchmark consists of quantized image representations, high-dimensional Gaussian mixtures, UCI datasets, as well as UCR time series datasets. Images and Gaussian data exhibited substantial hubness, while the selected UCI and UCR datasets exhibited low-to-medium hubness on average.
Image data used in the experiments consists of several high-hubness
Experiments
All experiments were run as 10-times 10-fold cross-validation. The corrected resampled t-test was used for statistical significance comparisons [8].
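The corrected resampled t-test inflates the variance term to account for the overlap between training sets across cross-validation runs; a sketch (variable names illustrative):

```python
import math

def corrected_resampled_t(diffs, n_train, n_test):
    """t-statistic for per-fold accuracy differences between two
    classifiers over repeated CV.  The usual 1/k variance factor is
    inflated by n_test/n_train (the Nadeau-Bengio correction),
    because overlapping training folds make the differences
    correlated and the naive t-test overconfident."""
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    return mean / math.sqrt((1.0 / k + n_test / n_train) * var)
```

For 10-times 10-fold cross-validation, k = 100 and n_test/n_train = 1/9, and the statistic is compared against a t-distribution with k - 1 degrees of freedom.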
Most experiments were performed in three main experimental setups: learning and classification with correct data labels; learning and classification under uniform random label noise; and learning and classification under hubness-proportional random label noise. The influence of varying the noise levels is discussed in the
Conclusions and future work
High data dimensionality poses significant challenges for k-nearest neighbor classification. We have examined the influence of hubness as an aspect of the curse of dimensionality [5] on the problem of kNN classification with label noise. The emergence of hubs induces a change in the distribution of influence that affects the susceptibility of k-nearest neighbor methods to mislabeled training examples. Mislabeled hub points can potentially induce severe misclassification.
In order to evaluate the
Acknowledgments
This work was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences. The research was partially performed within the framework of the Hungarian Scientific Research Fund, OTKA Grant no. 111710.
References (66)
- et al., Learning kernel logistic regression in the presence of class label noise, Pattern Recognit. (2014)
- et al., Estimating mutual information for feature selection in the presence of label noise, Comput. Stat. Data Anal. (2014)
- et al., Using dynamic time warping for online temporal fusion in multisensor systems, Inf. Fusion: Special Iss. Distrib. Sensor Netw. (2008)
- Neighbor-weighted k-nearest neighbor for unbalanced text corpus, Expert Syst. Appl. (2005)
- et al., Class imbalance and the curse of minority hubs, Knowl. Based Syst. (2013)
- et al., Improving nearest neighbor rule with a simple adaptive distance measure, Pattern Recognit. Lett. (2007)
- et al., Dynamic time warping constraint learning for large margin nearest neighbor classification, Inf. Sci. (2011)
- T.A. Almeida, J.M.G. Hidalgo, A. Yamakami, Contributions to the study of sms spam filtering: new collection and...
- et al., Learning from noisy examples, Mach. Learn. (1988)
- J. Aucouturier, Ten experiments on the modelling of polyphonic timbre (Doctoral dissertation), University of Paris,...
- Improving timbre similarity: how high is the sky?, J. Negative Results Speech Audio Sci.
- Adaptive Control Processes: A Guided Tour
- Identifying mislabeled training data, J. Artif. Intell. Res.
- Investigating the Impact of Hubness on SVM Classifiers
- Identifying mislabeled training data with the aid of unlabeled data, Appl. Intell.
- Repeated labeling using multiple noisy labelers, Data Min. Knowl. Dis.
- A boosting approach to remove class label noise, Int. J. Hybrid Intell. Syst.
- A fuzzy k-nearest-neighbor algorithm, IEEE Trans. Syst. Man Cybern.
- Detecting noisy instances with the rule-based classification model, Intell. Data Anal.
Nenad Tomašev obtained his PhD degree from the Artificial Intelligence Laboratory at Jožef Stefan Institute in Ljubljana. He graduated with honors in 2008 from the Department of Mathematics and Informatics at the University of Novi Sad. His research focus is in the area of Machine Learning and Data Mining, as well as Stochastic Optimization and Artificial Life. For his PhD, he worked on exploring the role of hub-points in high-dimensional data analysis, including clustering, classification, anomaly detection, information retrieval, feature selection, metric learning and re-ranking. He received the Best Research Paper Runner-up Award at PAKDD 2011 for his work on hubness-based clustering and the Best Paper Award at MLDM 2011 for his work on fuzzy hubness-aware classification. He authored two software packages for hubness-aware machine learning, Hub Miner and Image Hub Explorer. He has actively participated as a teaching assistant in Petnica Science Center and Višnjan Summer School. He was previously employed as a software engineering intern at Google and DMS Group.
Krisztian Buza (born 1984) obtained his Diploma in Computer Science (Informatics Engineer) in 2007 from the Faculty of Electrical Engineering and Informatics of the Budapest University of Technology and Economics. Between 2007 and 2011 he was a research assistant and PhD student at the University of Hildesheim, where he obtained his PhD in 2011. Since 2011, he has given several courses related to data mining at the Budapest University of Technology and Economics. In the last 5 years, he co-authored more than 20 publications and participated in several research projects in cooperation with industrial partners such as Rolls Royce, Morgan Stanley and Capgemini. His work on time series classification was honored by the Best Paper Award of IEEE's renowned conference on Computational Science and Engineering (2010) and his joint work with Gabor Nagy was nominated for the Best Paper Award of the 12th Industrial Conference on Data Mining (Berlin, 2012). His main research interests are data mining and machine learning with special focus on hybrid models, time-series classification and their applications.