Skip to main content
Top

01-09-2008 | Theoretical Advances

On the k-NN performance in a challenging scenario of imbalance and overlapping

Authors: V. García, R. A. Mollineda, J. S. Sánchez

Published in: Pattern Analysis and Applications | Issue 3-4/2008

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

A two-class data set is said to be imbalanced when one (minority) class is heavily under-represented with respect to the other (majority) class. In the presence of a significant overlapping, the task of learning from imbalanced data can be a very difficult problem. Additionally, if the overall imbalance ratio is different from local imbalance ratios in overlap regions, the task can become in a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning from such a complex scenario. This local model is compared to other machine learning algorithms, attending to how their behaviour depends on a number of data complexity features (global imbalance, size of overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are inferred.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the 5th international conference on knowledge discovery and data mining, pp 155–164 Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the 5th international conference on knowledge discovery and data mining, pp 155–164
2.
go back to reference Gordon DF, Perlis D (1989) Explicitly biased generalization. Comput Intell 5:67–81CrossRef Gordon DF, Perlis D (1989) Explicitly biased generalization. Comput Intell 5:67–81CrossRef
3.
go back to reference Pazzani M, Merz C, Murphy P, Ali K, Hume T, Brunk C (1994) Reducing misclassification costs. In: Proceedings 11th international conference on machine learning, pp 217–225 Pazzani M, Merz C, Murphy P, Ali K, Hume T, Brunk C (1994) Reducing misclassification costs. In: Proceedings 11th international conference on machine learning, pp 217–225
4.
go back to reference Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH
5.
go back to reference Kubat M, Matwin S (1997) Adressing the curse of imbalanced training sets: one-sided selection. Proceedings of the 14th international conference on machine learning, pp 179–186 Kubat M, Matwin S (1997) Adressing the curse of imbalanced training sets: one-sided selection. Proceedings of the 14th international conference on machine learning, pp 179–186
6.
go back to reference Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36:849–851CrossRef Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36:849–851CrossRef
7.
go back to reference Fawcett T, Provost F (1996) Adaptive fraud detection. Data Mining Knowl Discov 1:291–316CrossRef Fawcett T, Provost F (1996) Adaptive fraud detection. Data Mining Knowl Discov 1:291–316CrossRef
8.
go back to reference Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal J 6(5):429–450MATH Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal J 6(5):429–450MATH
9.
go back to reference Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. SIGKDD Explor 6:40–49CrossRef Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. SIGKDD Explor 6:40–49CrossRef
10.
go back to reference Weiss GM (2003) The effect of small disjuncts and class distribution on decision tree learning. PhD thesis, Rutgers University Weiss GM (2003) The effect of small disjuncts and class distribution on decision tree learning. PhD thesis, Rutgers University
11.
go back to reference Prati RC, Batista GE, Monard MC (2004) Class imbalance versus class overlapping: an analysis of a learning system behavior. In: Proceedings of the 3rd Mexican international conference on artificial intelligence, pp 312–321 Prati RC, Batista GE, Monard MC (2004) Class imbalance versus class overlapping: an analysis of a learning system behavior. In: Proceedings of the 3rd Mexican international conference on artificial intelligence, pp 312–321
12.
go back to reference Orriols A, Bernardó E (2005) The class imbalance problem in learning classifier systems: a preliminary study. In: Proceedings of conference on genetic and evolutionary computation, pp 74–78 Orriols A, Bernardó E (2005) The class imbalance problem in learning classifier systems: a preliminary study. In: Proceedings of conference on genetic and evolutionary computation, pp 74–78
13.
go back to reference Visa S, Ralescu A (2003) Learning from imbalanced and overlapped data using fuzzy sets. In: Proceedings of ICML-2003 workshop: learning with imbalanced data sets II, pp 97–104 Visa S, Ralescu A (2003) Learning from imbalanced and overlapped data using fuzzy sets. In: Proceedings of ICML-2003 workshop: learning with imbalanced data sets II, pp 97–104
14.
go back to reference Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27MATHCrossRef Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27MATHCrossRef
15.
go back to reference Dasarathy BV (1991) Nearest neighbor norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamos Dasarathy BV (1991) Nearest neighbor norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamos
16.
go back to reference Devijver PA, Kittler J (1992) Pattern recognition: a statistical approach. Prentice Hall, Englewood Cliffs Devijver PA, Kittler J (1992) Pattern recognition: a statistical approach. Prentice Hall, Englewood Cliffs
17.
go back to reference Hand DJ, Vinciotti V (2003) Choosing k for two-class nearest neighbour classifiers with unbalanced classes. Pattern Recognit Lett 24:1555–1562MATHCrossRef Hand DJ, Vinciotti V (2003) Choosing k for two-class nearest neighbour classifiers with unbalanced classes. Pattern Recognit Lett 24:1555–1562MATHCrossRef
18.
go back to reference Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets II, pp 42–48 Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets II, pp 42–48
19.
go back to reference Duda RO, Hart PE, Stork DG (2001) Pattern classification and scene analysis. Wiley, New York Duda RO, Hart PE, Stork DG (2001) Pattern classification and scene analysis. Wiley, New York
20.
go back to reference Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, USA Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, USA
21.
go back to reference Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo
22.
go back to reference Buhmann M, Albowitz M (2003) Radial basis functions: theory and implementations. Cambridge University Press, USAMATH Buhmann M, Albowitz M (2003) Radial basis functions: theory and implementations. Cambridge University Press, USAMATH
23.
24.
25.
go back to reference Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New YorkMATH Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New YorkMATH
26.
go back to reference Kubat M, Chen WK (1998) Weighted projection in nearest-neighbor classifiers. In: Proceedings of 1st southern symposium on computing, pp 27–34 Kubat M, Chen WK (1998) Weighted projection in nearest-neighbor classifiers. In: Proceedings of 1st southern symposium on computing, pp 27–34
27.
go back to reference Friedman JH (1997) On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining Knowl Discov 1:55–77CrossRef Friedman JH (1997) On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining Knowl Discov 1:55–77CrossRef
28.
go back to reference Sánchez JS, Barandela R, Marqués AI, Alejo R, Badenas J (2003) Analysis of new techniques to obtain quality training sets. Pattern Recognit Lett 24:1015–1022CrossRef Sánchez JS, Barandela R, Marqués AI, Alejo R, Badenas J (2003) Analysis of new techniques to obtain quality training sets. Pattern Recognit Lett 24:1015–1022CrossRef
29.
go back to reference Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is "nearest neighbor" meaningful?. In: Proceedings of 7th international conference on database theory, pp 217–235 Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is "nearest neighbor" meaningful?. In: Proceedings of 7th international conference on database theory, pp 217–235
30.
go back to reference Provost F, Fawcett T (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of 3rd international conference on knowledge discovery and data mining, pp 43–48 Provost F, Fawcett T (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of 3rd international conference on knowledge discovery and data mining, pp 43–48
31.
go back to reference Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Engng 17:299–310CrossRef Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Engng 17:299–310CrossRef
32.
go back to reference Daskalaki S, Kopanas I, Avouris N (2006) Evaluation of classifiers for an uneven class distribution problem. Appl Artif Intell 20:381–417CrossRef Daskalaki S, Kopanas I, Avouris N (2006) Evaluation of classifiers for an uneven class distribution problem. Appl Artif Intell 20:381–417CrossRef
33.
go back to reference Landgrebe TCW, Paclick P, Duin RPW (2006) Precision-recall operating characteristic (P-ROC) curves in imprecise environments. In: Proceedings of 18th international conference on pattern recognition, pp 123–127 Landgrebe TCW, Paclick P, Duin RPW (2006) Precision-recall operating characteristic (P-ROC) curves in imprecise environments. In: Proceedings of 18th international conference on pattern recognition, pp 123–127
34.
go back to reference Fawcett T (2006) ROC graphs with instance-varying costs. Pattern Recognit Lett 27:882–891CrossRef Fawcett T (2006) ROC graphs with instance-varying costs. Pattern Recognit Lett 27:882–891CrossRef
35.
go back to reference Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, USAMATH Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, USAMATH
Metadata
Title
On the k-NN performance in a challenging scenario of imbalance and overlapping
Authors
V. García
R. A. Mollineda
J. S. Sánchez
Publication date
01-09-2008
Publisher
Springer-Verlag
Published in
Pattern Analysis and Applications / Issue 3-4/2008
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-007-0087-5

Premium Partner