Skip to main content
Erschienen in: Pattern Analysis and Applications 3-4/2008

01.09.2008 | Theoretical Advances

On the k-NN performance in a challenging scenario of imbalance and overlapping

verfasst von: V. García, R. A. Mollineda, J. S. Sánchez

Erschienen in: Pattern Analysis and Applications | Ausgabe 3-4/2008

Einloggen

Aktivieren Sie unsere intelligente Suche um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

A two-class data set is said to be imbalanced when one (minority) class is heavily under-represented with respect to the other (majority) class. In the presence of a significant overlapping, the task of learning from imbalanced data can be a very difficult problem. Additionally, if the overall imbalance ratio is different from local imbalance ratios in overlap regions, the task can become in a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning from such a complex scenario. This local model is compared to other machine learning algorithms, attending to how their behaviour depends on a number of data complexity features (global imbalance, size of overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are inferred.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the 5th international conference on knowledge discovery and data mining, pp 155–164 Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the 5th international conference on knowledge discovery and data mining, pp 155–164
2.
Zurück zum Zitat Gordon DF, Perlis D (1989) Explicitly biased generalization. Comput Intell 5:67–81CrossRef Gordon DF, Perlis D (1989) Explicitly biased generalization. Comput Intell 5:67–81CrossRef
3.
Zurück zum Zitat Pazzani M, Merz C, Murphy P, Ali K, Hume T, Brunk C (1994) Reducing misclassification costs. In: Proceedings 11th international conference on machine learning, pp 217–225 Pazzani M, Merz C, Murphy P, Ali K, Hume T, Brunk C (1994) Reducing misclassification costs. In: Proceedings 11th international conference on machine learning, pp 217–225
4.
Zurück zum Zitat Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357MATH
5.
Zurück zum Zitat Kubat M, Matwin S (1997) Adressing the curse of imbalanced training sets: one-sided selection. Proceedings of the 14th international conference on machine learning, pp 179–186 Kubat M, Matwin S (1997) Adressing the curse of imbalanced training sets: one-sided selection. Proceedings of the 14th international conference on machine learning, pp 179–186
6.
Zurück zum Zitat Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36:849–851CrossRef Barandela R, Sánchez JS, García V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit 36:849–851CrossRef
7.
Zurück zum Zitat Fawcett T, Provost F (1996) Adaptive fraud detection. Data Mining Knowl Discov 1:291–316CrossRef Fawcett T, Provost F (1996) Adaptive fraud detection. Data Mining Knowl Discov 1:291–316CrossRef
8.
Zurück zum Zitat Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal J 6(5):429–450MATH Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal J 6(5):429–450MATH
9.
Zurück zum Zitat Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. SIGKDD Explor 6:40–49CrossRef Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. SIGKDD Explor 6:40–49CrossRef
10.
Zurück zum Zitat Weiss GM (2003) The effect of small disjuncts and class distribution on decision tree learning. PhD thesis, Rutgers University Weiss GM (2003) The effect of small disjuncts and class distribution on decision tree learning. PhD thesis, Rutgers University
11.
Zurück zum Zitat Prati RC, Batista GE, Monard MC (2004) Class imbalance versus class overlapping: an analysis of a learning system behavior. In: Proceedings of the 3rd Mexican international conference on artificial intelligence, pp 312–321 Prati RC, Batista GE, Monard MC (2004) Class imbalance versus class overlapping: an analysis of a learning system behavior. In: Proceedings of the 3rd Mexican international conference on artificial intelligence, pp 312–321
12.
Zurück zum Zitat Orriols A, Bernardó E (2005) The class imbalance problem in learning classifier systems: a preliminary study. In: Proceedings of conference on genetic and evolutionary computation, pp 74–78 Orriols A, Bernardó E (2005) The class imbalance problem in learning classifier systems: a preliminary study. In: Proceedings of conference on genetic and evolutionary computation, pp 74–78
13.
Zurück zum Zitat Visa S, Ralescu A (2003) Learning from imbalanced and overlapped data using fuzzy sets. In: Proceedings of ICML-2003 workshop: learning with imbalanced data sets II, pp 97–104 Visa S, Ralescu A (2003) Learning from imbalanced and overlapped data using fuzzy sets. In: Proceedings of ICML-2003 workshop: learning with imbalanced data sets II, pp 97–104
14.
Zurück zum Zitat Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27MATHCrossRef Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27MATHCrossRef
15.
Zurück zum Zitat Dasarathy BV (1991) Nearest neighbor norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamos Dasarathy BV (1991) Nearest neighbor norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamos
16.
Zurück zum Zitat Devijver PA, Kittler J (1992) Pattern recognition: a statistical approach. Prentice Hall, Englewood Cliffs Devijver PA, Kittler J (1992) Pattern recognition: a statistical approach. Prentice Hall, Englewood Cliffs
17.
Zurück zum Zitat Hand DJ, Vinciotti V (2003) Choosing k for two-class nearest neighbour classifiers with unbalanced classes. Pattern Recognit Lett 24:1555–1562MATHCrossRef Hand DJ, Vinciotti V (2003) Choosing k for two-class nearest neighbour classifiers with unbalanced classes. Pattern Recognit Lett 24:1555–1562MATHCrossRef
18.
Zurück zum Zitat Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets II, pp 42–48 Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets II, pp 42–48
19.
Zurück zum Zitat Duda RO, Hart PE, Stork DG (2001) Pattern classification and scene analysis. Wiley, New York Duda RO, Hart PE, Stork DG (2001) Pattern classification and scene analysis. Wiley, New York
20.
Zurück zum Zitat Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, USA Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, USA
21.
Zurück zum Zitat Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Mateo
22.
Zurück zum Zitat Buhmann M, Albowitz M (2003) Radial basis functions: theory and implementations. Cambridge University Press, USAMATH Buhmann M, Albowitz M (2003) Radial basis functions: theory and implementations. Cambridge University Press, USAMATH
23.
Zurück zum Zitat Cover TM (1968) Estimation by the nearest neighbor rule. IEEE Trans Inf Theory 14:50–55MATHCrossRef Cover TM (1968) Estimation by the nearest neighbor rule. IEEE Trans Inf Theory 14:50–55MATHCrossRef
24.
Zurück zum Zitat Okamoto S, Yugami N (2003) Effects of domain characteristics on instance-based learning algorithms. Theor Comput Sci 298:207–233MATHCrossRefMathSciNet Okamoto S, Yugami N (2003) Effects of domain characteristics on instance-based learning algorithms. Theor Comput Sci 298:207–233MATHCrossRefMathSciNet
25.
Zurück zum Zitat Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New YorkMATH Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New YorkMATH
26.
Zurück zum Zitat Kubat M, Chen WK (1998) Weighted projection in nearest-neighbor classifiers. In: Proceedings of 1st southern symposium on computing, pp 27–34 Kubat M, Chen WK (1998) Weighted projection in nearest-neighbor classifiers. In: Proceedings of 1st southern symposium on computing, pp 27–34
27.
Zurück zum Zitat Friedman JH (1997) On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining Knowl Discov 1:55–77CrossRef Friedman JH (1997) On bias, variance, 0/1-loss, and the curse of dimensionality. Data Mining Knowl Discov 1:55–77CrossRef
28.
Zurück zum Zitat Sánchez JS, Barandela R, Marqués AI, Alejo R, Badenas J (2003) Analysis of new techniques to obtain quality training sets. Pattern Recognit Lett 24:1015–1022CrossRef Sánchez JS, Barandela R, Marqués AI, Alejo R, Badenas J (2003) Analysis of new techniques to obtain quality training sets. Pattern Recognit Lett 24:1015–1022CrossRef
29.
Zurück zum Zitat Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is "nearest neighbor" meaningful?. In: Proceedings of 7th international conference on database theory, pp 217–235 Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is "nearest neighbor" meaningful?. In: Proceedings of 7th international conference on database theory, pp 217–235
30.
Zurück zum Zitat Provost F, Fawcett T (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of 3rd international conference on knowledge discovery and data mining, pp 43–48 Provost F, Fawcett T (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: Proceedings of 3rd international conference on knowledge discovery and data mining, pp 43–48
31.
Zurück zum Zitat Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Engng 17:299–310CrossRef Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Engng 17:299–310CrossRef
32.
Zurück zum Zitat Daskalaki S, Kopanas I, Avouris N (2006) Evaluation of classifiers for an uneven class distribution problem. Appl Artif Intell 20:381–417CrossRef Daskalaki S, Kopanas I, Avouris N (2006) Evaluation of classifiers for an uneven class distribution problem. Appl Artif Intell 20:381–417CrossRef
33.
Zurück zum Zitat Landgrebe TCW, Paclick P, Duin RPW (2006) Precision-recall operating characteristic (P-ROC) curves in imprecise environments. In: Proceedings of 18th international conference on pattern recognition, pp 123–127 Landgrebe TCW, Paclick P, Duin RPW (2006) Precision-recall operating characteristic (P-ROC) curves in imprecise environments. In: Proceedings of 18th international conference on pattern recognition, pp 123–127
34.
Zurück zum Zitat Fawcett T (2006) ROC graphs with instance-varying costs. Pattern Recognit Lett 27:882–891CrossRef Fawcett T (2006) ROC graphs with instance-varying costs. Pattern Recognit Lett 27:882–891CrossRef
35.
Zurück zum Zitat Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, USAMATH Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, USAMATH
Metadaten
Titel
On the k-NN performance in a challenging scenario of imbalance and overlapping
verfasst von
V. García
R. A. Mollineda
J. S. Sánchez
Publikationsdatum
01.09.2008
Verlag
Springer-Verlag
Erschienen in
Pattern Analysis and Applications / Ausgabe 3-4/2008
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-007-0087-5

Weitere Artikel der Ausgabe 3-4/2008

Pattern Analysis and Applications 3-4/2008 Zur Ausgabe