Published in: Knowledge and Information Systems 3/2017

27.05.2016 | Regular Paper

DBMUTE: density-based majority under-sampling technique

Authors: Chumphol Bunkhumpornpat, Krung Sinapiromsaran

Abstract

Class imbalance is a challenging problem in which classifiers perform unsatisfactorily on the minority class. A trivial classifier is biased against minority instances because they make up only a tiny fraction of the data. In this paper, our density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph. A short path implies dense, highly overlapping minority instances, whereas a long path indicates sparse instances. We propose a new under-sampling algorithm that eliminates majority instances with short distances, because these instances are insignificant and obscure the classification boundary in the overlapping region. The results show improved minority-class prediction for various classifiers on different UCI datasets.
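
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm. It rests on several assumptions the abstract does not specify: the cluster graph is taken to be an eps-neighborhood graph over all instances, the pseudo-centroid is taken to be the minority instance closest to the minority mean, and there is a single minority cluster. The names `dijkstra`, `undersample_majority`, `eps`, and `threshold` are hypothetical.

```python
import heapq
import math


def dijkstra(points, source, eps):
    """Shortest-path distances from `source` in an eps-neighborhood graph.

    Assumption: two points are connected when their Euclidean distance
    is at most `eps`; the edge weight is that distance.
    """
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in range(n):
            if v == u:
                continue
            w = math.dist(points[u], points[v])
            if w <= eps and d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


def undersample_majority(minority, majority, eps, threshold):
    """Drop majority instances whose path distance to the pseudo-centroid
    is at most `threshold`; unreachable instances (infinite distance) stay.
    """
    dim = len(minority[0])
    mean = [sum(p[k] for p in minority) / len(minority) for k in range(dim)]
    # Assumption: the pseudo-centroid is the minority instance nearest the mean.
    centroid = min(range(len(minority)),
                   key=lambda i: math.dist(minority[i], mean))
    points = list(minority) + list(majority)
    dist = dijkstra(points, centroid, eps)
    offset = len(minority)
    return [m for j, m in enumerate(majority) if dist[offset + j] > threshold]


# Toy data: one majority point sits inside the minority cluster.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (6.0, 5.0)]
kept = undersample_majority(minority, majority, eps=1.5, threshold=2.0)
print(kept)
```

In this toy run the majority point at (0.5, 0.5) lies a short path from the pseudo-centroid and is removed, while the two distant majority points are unreachable in the graph and are kept.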

Metadata
Title
DBMUTE: density-based majority under-sampling technique
Authors
Chumphol Bunkhumpornpat
Krung Sinapiromsaran
Publication date
27.05.2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-016-0957-5
