Knowledge and Information Systems

27 May 2016 | Regular Paper

DBMUTE: density-based majority under-sampling technique

Authors: Chumphol Bunkhumpornpat, Krung Sinapiromsaran

Published in: Knowledge and Information Systems | Issue 3/2017

Abstract

Class imbalance is a challenging problem that leads to unsatisfactory classification performance on the minority class. A trivial classifier is biased against minority instances because they make up only a tiny fraction of the data. In this paper, our density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph. A short path implies that the majority instance lies among dense, highly overlapping minority instances. In contrast, a long path indicates a sparsity of instances. A new under-sampling algorithm is proposed to eliminate majority instances with low distances because these instances are insignificant and obscure the classification boundary in the overlapping region. The results show predictive improvements on the minority class for various classifiers on different UCI datasets.
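
The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the cluster graph is built or how the distance cut-off is chosen. The sketch assumes a single minority cluster, an eps-neighbourhood graph over all instances with Euclidean edge weights, and a pseudo-centroid taken as the minority instance closest to the minority mean. The names `shortest_path_lengths`, `undersample`, `eps`, and `threshold` are hypothetical.

```python
import heapq
import math


def shortest_path_lengths(points, source, eps):
    """Dijkstra on an assumed eps-neighbourhood graph.

    Two instances are linked when their Euclidean distance is at most eps;
    the edge weight is that distance. Unreachable instances keep math.inf.
    """
    n = len(points)
    dist = [math.inf] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in range(n):
            if v == u:
                continue
            w = math.dist(points[u], points[v])
            if w <= eps and d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist


def undersample(points, labels, eps, threshold, minority=1):
    """Return the indices kept after dropping majority instances whose
    shortest-path distance to the minority pseudo-centroid is <= threshold.
    """
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    dims = len(points[0])
    mean = [
        sum(points[i][k] for i in minority_idx) / len(minority_idx)
        for k in range(dims)
    ]
    # Assumption: the pseudo-centroid is the minority instance nearest the mean.
    pseudo = min(minority_idx, key=lambda i: math.dist(points[i], mean))
    dist = shortest_path_lengths(points, pseudo, eps)
    return [
        i for i, y in enumerate(labels)
        if y == minority or dist[i] > threshold
    ]


# Toy data: three minority instances on a line, one majority instance
# sitting inside them, and three majority instances further away.
points = [(0, 0), (1, 0), (2, 0), (1, 0.5), (3, 0), (4, 0), (10, 10)]
labels = [1, 1, 1, 0, 0, 0, 0]
print(undersample(points, labels, eps=1.2, threshold=1.0))
# prints [0, 1, 2, 4, 5, 6]
```

In this toy data, only the majority instance at (1, 0.5) is removed, because it is the only one within a path length of 1.0 of the pseudo-centroid. The isolated instance at (10, 10) is unreachable in the graph and is therefore kept.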

Title: DBMUTE: density-based majority under-sampling technique
Authors: Chumphol Bunkhumpornpat, Krung Sinapiromsaran
Publication date: 27 May 2016
Publisher: Springer London
Published in: Knowledge and Information Systems, Issue 3/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-016-0957-5
