Soft Computing

24.10.2014 | Methodologies and Application

ur-CAIM: improved CAIM discretization for unbalanced and balanced data

Published in: Soft Computing | Issue 1/2016

Abstract

Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (class-attribute interdependence maximization) is a discretization algorithm for data whose classes are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the class distribution of the data, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM's. The algorithm is parameter-free by design and self-adapts to the problem complexity and the class distribution of the data. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, and especially on unbalanced data, which is a significant advantage.
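To make the baseline concrete, the following is a minimal sketch of the original CAIM criterion as defined by Kurgan and Cios (2004), not of ur-CAIM itself, whose criterion is not given in this abstract. For a scheme with n intervals, the criterion averages max_r^2 / M_r over the intervals, where max_r is the largest class count in interval r and M_r is the total number of values in that interval. The function names `quanta_matrix` and `caim`, the handling of empty intervals (they contribute zero), and the use of open-ended outer intervals are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def quanta_matrix(values, labels, cut_points):
    """Count, per interval, how many values of each class fall into it.

    Assumption: intervals are (-inf, c1], (c1, c2], ..., (ck, +inf)
    for the sorted inner cut points c1 < ... < ck.
    """
    cuts = sorted(cut_points)
    counts = [Counter() for _ in range(len(cuts) + 1)]
    for value, label in zip(values, labels):
        # bisect_left returns the index of the first cut point >= value,
        # which is the index of the interval containing the value.
        counts[bisect_left(cuts, value)][label] += 1
    return counts


def caim(values, labels, cut_points):
    """CAIM criterion: mean over intervals of max_r**2 / M_r."""
    intervals = quanta_matrix(values, labels, cut_points)
    total = 0.0
    for class_counts in intervals:
        interval_size = sum(class_counts.values())
        if interval_size:  # assumption: an empty interval contributes 0
            total += max(class_counts.values()) ** 2 / interval_size
    return total / len(intervals)


# Hypothetical toy data: two classes that separate cleanly at 3.5.
values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]
print(caim(values, labels, [3.5]))  # pure intervals score higher
print(caim(values, labels, [2.5]))  # a mixed interval lowers the score
```

A higher value indicates intervals that are each dominated by a single class. Because the criterion rewards only the majority class in each interval, it gives a sense of why class-distribution-aware refinements matter for unbalanced data.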

The data set descriptions and their partitions, the ur-CAIM source code and WEKA plugin, and the experimental settings and results for all data sets and algorithms are publicly available, to facilitate replication of the experiments and future comparisons, at the website: http://www.uco.es/grupos/kdis/wiki/ur-CAIM.

Title: ur-CAIM: improved CAIM discretization for unbalanced and balanced data
Publication date: 24.10.2014
Published in: Soft Computing / Issue 1/2016
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI: https://doi.org/10.1007/s00500-014-1488-1
