Skip to main content
Erschienen in: Soft Computing 1/2016

24.10.2014 | Methodologies and Application

ur-CAIM: improved CAIM discretization for unbalanced and balanced data

Erschienen in: Soft Computing | Ausgabe 1/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Supervised discretization is one of basic data preprocessing techniques used in data mining. CAIM (class-attribute interdependence maximization) is a discretization algorithm of data for which the classes are known. However, new arising challenges such as the presence of unbalanced data sets, call for new algorithms capable of handling them, in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the data classes distribution, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM’s. The algorithm has been designed free-parameter and it self-adapts to the problem complexity and the data class distribution. The ur-CAIM was compared with 9 well-known discretization methods on 28 balanced, and 70 unbalanced data sets. The results obtained were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data but especially on unbalanced data, which is its significant advantage.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
1
The data sets description along with their partitions, the ur-CAIM source code and WEKA plugin, the experimental settings and results for all data sets and algorithms are fully described and publicly available to facilitate the replicability of the experiments and future comparisons at the website: http://​www.​uco.​es/​grupos/​kdis/​wiki/​ur-CAIM.
 
Literatur
Zurück zum Zitat Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Analysis framework. J Mult Valued Logic Soft Comput 17:255–287 Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Analysis framework. J Mult Valued Logic Soft Comput 17:255–287
Zurück zum Zitat Alcalá-Fdez J, Sánchez L, García S, del Jesus M, Ventura S, Garrell J, Otero J, Romero C, Bacardit J, Rivas V, Fernández J, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13:307–318CrossRef Alcalá-Fdez J, Sánchez L, García S, del Jesus M, Ventura S, Garrell J, Otero J, Romero C, Bacardit J, Rivas V, Fernández J, Herrera F (2009) KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13:307–318CrossRef
Zurück zum Zitat Ben-David A (2008a) About the relationship between ROC curves and Cohen’s kappa. Eng Appl Artif Intell 21(6):874–882 Ben-David A (2008a) About the relationship between ROC curves and Cohen’s kappa. Eng Appl Artif Intell 21(6):874–882
Zurück zum Zitat Ben-David A (2008b) Comparison of classification accuracy using Cohen’s weighted kappa. Expert Syst Appl 34(2):825–832 Ben-David A (2008b) Comparison of classification accuracy using Cohen’s weighted kappa. Expert Syst Appl 34(2):825–832
Zurück zum Zitat Boullé M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1):131–165CrossRef Boullé M (2006) MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 65(1):131–165CrossRef
Zurück zum Zitat Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159CrossRef Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159CrossRef
Zurück zum Zitat Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Proceedings of machine learning, EWSL91, Lecture notes in computer science, vol 482. pp 164–178 Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Proceedings of machine learning, EWSL91, Lecture notes in computer science, vol 482. pp 164–178
Zurück zum Zitat Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27CrossRef Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:27:1–27:27CrossRef
Zurück zum Zitat Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic Minority Over-sampling TEchnique. Artif Intell Res 16:321–357MATH Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: Synthetic Minority Over-sampling TEchnique. Artif Intell Res 16:321–357MATH
Zurück zum Zitat Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. Int J Approx Reason 15:319–331MATHCrossRef Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. Int J Approx Reason 15:319–331MATHCrossRef
Zurück zum Zitat Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA (2007) Data mining: a knowledge discovery approach. Springer, New York Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA (2007) Data mining: a knowledge discovery approach. Springer, New York
Zurück zum Zitat Cohen WW (1995) Fast effective rule induction. In: Proceedings of the 12th international conference on machine learning, pp 115–123 Cohen WW (1995) Fast effective rule induction. In: Proceedings of the 12th international conference on machine learning, pp 115–123
Zurück zum Zitat Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27MATHCrossRef Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27MATHCrossRef
Zurück zum Zitat Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet
Zurück zum Zitat Derrac J, García S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evolut Comput 1(1):3–18CrossRef Derrac J, García S, Molina D, Herrera F (2011) A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evolut Comput 1(1):3–18CrossRef
Zurück zum Zitat Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th international conference machine learning, pp 194–202 Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th international conference machine learning, pp 194–202
Zurück zum Zitat Elomaa T, Rousu J (1999) General and efficient multisplitting of numerical attributes. Mach Learn 36(3):201–244MATHCrossRef Elomaa T, Rousu J (1999) General and efficient multisplitting of numerical attributes. Mach Learn 36(3):201–244MATHCrossRef
Zurück zum Zitat Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8:87–102MATH Fayyad U, Irani K (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8:87–102MATH
Zurück zum Zitat Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th international joint conference on uncertainly in artificial intelligence, pp 1022–1029 Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th international joint conference on uncertainly in artificial intelligence, pp 1022–1029
Zurück zum Zitat Fernández A, del Jesus MJ, Herrera F (2010) On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets. Inf Sci 180(8):1268–1291CrossRef Fernández A, del Jesus MJ, Herrera F (2010) On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets. Inf Sci 180(8):1268–1291CrossRef
Zurück zum Zitat Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In: Proceedings of the 15th international conference on machine learning, pp 144–151 Frank E, Witten IH (1998) Generating accurate rule sets without global optimization. In: Proceedings of the 15th international conference on machine learning, pp 144–151
Zurück zum Zitat Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning, pp 148–156 Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning, pp 148–156
Zurück zum Zitat Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701CrossRef Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701CrossRef
Zurück zum Zitat Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid- based approaches. IEEE Trans Syst Man Cybern Part C Appl Revi 42(4):463–484 Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid- based approaches. IEEE Trans Syst Man Cybern Part C Appl Revi 42(4):463–484
Zurück zum Zitat García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064CrossRef García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064CrossRef
Zurück zum Zitat García S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694MATH García S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694MATH
Zurück zum Zitat García S, Luengo J, Saez J, Lopez V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750CrossRef García S, Luengo J, Saez J, Lopez V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750CrossRef
Zurück zum Zitat García V, Sánchez JS, Mollineda RA (2012) On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl Based Syst 25(1):13–21CrossRef García V, Sánchez JS, Mollineda RA (2012) On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl Based Syst 25(1):13–21CrossRef
Zurück zum Zitat Gonzalez-Abril L, Cuberos FJ, Velasco F, Ortega JA (2009) Ameva: an autonomous discretization algorithm. Expert Syst Appl 36:5327–5332CrossRef Gonzalez-Abril L, Cuberos FJ, Velasco F, Ortega JA (2009) Ameva: an autonomous discretization algorithm. Expert Syst Appl 36:5327–5332CrossRef
Zurück zum Zitat Grzymala-Busse JW (2009) A multiple scanning strategy for entropy based discretization. In: Proceedings of foundations of intelligent systems, Lecture notes in computer science, vol 5722. pp 25–34 Grzymala-Busse JW (2009) A multiple scanning strategy for entropy based discretization. In: Proceedings of foundations of intelligent systems, Lecture notes in computer science, vol 5722. pp 25–34
Zurück zum Zitat Hall M, Frank E, Holmes G, Pfahringer B, Reutemannr P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11:10–18CrossRef Hall M, Frank E, Holmes G, Pfahringer B, Reutemannr P, Witten IH (2009) The WEKA data mining software: an update. SIGKDD Explor 11:10–18CrossRef
Zurück zum Zitat He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284CrossRef He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284CrossRef
Zurück zum Zitat Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70MATHMathSciNet Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70MATHMathSciNet
Zurück zum Zitat Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310CrossRef Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310CrossRef
Zurück zum Zitat Huang W (1997) Discretization of continuous attributes for inductive machine learning. University of Toledo Huang W (1997) Discretization of continuous attributes for inductive machine learning. University of Toledo
Zurück zum Zitat Janssens D, Brijs T, Vanhoof K, Wets G (2006) Evaluating the performance of cost-based discretization versus entropy- and error-based discretization. Comput Op Res 33(11):3107–3123MATHCrossRef Janssens D, Brijs T, Vanhoof K, Wets G (2006) Evaluating the performance of cost-based discretization versus entropy- and error-based discretization. Comput Op Res 33(11):3107–3123MATHCrossRef
Zurück zum Zitat John G, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence, pp 338–345 John G, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence, pp 338–345
Zurück zum Zitat Kaizhu H, Haiqin Y, Irwinng K, Lyu MR (2006) Imbalanced learning with a biased minimax probability machine. IEEE Trans Syst Man Cybern Part B Cybernetics 36(4):913–923CrossRef Kaizhu H, Haiqin Y, Irwinng K, Lyu MR (2006) Imbalanced learning with a biased minimax probability machine. IEEE Trans Syst Man Cybern Part B Cybernetics 36(4):913–923CrossRef
Zurück zum Zitat Kerber R (1992) ChiMerge: discretization of numeric attributes. In: Proceedings of the 10th national conference on artificial intelligence, pp 123–128 Kerber R (1992) ChiMerge: discretization of numeric attributes. In: Proceedings of the 10th national conference on artificial intelligence, pp 123–128
Zurück zum Zitat Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 2. pp 1137–1143 Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 2. pp 1137–1143
Zurück zum Zitat Kotsiantis S, Kanellopoulos D (2006) Discretization techniques: a recent survey. GESTS Int Trans Comput Sci Eng 32(1):47–58 Kotsiantis S, Kanellopoulos D (2006) Discretization techniques: a recent survey. GESTS Int Trans Comput Sci Eng 32(1):47–58
Zurück zum Zitat Kurgan LA, Cios KJ (2004) CAIM discretization algorithm. IEEE Trans Knowl Data Eng 16(2):145–153CrossRef Kurgan LA, Cios KJ (2004) CAIM discretization algorithm. IEEE Trans Knowl Data Eng 16(2):145–153CrossRef
Zurück zum Zitat Kurgan LA, Cios KJ, Dick S (2006) Highly scalable and robust rule learner: performance evaluation and comparison. IEEE Trans Syst Man Cybern Part B Cybern 36(1):32–53CrossRef Kurgan LA, Cios KJ, Dick S (2006) Highly scalable and robust rule learner: performance evaluation and comparison. IEEE Trans Syst Man Cybern Part B Cybern 36(1):32–53CrossRef
Zurück zum Zitat Landgrebe T, Paclik P, Tax D, Verzakov S, Duin R (2004) Cost-based classifier evaluation for imbalanced problems. Lect Notes Comput Sci 3138:762–770CrossRef Landgrebe T, Paclik P, Tax D, Verzakov S, Duin R (2004) Cost-based classifier evaluation for imbalanced problems. Lect Notes Comput Sci 3138:762–770CrossRef
Zurück zum Zitat Liu H, Setiono R (1997) Feature selection via discretization. IEEE Trans Knowl Data Eng 9:642–645CrossRef Liu H, Setiono R (1997) Feature selection via discretization. IEEE Trans Knowl Data Eng 9:642–645CrossRef
Zurück zum Zitat López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141CrossRef López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141CrossRef
Zurück zum Zitat Luengo J, Fernández A, García S, Herrera F (2011) Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft Comput 15:1909–1936 Luengo J, Fernández A, García S, Herrera F (2011) Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft Comput 15:1909–1936
Zurück zum Zitat Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kauffman Publishers, Burlington Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kauffman Publishers, Burlington
Zurück zum Zitat Ruiz FJ, Angulo C, Agell N (2008) IDD: a supervised interval distance-based method for discretization. IEEE Trans Knowl Data Eng 20(9):1230–1238CrossRef Ruiz FJ, Angulo C, Agell N (2008) IDD: a supervised interval distance-based method for discretization. IEEE Trans Knowl Data Eng 20(9):1230–1238CrossRef
Zurück zum Zitat Tay F, Shen L (2002) A modified Chi2 algorithm for discretization. IEEE Trans Knowl Data Eng 14(2):666–670CrossRef Tay F, Shen L (2002) A modified Chi2 algorithm for discretization. IEEE Trans Knowl Data Eng 14(2):666–670CrossRef
Zurück zum Zitat Tsai CJ, Lee CI, Yang WP (2008) A discretization algorithm based on class-attribute contingency coefficient. Inf Sci 178(3):714–731CrossRef Tsai CJ, Lee CI, Yang WP (2008) A discretization algorithm based on class-attribute contingency coefficient. Inf Sci 178(3):714–731CrossRef
Zurück zum Zitat Wiens TS, Dale BC, Boyce MS, Kershaw GP (2008) Three way \(k\)-fold cross-validation of resource selection functions. Ecol Model 212(3–4):244–255CrossRef Wiens TS, Dale BC, Boyce MS, Kershaw GP (2008) Three way \(k\)-fold cross-validation of resource selection functions. Ecol Model 212(3–4):244–255CrossRef
Zurück zum Zitat Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83CrossRef Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull 1(6):80–83CrossRef
Zurück zum Zitat Yang Y, Webb GI, Wu X (2010) Discretization methods. In: Proceedings of data mining and knowledge discovery handbook, pp 101–116 Yang Y, Webb GI, Wu X (2010) Discretization methods. In: Proceedings of data mining and knowledge discovery handbook, pp 101–116
Metadaten
Titel
ur-CAIM: improved CAIM discretization for unbalanced and balanced data
Publikationsdatum
24.10.2014
Erschienen in
Soft Computing / Ausgabe 1/2016
Print ISSN: 1432-7643
Elektronische ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-014-1488-1

Weitere Artikel der Ausgabe 1/2016

Soft Computing 1/2016 Zur Ausgabe