Published in: Knowledge and Information Systems 1/2015

01.01.2015 | Regular Paper

An automatic extraction method of the domains of competence for learning classifiers using data complexity measures

Authors: Julián Luengo, Francisco Herrera



Abstract

The constant appearance of algorithms and problems in data mining makes it impossible to know in advance whether a model will perform well or poorly until it is applied, which can be costly. It would be useful to have a procedure that indicates, prior to the application of the learning algorithm and without needing a comparison with other methods, whether the outcome will be good or bad, using only the information available in the data. In this work, we present an automatic extraction method to determine the domains of competence of a classifier using a set of data complexity measures proposed for the task of classification. These domains codify the characteristics of the problems for which the classifier is or is not suitable, relating geometrical structures in the data that may be difficult to the final accuracy obtained by any classifier. To do so, the proposal applies 12 data complexity metrics to a large benchmark of datasets in order to analyze the behavior patterns of the method, obtaining intervals of data complexity measures with good or bad performance. Three classical but different algorithms are used as representative classifiers to analyze the proposal: C4.5, SVM and K-NN. From these intervals, two simple rules are obtained for each of these classifiers, one describing its good behavior and one its bad behavior, allowing the user to characterize the response quality of the methods from a dataset’s complexity. These two rules have been validated using fresh problems, showing that they are general and accurate. Thus, it can be established whether the classifier will perform well or poorly prior to its application.
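To make the idea of an interval-based rule concrete, the following is a minimal sketch and not the authors' extraction method. It computes one well-known data complexity measure, the maximum Fisher's discriminant ratio (commonly labeled F1), for a two-class problem, and then checks the value against an interval. That F1 is among the 12 metrics used in the paper is an assumption here, and the names `fisher_ratio_f1`, `predict_behavior` and the interval `GOOD_F1_INTERVAL` are hypothetical, chosen only for illustration.

```python
from statistics import mean, pvariance


def fisher_ratio_f1(class_a, class_b):
    """Maximum Fisher's discriminant ratio over all features (two classes).

    class_a, class_b: lists of feature vectors of equal length.
    For each feature: (mean_a - mean_b)^2 / (var_a + var_b).
    """
    n_features = len(class_a[0])
    best = 0.0
    for j in range(n_features):
        col_a = [row[j] for row in class_a]
        col_b = [row[j] for row in class_b]
        gap = (mean(col_a) - mean(col_b)) ** 2
        spread = pvariance(col_a) + pvariance(col_b)
        if spread == 0.0:
            # No within-class variance: separable on this feature
            # only if the class means differ.
            ratio = float("inf") if gap > 0.0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best


# Hypothetical interval for illustration only; NOT a value from the paper.
GOOD_F1_INTERVAL = (1.0, float("inf"))


def predict_behavior(f1_value, good_interval=GOOD_F1_INTERVAL):
    """Return "good" if the measure falls inside the interval, else "unknown"."""
    low, high = good_interval
    return "good" if low <= f1_value <= high else "unknown"


if __name__ == "__main__":
    a = [[0.0, 5.0], [2.0, 5.0]]
    b = [[10.0, 5.0], [12.0, 5.0]]
    value = fisher_ratio_f1(a, b)
    print(value, predict_behavior(value))
```

In practice a rule of this kind would combine intervals over several measures, and the interval bounds would come from the benchmark analysis described in the abstract rather than being fixed by hand as they are here.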


Footnotes
1
The software that implements the automatic extraction method can be downloaded from the associated webpage.
 
Metadata
Title
An automatic extraction method of the domains of competence for learning classifiers using data complexity measures
Authors
Julián Luengo
Francisco Herrera
Publication date
01.01.2015
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2015
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0700-4
