Dealing with the evaluation of supervised classification algorithms

Authors: Guzman Santafe, Iñaki Inza, Jose A. Lozano

Published in: Artificial Intelligence Review, Issue 4/2015 (01.12.2015)


Abstract

Assessing the performance of a learning method in terms of its prediction ability on independent data is essential in supervised classification. This process provides the information needed to evaluate the quality of a classification model and to choose the most appropriate technique for the supervised classification problem at hand. This paper reviews the most important aspects of the evaluation process of supervised classification algorithms, putting the overall process in perspective so as to lead the reader to a deep understanding of it. Additionally, recommendations about the use and limitations of the reviewed methods, together with a critical view of them, are presented according to the specific characteristics of the supervised classification scenario.


Footnotes
1
The output of the classifier for a given instance is just a class assignment.
 
2
A continuous classifier is one that yields a numeric value representing the degree to which an instance is a member of a class. A discrimination threshold is therefore needed to obtain a discrete classifier and thus to decide the class assignment for the instance.
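As a minimal sketch (our own illustration, with invented scores and threshold values), the snippet below turns hypothetical continuous membership scores into discrete class assignments via a discrimination threshold:

```python
import numpy as np

def discretize(scores, threshold=0.5):
    """Turn continuous class-membership scores into discrete labels."""
    return (np.asarray(scores) >= threshold).astype(int)

# Hypothetical membership scores from a continuous classifier.
scores = [0.12, 0.58, 0.40, 0.91]
print(discretize(scores))        # threshold 0.5  -> [0 1 0 1]
print(discretize(scores, 0.35))  # lower threshold -> [0 1 1 1]
```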
 
3
Certain types of misclassification are often more serious than others (e.g., the relative severity of misclassifications when screening for a disease that can be easily treated if detected early enough, but which is otherwise fatal).
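To make the idea concrete (a hedged sketch with invented costs, not taken from the paper), a cost matrix can weight each kind of error, so that the total misclassification cost, rather than the raw error count, reflects the relative severity:

```python
import numpy as np

# Hypothetical cost matrix for disease screening: rows = true class,
# columns = predicted class (0 = healthy, 1 = diseased). Missing a
# treatable disease (false negative) is assumed far costlier than a
# false alarm (false positive).
cost = np.array([[0.0,  1.0],   # true healthy:  correct, false positive
                 [50.0, 0.0]])  # true diseased: false negative, correct

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1])

total_cost = cost[y_true, y_pred].sum()
print(total_cost)  # 51.0: one false negative (50) + one false positive (1)
```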
 
4
The estimation scheme is similar to leave-one-out, but leaves j data samples in the test set.
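A small sketch of this scheme (our own illustration): it enumerates every partition that leaves j samples out for testing, reducing to leave-one-out when j = 1:

```python
from itertools import combinations
import numpy as np

def leave_j_out_splits(n, j):
    """Yield (train, test) index pairs, one per possible size-j test set."""
    idx = np.arange(n)
    for test in combinations(idx, j):
        test = np.array(test)
        yield np.setdiff1d(idx, test), test

# Toy run: 4 samples, j = 2 -> C(4, 2) = 6 train/test partitions.
for train, test in leave_j_out_splits(4, 2):
    print("train:", train, "test:", test)
```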
 
5
An extreme example of a classifier with strong overfitting is \( kNN \) with \(k=1\).
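This overfitting is easy to reproduce. The sketch below (a hedged illustration assuming scikit-learn; the synthetic data and parameters are invented) fits a 1-NN classifier, which memorizes the training set, and compares training and test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic data: 1-NN memorizes the training set exactly, so its
# training error is (essentially) zero, yet it generalizes worse to
# held-out data.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("train accuracy:", knn.score(X_tr, y_tr))  # 1.0: memorization
print("test accuracy:", knn.score(X_te, y_te))   # noticeably lower
```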
 
6
Conservative tests have a real type I error lower than the nominal \(\alpha \) value. By contrast, liberal tests have a real type I error higher than their nominal \(\alpha \) value.
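One way to see the distinction empirically (a simulation sketch of our own, not from the paper) is to run a test many times on data generated under a true null hypothesis and compare the observed rejection rate with the nominal \(\alpha \):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_runs, rejections = 0.05, 2000, 0

# Both samples come from the same distribution, so the null hypothesis
# is true; the fraction of rejections estimates the real type I error.
for _ in range(n_runs):
    a, b = rng.normal(size=30), rng.normal(size=30)
    _, p_value = stats.ttest_ind(a, b)
    rejections += p_value < alpha

print("empirical type I error:", rejections / n_runs)
# Close to alpha for a well-calibrated test; clearly below alpha would
# indicate a conservative test, clearly above a liberal one.
```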
 
7
The same data partition is used for both classification algorithms.
 
8
Note that we abuse notation in letting N denote the size of the dataset (when dealing with only one dataset) and the number of datasets (when dealing with several datasets).
 
9
Homoscedasticity, in the case of comparing several classification algorithms in one dataset, implies equal variability of the classification behavior for every algorithm considered in the study.
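As an illustrative sketch (our own, with invented per-fold scores), Levene's test can be used to check this equal-variance assumption across the compared algorithms:

```python
from scipy import stats

# Hypothetical per-fold scores of three algorithms on the same dataset.
scores_a = [0.81, 0.83, 0.80, 0.82, 0.84]
scores_b = [0.78, 0.88, 0.70, 0.90, 0.75]
scores_c = [0.79, 0.80, 0.81, 0.78, 0.80]

# Levene's test: null hypothesis of equal variances across groups.
stat, p_value = stats.levene(scores_a, scores_b, scores_c)
print(f"statistic={stat:.3f}, p-value={p_value:.3f}")
# A small p-value suggests the homoscedasticity assumption is violated.
```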
 
10
\(\bar{S}^j=\frac{1}{K}\sum _{k=1}^K S_k^j, \text{ with } j=1,\cdots ,N\).
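In code this is simply a per-dataset average over the K scores; the short sketch below uses invented numbers:

```python
import numpy as np

# Hypothetical scores S[k, j] of one algorithm: K = 4 runs (rows) on
# N = 3 datasets (columns); S_bar[j] averages the K scores on dataset j.
S = np.array([[0.90, 0.72, 0.85],
              [0.88, 0.75, 0.83],
              [0.91, 0.70, 0.86],
              [0.89, 0.74, 0.84]])

S_bar = S.mean(axis=0)  # one mean score per dataset
print(S_bar)            # [0.895  0.7275 0.845 ]
```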
 
11
What counts as the best difference depends on the selected score: if the best-performing algorithm is the one with the maximum score, the higher the difference, the better; and vice versa.
 
15
Note that confidence intervals, similarly to McNemar’s test, can be used to evaluate the differences between two specific classifiers but not the differences between the classification algorithms.
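A minimal sketch of McNemar's exact test on invented paired predictions (our illustration; for large discordant counts the chi-square approximation is typically used instead):

```python
import numpy as np
from scipy import stats

# Predictions of two fitted classifiers on the same test instances
# (hypothetical data): 1 = correct, 0 = wrong.
clf_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
clf_b = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# McNemar's test only uses the discordant pairs: b (A right, B wrong)
# and c (A wrong, B right); under the null they are equally likely.
b = int(np.sum((clf_a == 1) & (clf_b == 0)))
c = int(np.sum((clf_a == 0) & (clf_b == 1)))
p_value = stats.binomtest(b, n=b + c, p=0.5).pvalue
print(f"b={b}, c={c}, exact McNemar p-value={p_value:.3f}")
```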
 
16
It is also possible to use WEKA methods from the command line and to program new methods within the WEKA framework. This requires deeper knowledge of the tool and of programming languages, but it allows the current functionalities to be extended.
 
17
Several datasets can be chosen for the study but classification algorithms are compared separately for each dataset.
 
Metadata
Title
Dealing with the evaluation of supervised classification algorithms
Authors
Guzman Santafe
Iñaki Inza
Jose A. Lozano
Publication date
01.12.2015
Publisher
Springer Netherlands
Published in
Artificial Intelligence Review / Issue 4/2015
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI
https://doi.org/10.1007/s10462-015-9433-y
