Published in: Data Mining and Knowledge Discovery 3/2017

5 December 2016

On classifier behavior in the presence of mislabeling noise

Authors: Katsiaryna Mirylenka, George Giannakopoulos, Le Minh Do, Themis Palpanas


Abstract

Machine learning algorithms perform differently in settings with varying levels of training set mislabeling noise. Therefore, the choice of the right algorithm for a particular learning problem is crucial. This paper contributes to two dual problems: first, comparing algorithm behavior; and second, choosing learning algorithms for noisy settings. We present the “sigmoid rule” framework, which can be used to choose the most appropriate learning algorithm depending on the properties of noise in a classification problem. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. We study the characteristics of the sigmoid function using five representative non-sequential classifiers, namely, Naïve Bayes, kNN, SVM, a decision tree classifier, and a rule-based classifier, and three widely used sequential classifiers based on hidden Markov models, conditional random fields and recursive neural networks. Based on the sigmoid parameters, we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, indicating that we can estimate an expected performance over a dataset regardless of the underlying algorithm. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of evolving nature.
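
The idea of modeling expected performance as a sigmoid function of the signal-to-noise ratio can be illustrated with a minimal sketch. The abstract does not state the functional form, so everything below is an assumption made for illustration: the log-ratio `z`, the parameter names `m`, `M`, `b`, `c`, `d`, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def snr_log_ratio(signal: int, noise: int) -> float:
    """Assumed signal-to-noise measure: log(1 + S) - log(1 + N), where S is the
    number of correctly labeled and N the number of mislabeled training instances."""
    return math.log(1 + signal) - math.log(1 + noise)


def sigmoid_performance(z: float, m: float, M: float, b: float, c: float, d: float) -> float:
    """Assumed sigmoid: accuracy rises from a floor m towards a ceiling M as z grows.
    b, c and d control the shape, slope and horizontal shift of the curve."""
    return m + (M - m) / (1.0 + b * math.exp(-c * (z - d)))


# Illustrative parameter values only; they do not come from the paper.
params = dict(m=0.5, M=0.9, b=1.0, c=1.0, d=0.0)

n_total = 1000
for noise_share in (0.05, 0.5, 0.95):
    noise = round(n_total * noise_share)   # mislabeled instances
    signal = n_total - noise               # correctly labeled instances
    z = snr_log_ratio(signal, noise)
    acc = sigmoid_performance(z, **params)
    print(f"noise share={noise_share:.2f}  z={z:+.2f}  expected accuracy={acc:.3f}")
```

Under these assumptions, two algorithms could be compared by fitting one such curve per algorithm and inspecting the fitted floor, ceiling and slope; the fitting step itself is omitted here.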

Footnotes
1
By the performance of an algorithm, we mean its classification accuracy.
 
2
Throughout the rest of this work we use the term noise to refer to label noise, unless otherwise indicated.
 
3
A preliminary version of this work has appeared in Mirylenka et al. (2012).
 
4
Instead of 0.05, one can use any value close to 0, describing a normalized measure of distance from the optimal performance. In the case of \(p = 0.05\), the distance from the optimal performance is \(5\%\).
 
5
We note that high levels of noise, such as 95%, are often observed in the presence of concept drift, e.g., when learning computer-user browsing habits in a network environment where several different users share a single IP address.
 
7
Repeated random sub-sampling validation is also known as Monte Carlo cross-validation (Kuhn and Johnson 2013).
 
8
The settings of the genetic algorithm can be found in the appendix.
 
9
We would like to thank Christos Faloutsos for kindly providing the code for the fractal dimensionality estimation.
 
10
RMAE is also known as mean absolute percentage error.
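
Two of the terms defined in the footnotes above, repeated random sub-sampling validation (footnote 7) and mean absolute percentage error (footnote 10), can be sketched in a few lines. This is a generic illustration under stated assumptions, not the authors' experimental code: the function names, the 30% test fraction, the seed, and the sample values are all hypothetical.

```python
import random


def monte_carlo_splits(n_items, test_fraction, repetitions, seed=0):
    """Yield (train_indices, test_indices) pairs, each from an independent random shuffle.
    Unlike k-fold cross-validation, test sets of different repetitions may overlap."""
    rng = random.Random(seed)
    n_test = max(1, int(n_items * test_fraction))
    for _ in range(repetitions):
        idx = list(range(n_items))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction.
    Every value in `actual` must be nonzero."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


splits = list(monte_carlo_splits(n_items=10, test_fraction=0.3, repetitions=5))
print(len(splits), "splits; first test set:", sorted(splits[0][1]))
print("MAPE:", mape([0.8, 0.5], [0.72, 0.55]))
```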
 
References
Abdulrahman SM, Brazdil P, van Rijn JN, Vanschoren J (2015) Algorithm selection via meta-learning and sample-based active testing. In: Proceedings of the 2015 international workshop on meta-learning and algorithm selection (MetaSel) co-located with European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 55–66
Ali S, Smith K (2006) On learning algorithm selection for classification. Appl Soft Comput 6(2):119–138
Box GE, Jenkins GM, Reinsel GC, Ljung GM (2015) Time series analysis: forecasting and control. Wiley, New York
Bradley A (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159
Brazdil P, Giraud Carrier C, Soares C, Vilalta R (2009) Development of metalearning systems for algorithm recommendation. Metalearning, 31–59
Brazdil PB, Soares C, Pinto Da Costa J (2003) Ranking learning algorithms: using IBL and meta-learning on accuracy and time results. Mach Learn 50(3):251–277
Camastra F, Vinciarelli A (2002) Estimating the intrinsic dimension of data with a fractal-based method. IEEE Trans Pattern Anal Mach Intell 24(10):1404–1407
Chevaleyre Y, Zucker JD (2000) Noise-tolerant rule induction from multi-instance data. In: Proceedings of the workshop on attribute-value and relational learning: crossing the boundaries, co-located with international conference on machine learning (ICML), pp 47–52
Cohen WW (1995) Fast effective rule induction. In: Proceedings of the twelfth international conference on machine learning, pp 115–123
Corder GW, Foreman DI (2009) Nonparametric statistics for non-statisticians: a step-by-step approach. Wiley, Hoboken
Cruz RM, Sabourin R, Cavalcanti GD, Ren TI (2015) Meta-des: a dynamic ensemble selection framework using meta-learning. Pattern Recognit 48(5):1925–1935
de Sousa E, Traina A, Traina Jr. C, Faloutsos C (2006) Evaluating the intrinsic dimension of evolving data streams. In: Proceedings of the 2006 ACM symposium on applied computing, pp 643–648
Dupont P (2006) Noisy sequence classification with smoothed Markov chains. In: Proceedings of the 8th French conference on machine learning (CAP 2006), pp 187–201
Eom SB, Ketcherside MA, Lee HH, Rodgers ML, Starrett D (2004) The determinants of web-based instructional systems’ outcome and satisfaction: an empirical investigation. Cognitive aspects of online programs. Instr Technol, pp 96–139
Garcia LPF, de Carvalho ACPLF, Lorena AC (2016) Noise detection in the meta-learning level. Neurocomputing 176:14–25
Giannakopoulos G, Palpanas T (2010) The effect of history on modeling systems’ performance: the problem of the demanding lord. In: IEEE 10th international conference on data mining (ICDM). doi:10.1109/ICDM.2010.90
Giraud-Carrier C, Vilalta R, Brazdil P (2004) Introduction to the special issue on meta-learning. Mach Learn 54(3):187–193
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I (2009) The weka data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Han J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Haussler D (1990) Probably approximately correct learning. University of California, Santa Cruz, Computer Research Laboratory
Heywood MI (2015) Evolutionary model building under streaming data for classification tasks: opportunities and challenges. Genet Program Evolvable Mach 16(3):283–326
Kalapanidas E, Avouris N, Craciun M, Neagu D (2003) Machine learning algorithms: a study on noise sensitivity. In: Proceedings 1st Balkan conference in informatics, pp 356–365
Keerthi S, Shevade S, Bhattacharyya C, Murthy K (2001) Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput 13(3):637–649
Klinkenberg R (2005) Meta-learning, model selection, and example selection in machine learning domains with concept drift. In: Lernen, Wissensentdeckung und Adaptivität (LWA) 2005, GI Workshops, Saarbrücken, October 10th–12th, pp 164–171
Kuh A, Petsche T, Rivest RL (1990) Learning time-varying concepts. In: Conference on neural information processing systems (NIPS), pp 183–189
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning (ICML), pp 282–289
Li Q, Li T, Zhu S, Kambhamettu C (2002) Improving medical/biological data classification performance by wavelet preprocessing. In: Proceedings of the IEEE international conference on data mining (ICDM), pp 657–660
Massey FJ Jr (1951) The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68–78
Mirylenka K, Cormode G, Palpanas T, Srivastava D (2015) Conditional heavy hitters: detecting interesting correlations in data streams. Int J Very Large Data Bases (VLDB) 24(3):395–414
Mirylenka K, Giannakopoulos G, Palpanas T (2012) SRF: a framework for the study of classifier behavior under training set mislabeling noise. In: Advances in knowledge discovery and data mining, lecture notes in computer science, vol 7301, pp 109–121
Mirylenka K, Palpanas T, Cormode G, Srivastava D (2013) Finding interesting correlations with conditional heavy hitters. In: IEEE 29th international conference on data engineering (ICDE), pp 1069–1080
Mantovani RG, Rossi ALD, Vanschoren J, Carvalho ACPLF (2015) Meta-learning recommendation of default hyper-parameter values for SVMs in classification tasks. In: Proceedings of the 2015 international workshop on meta-learning and algorithm selection (MetaSel), European conference on machine learning and principles and practice of knowledge discovery in databases (ECMLPKDD), pp 80–92
Nettleton DF, Orriols-Puig A, Fornells A (2010) A study of the effect of different types of noise on the precision of supervised learning techniques. Artif Intell Rev 33(4):275–306. doi:10.1007/s10462-010-9156-z
Pechenizkiy M (2015) Predictive analytics on evolving data streams anticipating and adapting to changes in known and unknown contexts. In: IEEE international conference on high performance computing & simulation (HPCS), pp 658–659
Pendrith M, Sammut C (1994) On reinforcement learning of control actions in noisy and non-markovian domains. Technical report, School of Computer Science and Engineering, The University of New South Wales, Sydney
Rabiner L, Juang B (1986) An introduction to hidden Markov models. IEEE ASSP Mag 3(1):4–16
Rossi ALD, de Leon Ponce, Ferreira de Carvalho AC, Soares C, Feres de Souza B (2014) MetaStream: a meta-learning based method for periodic algorithm selection in time-changing data. Neurocomputing 127:52–64
Smith MR, Mitchell L, Giraud-Carrier C, Martinez T (2014) Recommending learning algorithms and their associated hyperparameters. arXiv:1407.1890
Taylor R (1990) Interpretation of the correlation coefficient: a basic review. J Diagn Med Sonogr 6(1):35–39
Teytaud O (2001) Learning with noise. Extension to regression. In: Proceedings of the IEEE international joint conference on neural networks (IJCNN’01) vol 3, pp 1787–1792
Theodoridis S, Koutroumbas K (2003) Pattern recognition. Academic Press, San Diego
Valiant L (1984) A theory of the learnable. Commun ACM 27(11):1134–1142
Vapnik VN (1998) Statistical learning theory, vol 1. Wiley, New York
Waluyan L, Sasipan S, Noguera S, Asai T (2009) Analysis of potential problems in people management concerning information security in cross-cultural environment -in the case of Malaysia. In: Proceedings of the third international symposium on human aspects of information security & assurance (HAISA), pp 13–24
Widmer G (1997) Tracking context changes through meta-learning. Mach Learn 27(3):259–286
Wolpert D (1996) The existence of a priori distinctions between learning algorithms. Neural Comput 8:1391–1421
Wolpert D (2001) The supervised learning no-free-lunch theorems. In: Proceedings of the 6th online world conference on soft computing in industrial applications. Springer, London, pp 25–42
Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8:1341–1390
won Lee J, Giraud-Carrier C (2008) New insights into learning algorithms and datasets. In: IEEE seventh international conference on machine learning and applications (ICMLA’08), pp 135–140
Metadata
Title
On classifier behavior in the presence of mislabeling noise
Authors
Katsiaryna Mirylenka
George Giannakopoulos
Le Minh Do
Themis Palpanas
Publication date
5 December 2016
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 3/2017
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-016-0484-8
