Skip to main content
Erschienen in: Pattern Analysis and Applications 2/2018

27.09.2016 | Theoretical Advances

Dealing with overlap and imbalance: a new metric and approach

verfasst von: Zalán Borsos, Camelia Lemnaru, Rodica Potolea

Erschienen in: Pattern Analysis and Applications | Ausgabe 2/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper addresses learning in complex scenarios involving imbalance and overlap. We propose a novel measure, the Augmented R-value, for estimating the level of overlap in the data. It improves an existing model-based measure, by including the data imbalance in the estimation process. We provide both a theoretical demonstration and empirical validations of the new metric’s efficacy in estimating the overlap level. Another contribution of the present paper is to propose a collection of meta-features to be used in conjunction with a meta-learning strategy for predicting the most suitable classifier for a given problem. The evaluations performed on a well-known collection of benchmark problems have shown that the meta-learning approach achieves superior results to the manual classifier selection process normally carried out by data scientists. The analysis of the results obtained by the meta-feature selection step has confirmed the power of the Augmented R-value in predicting the expected performance of classifiers in such complex classification scenarios. Also, we found that the overlap is a more serious factor affecting the performance of classifiers than imbalance.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aha DW (1992) Generalizing from case studies: a case study. In: Proceedings of the ninth international conference on machine learning, Morgan Kaufmann, pp 1–10 Aha DW (1992) Generalizing from case studies: a case study. In: Proceedings of the ninth international conference on machine learning, Morgan Kaufmann, pp 1–10
4.
Zurück zum Zitat Barandela R, Sánchez JS, Garcıa V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit. 36(3):849–851CrossRef Barandela R, Sánchez JS, Garcıa V, Rangel E (2003) Strategies for learning in class imbalance problems. Pattern Recognit. 36(3):849–851CrossRef
5.
Zurück zum Zitat Bradley AP (1997) The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognit. 30:1145–1159CrossRef Bradley AP (1997) The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognit. 30:1145–1159CrossRef
6.
Zurück zum Zitat Brodersen K, Ong CS, Stephan K, Buhmann J (2010) The balanced accuracy and its posterior distribution. In: Pattern recognition (ICPR), 2010 20th international conference on, pp 3121–3124, doi:10.1109/ICPR.2010.764 Brodersen K, Ong CS, Stephan K, Buhmann J (2010) The balanced accuracy and its posterior distribution. In: Pattern recognition (ICPR), 2010 20th international conference on, pp 3121–3124, doi:10.​1109/​ICPR.​2010.​764
7.
Zurück zum Zitat Chawla N (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York. doi:10.1007/0-387-25465-X_40 Chawla N (2005) Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York. doi:10.​1007/​0-387-25465-X_​40
8.
Zurück zum Zitat Chawla N, Lazarevic A, Hall L, Bowyer K (2003) Smoteboost: Improving prediction of the minority class in boosting. In: LavraČ N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD 2003. Lecture notes in computer science, vol 2838, Springer, Berlin, pp 107–119. doi:10.1007/978-3-540-39804-2_12, Chawla N, Lazarevic A, Hall L, Bowyer K (2003) Smoteboost: Improving prediction of the minority class in boosting. In: LavraČ N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD 2003. Lecture notes in computer science, vol 2838, Springer, Berlin, pp 107–119. doi:10.​1007/​978-3-540-39804-2_​12,
11.
Zurück zum Zitat Denil M, Trappenberg T (2010) Overlap versus imbalance. In: Proceedings of the 23rd Canadian conference on advances in artificial intelligence, Springer, Berlin, Heidelberg, AI’10, pp 220–231. doi:10.1007/978-3-642-13059-5_22 Denil M, Trappenberg T (2010) Overlap versus imbalance. In: Proceedings of the 23rd Canadian conference on advances in artificial intelligence, Springer, Berlin, Heidelberg, AI’10, pp 220–231. doi:10.​1007/​978-3-642-13059-5_​22
12.
Zurück zum Zitat Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’99, pp 155–164, doi:10.1145/312129.312220, Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’99, pp 155–164, doi:10.​1145/​312129.​312220,
14.
Zurück zum Zitat Grzymala-Busse J, Stefanowski J, Wilk S (2004) A comparison of two approaches to data mining from imbalanced data. In: Negoita M, Howlett R, Jain L (eds) Knowledge-based intelligent information and engineering systems. Lecture notes in computer science, vol 3213, Springer, Berlin, Heidelberg, pp 757–763. doi:10.1007/978-3-540-30132-5_103, Grzymala-Busse J, Stefanowski J, Wilk S (2004) A comparison of two approaches to data mining from imbalanced data. In: Negoita M, Howlett R, Jain L (eds) Knowledge-based intelligent information and engineering systems. Lecture notes in computer science, vol 3213, Springer, Berlin, Heidelberg, pp 757–763. doi:10.​1007/​978-3-540-30132-5_​103,
16.
Zurück zum Zitat Gutlein M, Frank E, Hall M, Karwath A (2009) Large-scale attribute selection using wrappers. In: Computational intelligence and data mining, 2009. CIDM ’09. IEEE Symposium on, pp 332–339. doi:10.1109/CIDM.2009.4938668 Gutlein M, Frank E, Hall M, Karwath A (2009) Large-scale attribute selection using wrappers. In: Computational intelligence and data mining, 2009. CIDM ’09. IEEE Symposium on, pp 332–339. doi:10.​1109/​CIDM.​2009.​4938668
20.
Zurück zum Zitat Lemnaru C, Potolea R (2012) Imbalanced classification problems: systematic study, issues and best practices. In: Zhang R, Zhang J, Zhang Z, Filipe J, Cordeiro J (eds) Enterprise information systems. Lecture notes in business information processing, vol 102, Springer, Berlin, Heidelberg, pp 35–50. doi:10.1007/978-3-642-29958-2_3 Lemnaru C, Potolea R (2012) Imbalanced classification problems: systematic study, issues and best practices. In: Zhang R, Zhang J, Zhang Z, Filipe J, Cordeiro J (eds) Enterprise information systems. Lecture notes in business information processing, vol 102, Springer, Berlin, Heidelberg, pp 35–50. doi:10.​1007/​978-3-642-29958-2_​3
22.
Zurück zum Zitat Liu B, Ma Y, Wong C (2000) Improving an association rule based classifier. In: Zighed D, Komorowski J, Żytkow J (eds) Principles of data mining and knowledge discovery. Lecture notes in computer science, vol 1910, Springer, Berlin, Heidelberg, pp 504–509. doi:10.1007/3-540-45372-5_58, Liu B, Ma Y, Wong C (2000) Improving an association rule based classifier. In: Zighed D, Komorowski J, Żytkow J (eds) Principles of data mining and knowledge discovery. Lecture notes in computer science, vol 1910, Springer, Berlin, Heidelberg, pp 504–509. doi:10.​1007/​3-540-45372-5_​58,
24.
Zurück zum Zitat López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141. doi:10.1016/j.ins.2013.07.007 CrossRef López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141. doi:10.​1016/​j.​ins.​2013.​07.​007 CrossRef
25.
Zurück zum Zitat Luengo J, Fernández A, García S (2011) Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft Comput 15(10):1909–1936. doi:10.1007/s00500-010-0625-8 CrossRef Luengo J, Fernández A, García S (2011) Addressing data complexity for imbalanced data sets: analysis of smote-based oversampling and evolutionary undersampling. Soft Comput 15(10):1909–1936. doi:10.​1007/​s00500-010-0625-8 CrossRef
27.
Zurück zum Zitat Potolea R, Cacoveanu S, Lemnaru C (2011) Meta-learning framework for prediction strategy evaluation. In: Filipe J, Cordeiro J (eds) Enterprise information systems. Lecture notes in business information processing, vol 73, Springer, Berlin, Heidelberg, pp 280–295. doi:10.1007/978-3-642-19802-1_20, Potolea R, Cacoveanu S, Lemnaru C (2011) Meta-learning framework for prediction strategy evaluation. In: Filipe J, Cordeiro J (eds) Enterprise information systems. Lecture notes in business information processing, vol 73, Springer, Berlin, Heidelberg, pp 280–295. doi:10.​1007/​978-3-642-19802-1_​20,
29.
Zurück zum Zitat Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc, San Francisc Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc, San Francisc
31.
Zurück zum Zitat Seiffert C, Khoshgoftaar T, Van Hulse J, Folleco A (2007) An empirical study of the classification performance of learners on imbalanced and noisy software quality data. In: Information reuse and integration, 2007. IRI 2007. IEEE international conference on, pp 651–658. doi:10.1109/IRI.2007.4296694 Seiffert C, Khoshgoftaar T, Van Hulse J, Folleco A (2007) An empirical study of the classification performance of learners on imbalanced and noisy software quality data. In: Information reuse and integration, 2007. IRI 2007. IEEE international conference on, pp 651–658. doi:10.​1109/​IRI.​2007.​4296694
32.
Zurück zum Zitat Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data Sun Y, Kamel MS, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data
35.
Zurück zum Zitat Visa S (2005) Issues in mining imbalanced data sets—a review paper. In: Proceedings of the sixteen midwest artificial intelligence and cognitive science conference, 2005, pp 67–73 Visa S (2005) Issues in mining imbalanced data sets—a review paper. In: Proceedings of the sixteen midwest artificial intelligence and cognitive science conference, 2005, pp 67–73
37.
Zurück zum Zitat Weiss GM (2003) The effect of small disjuncts and class distribution on decision tree learning. PhD thesis, New Brunswick, NJ, USA, aAI3093004 Weiss GM (2003) The effect of small disjuncts and class distribution on decision tree learning. PhD thesis, New Brunswick, NJ, USA, aAI3093004
39.
Zurück zum Zitat Wu G, Chang EY (2003) Class-boundary alignment for imbalanced dataset learning. In: In ICML 2003 workshop on learning from imbalanced data sets, pp 49–56 Wu G, Chang EY (2003) Class-boundary alignment for imbalanced dataset learning. In: In ICML 2003 workshop on learning from imbalanced data sets, pp 49–56
40.
Zurück zum Zitat Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh international conference on knowledge discovery and data mining, ACM Press, pp 204–213 Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh international conference on knowledge discovery and data mining, ACM Press, pp 204–213
Metadaten
Titel
Dealing with overlap and imbalance: a new metric and approach
verfasst von
Zalán Borsos
Camelia Lemnaru
Rodica Potolea
Publikationsdatum
27.09.2016
Verlag
Springer London
Erschienen in
Pattern Analysis and Applications / Ausgabe 2/2018
Print ISSN: 1433-7541
Elektronische ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-016-0583-6

Weitere Artikel der Ausgabe 2/2018

Pattern Analysis and Applications 2/2018 Zur Ausgabe

Premium Partner