Published in: Knowledge and Information Systems 7/2021

01-06-2021 | Regular Paper

Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect

Authors: José Daniel Pascual-Triana, David Charte, Marta Andrés Arroyo, Alberto Fernández, Francisco Herrera

Abstract

Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification allows for class prediction of new samples by learning from given training data. However, some properties can make datasets problematic to classify. To evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information about different intrinsic characteristics of the data, which serves to evaluate classifier compatibility and to choose a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to the classifiers’ performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls. Finally, some prospects for adapting this family of metrics to singular (more complex) problems are discussed.

Metadata
Title
Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect
Authors
José Daniel Pascual-Triana
David Charte
Marta Andrés Arroyo
Alberto Fernández
Francisco Herrera
Publication date
01-06-2021
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 7/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-021-01577-1
