Published in: Knowledge and Information Systems 7/2021

01.06.2021 | Regular Paper

Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect

Authors: José Daniel Pascual-Triana, David Charte, Marta Andrés Arroyo, Alberto Fernández, Francisco Herrera



Abstract

Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification predicts the class of new samples by learning from given training data. However, some properties can make datasets hard to classify. To evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information about different intrinsic characteristics of the data, which helps to assess classifier compatibility and to choose a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to classifier performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work revisits complexity metrics based on data morphology. Given their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls. Finally, some prospects for adapting this family of metrics to singular (more complex) problems are discussed.
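The abstract only states that the metrics rely on ball coverage by classes, so the following is a minimal sketch of one plausible reading, not the authors' definition. It assumes that each class is covered greedily with balls centred on its own points, that each ball's radius is the distance to the nearest point of any other class, and that the score is the number of balls divided by the number of points. The function names `cover_class` and `onb_sketch`, the greedy rule, and both normalisations are hypothetical choices made for illustration.

```python
from math import dist


def cover_class(own, others):
    """Count balls needed to cover `own` without enclosing any point of `others`.

    Assumption: a ball centred on a point has a radius equal to the distance
    to the nearest point of another class (strict inequality for membership).
    """
    radii = [min(dist(p, q) for q in others) for p in own]
    # members[i]: indices of same-class points inside ball i (centre always included)
    members = [
        {j for j, q in enumerate(own) if dist(p, q) < r} | {i}
        for i, (p, r) in enumerate(zip(own, radii))
    ]
    uncovered = set(range(len(own)))
    balls = 0
    while uncovered:
        # Greedy step: pick the ball covering the most still-uncovered points.
        best = max(range(len(own)), key=lambda i: len(members[i] & uncovered))
        uncovered -= members[best]
        balls += 1
    return balls


def onb_sketch(X, y):
    """Return (total ratio, per-class average ratio) of balls to points.

    Requires at least two classes. Values near 0 suggest little overlap,
    values near 1 suggest almost every point needs its own ball.
    """
    classes = sorted(set(y))
    balls, sizes = {}, {}
    for c in classes:
        own = [x for x, label in zip(X, y) if label == c]
        rest = [x for x, label in zip(X, y) if label != c]
        balls[c] = cover_class(own, rest)
        sizes[c] = len(own)
    total = sum(balls.values()) / len(X)
    per_class_avg = sum(balls[c] / sizes[c] for c in classes) / len(classes)
    return total, per_class_avg


# Two well-separated classes versus two fully interleaved classes (1-D toy data).
separated = onb_sketch([(0,), (1,), (2,), (10,), (11,), (12,)], [0, 0, 0, 1, 1, 1])
interleaved = onb_sketch([(0,), (2,), (4,), (1,), (3,), (5,)], [0, 0, 0, 1, 1, 1])
```

Under these assumptions, the separated toy data needs one ball per class, while the interleaved data needs one ball per point. This matches the intuition that more balls per point indicates more class overlap. Reporting a per-class average next to the total is one way to keep a small minority class from being swamped by a large majority class when the data are imbalanced.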

Metadata
Title
Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect
Authors
José Daniel Pascual-Triana
David Charte
Marta Andrés Arroyo
Alberto Fernández
Francisco Herrera
Publication date
01.06.2021
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 7/2021
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-021-01577-1
