Published in: Artificial Intelligence Review 2/2015

01-02-2015

The number of classes as a source for instability of decision tree algorithms in high dimensional datasets

Author: José Augusto Baranauskas


Abstract

For a long time, experimental studies have been performed in a large number of fields of AI, especially in machine learning. A careful evaluation of a specific machine learning algorithm is important, but often difficult to conduct in practice. On the other hand, simulation studies can provide insights into the behavior and performance of machine learning approaches much more readily than real-world datasets, where the target concept is normally unknown. In decision tree induction algorithms, an interesting source of instability that is sometimes neglected by researchers is the number of classes in the training set. This paper uses simulation to extend previous work by Leo Breiman on properties of splitting criteria. Our simulation results show that the number of best-splits grows with the number of classes: exponentially for both the entropy and twoing criteria, and linearly for the gini criterion. Since more splits imply more alternative choices, decreasing the number of classes in high dimensional datasets (ranging from hundreds to thousands of attributes, as typically found in biomedical domains) can help lower the instability of decision trees. Another important contribution of this work concerns the fact that, for \(<\)5 classes, balanced datasets are prone to provide more best-splits (thus increasing instability) than imbalanced ones, including the binary problems often addressed in machine learning; on the other hand, for five or more classes, balanced datasets can provide few best-splits.
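Two of the splitting criteria compared in the abstract can be illustrated with a short sketch (the helper functions below are my own illustration, not code from the paper); both impurity measures are maximal on a balanced class distribution and zero on a pure one:

```python
import math

def gini(p):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(pj * pj for pj in p)

def entropy(p):
    """Entropy impurity (natural log); 0*log(0) is taken as 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

balanced = [0.5, 0.5]   # balanced binary problem
pure = [1.0, 0.0]       # node containing a single class

print(gini(balanced), gini(pure))        # 0.5 0.0
print(entropy(balanced), entropy(pure))  # ln(2) ~ 0.693, then 0.0
```

A decision tree induction algorithm chooses the split that most reduces such an impurity measure; ties among candidate splits ("best-splits") are the source of instability studied in the paper.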


Footnotes
1
An impurity function \(\phi (\mathbf p )\) must obey three properties: (i) it is concave in \(\mathbf p \); (ii) it attains its maximum when all \(p_j=1/J\); and (iii) it attains its minimum when some \(p_i=1\), and therefore \(p_j=0\) for all \(j \ne i\). Refer to Breiman et al. (1984) for more details.
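Properties (ii) and (iii) can be checked numerically for a concrete impurity function; a minimal sketch using the Gini impurity (my choice of example, not part of the footnote):

```python
def phi(p):
    """Gini impurity, one concrete impurity function phi(p)."""
    return 1.0 - sum(pj * pj for pj in p)

J = 4
uniform = [1.0 / J] * J           # property (ii): maximum here
vertex = [1.0] + [0.0] * (J - 1)  # property (iii): minimum here
skewed = [0.7, 0.1, 0.1, 0.1]     # any distribution in between

assert phi(vertex) == 0.0
assert phi(uniform) > phi(skewed) > phi(vertex)
```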
 
2
Since the problem is symmetric for \(P_L\) and \(P_R\), a maximum for \(\alpha\) is also a maximum for its complement.
 
3
Outliers are defined as any points larger than \(Q3+1.5(Q3-Q1)\) or lower than \(Q1-1.5(Q3-Q1)\), where \(Q1\) and \(Q3\) represent first and third quartiles, respectively.
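The quartile-based rule in this footnote corresponds to the standard box-plot fences; a small sketch (the helper name `iqr_outliers` is mine, not the paper's):

```python
from statistics import quantiles

def iqr_outliers(values):
    """Points below Q1 - 1.5*(Q3-Q1) or above Q3 + 1.5*(Q3-Q1)."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    return [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(iqr_outliers([1, 2, 3, 4, 5, 100]))  # only 100 lies outside the fences
```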
 
Literature
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor Newsl 6(1):20–29
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth & Books, Pacific Grove, CA
Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. ACM, San Diego, CA
Gamberger D, Lavrač N, Zelezný F, Tolar J (2004) Induction of comprehensible models for gene expression datasets by subgroup discovery methodology. J Biomed Inform 37(4):269–284
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco, CA
Rosenfeld N, Aharonov R, Meiri E, Rosenwald S, Spector Y, Zepeniuk M, Benjamin H, Shabes N, Tabak S, Levy A et al (2008) MicroRNAs accurately identify cancer tissue origin. Nat Biotechnol 26(4):462–469
Souto M, Bittencourt V, Costa J (2006) An empirical analysis of under-sampling techniques to balance a protein structural class dataset. In: King I, Wang J, Chan L, Wang D (eds) Neural information processing, Lecture notes in computer science, vol 4234, pp 21–29. Springer, Berlin. http://dx.doi.org/10.1007/11893295_3
Weiss GM, Provost F (2003) Learning when training data are costly: the effect of class distribution on tree induction. J Artif Intell Res 19:315–354
Weiss SM, Indurkhya N, Zhang T, Damerau F (2004) Text mining: predictive methods for analyzing unstructured information. Springer, Berlin
Metadata
Title
The number of classes as a source for instability of decision tree algorithms in high dimensional datasets
Author
José Augusto Baranauskas
Publication date
01-02-2015
Publisher
Springer Netherlands
Published in
Artificial Intelligence Review / Issue 2/2015
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI
https://doi.org/10.1007/s10462-012-9374-7
