Erschienen in:

1996 | OriginalPaper | Buchkapitel

An Exact Probability Metric for Decision Tree Splitting

verfasst von : J. Kent Martin

Erschienen in: Learning from Data

Verlag: Springer New York

Enthalten in: Professional Book Archive

Zugang erhalten

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

ID3’s information gain heuristic is well-known to be biased towards multi-valued attributes. This bias is only partially compensated by the gain ratio used in C4.5. Several alternatives have been proposed, notably orthogonality and Beta. Gain ratio and orthogonality are strongly correlated, and all of the metrics share a common bias towards splits with one or more small expected values, under circumstances where the split likely ocurred by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the posterior probability of the null hypothesis. Both gain and the chi-squared significance test are shown to arise in asymptotic approximations to the hypergeometric, revealing similar criteria for admissibility and showing the nature of their biases. Previous failures to find admissible stopping rules are traced to coupling these biased approximations with one another or with arbitrary thresholds; problems which are overcome by the hypergeometric. Empirical results show that pre-pruning should be done, as trees pruned in this way are simpler, more efficient, and no less accurate than unpruned trees. Average training time is reduced by up to 30%, and expensive post-pruning avoided.

Springer Professional