BOAT—optimistic decision tree construction

ABSTRACT
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. Decision trees are a very popular class of classifiers. All current algorithms for constructing decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree.
We introduce BOAT, a new algorithm for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, yielding an average performance gain of 300% over previous work. The key to this improvement is a novel optimistic approach to tree construction: we build an initial tree from a small subset of the data and then refine it to arrive at the final tree. We guarantee that any difference with respect to the "real" tree (i.e., the tree that would be constructed by examining all the data in the traditional way) is detected and corrected. The correction step occasionally requires additional scans over subsets of the data; in practice this situation rarely arises, and when it does it can be handled with little added cost.
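The optimistic approach rests on a statistical idea: bootstrapping an in-memory sample yields a confidence interval for each numerical split point, so the exact split only has to be confirmed against the full data within that narrow interval. Below is a minimal sketch of that bootstrapped-interval idea in Python; all function names are hypothetical illustrations of the statistical technique, not the authors' implementation.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points):
    """Exhaustive best binary split on a list of (value, label) pairs.

    Returns the midpoint between the two values that minimizes the
    weighted Gini impurity of the resulting partitions, or None if
    no split exists (all values equal).
    """
    points = sorted(points)
    labels = [lbl for _, lbl in points]
    best_val, best_score = None, float("inf")
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # no boundary between equal values
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if score < best_score:
            best_score = score
            best_val = (points[i - 1][0] + points[i][0]) / 2
    return best_val

def bootstrap_split_interval(sample, trials=30, seed=0):
    """Resample the in-memory sample with replacement `trials` times,
    compute the best split on each resample, and return the range of
    observed split points as a (low, high) interval."""
    rng = random.Random(seed)
    splits = []
    for _ in range(trials):
        resample = [rng.choice(sample) for _ in sample]
        s = best_split(resample)
        if s is not None:
            splits.append(s)
    return min(splits), max(splits)
```

The payoff is that the subsequent scan over the full database only needs to examine records whose attribute value falls inside the (typically narrow) interval in full detail; everything outside it can be processed with simple counting, which is what lets several tree levels be built in few scans.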
Beyond offering faster tree construction, BOAT is the first scalable algorithm that can incrementally update the tree with respect to both insertions and deletions to the dataset. This property is valuable in dynamic environments such as data warehouses, where the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that a complete rebuild would produce.