Abstract
Quantitative attributes are usually discretized in Naive-Bayes learning. We establish simple conditions under which discretization is equivalent to use of the true probability density function during naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose proportional discretization and fixed frequency discretization, two efficient unsupervised discretization methods that are able to effectively manage discretization bias and variance. We evaluate our new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses by showing that with statistically significant frequency, naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by current established discretization methods.
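As a rough illustration of the two proposed methods (a sketch, not the authors' implementation), both can be viewed as equal-frequency cutting with different interval counts: proportional discretization lets both the number of intervals and the instances per interval grow as roughly the square root of the training size, while fixed frequency discretization holds the per-interval frequency constant. The √n sizing and the default frequency `m = 30` are illustrative assumptions here.

```python
import math

def equal_frequency_cut_points(values, k):
    """Return k-1 cut points splitting sorted values into k
    (approximately) equal-frequency intervals."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // k] for i in range(1, k)]

def proportional_discretization(values):
    """Proportional discretization (sketch): set both the number of
    intervals and the expected instances per interval to about sqrt(n),
    so both grow with the amount of training data."""
    n = len(values)
    k = max(1, round(math.sqrt(n)))  # number of intervals
    return equal_frequency_cut_points(values, k)

def fixed_frequency_discretization(values, m=30):
    """Fixed frequency discretization (sketch): keep roughly m training
    instances per interval, so the number of intervals grows with n
    while per-interval frequency stays fixed (m=30 is illustrative)."""
    n = len(values)
    k = max(1, n // m)  # number of intervals
    return equal_frequency_cut_points(values, k)
```

On this reading, the two methods trade off the same quantities in opposite directions: more intervals lower discretization bias, while more instances per interval lower discretization variance.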
Editor: Dan Roth.
Cite this article
Yang, Y., Webb, G.I. Discretization for naive-Bayes learning: managing discretization bias and variance. Mach Learn 74, 39–74 (2009). https://doi.org/10.1007/s10994-008-5083-5