Abstract
Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates.
This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces “model trees”—decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes, applied to regression problems by discretizing the target value, performs similarly poorly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.
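The core idea described above—estimating the posterior p(y | x) ∝ p(y) ∏_j p(x_j | y) with kernel density estimators and predicting from that posterior—can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the Gaussian kernel, the fixed bandwidths `h_x` and `h_y`, evaluating the posterior only at the training targets, and predicting the posterior mean are all simplifying assumptions made here for brevity.

```python
import numpy as np

def gauss_kernel(u, h):
    """Gaussian kernel with bandwidth h (illustrative choice)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def nb_regress(X_train, y_train, x_test, h_x=0.5, h_y=0.5):
    """Naive Bayes for regression, sketched with kernel density estimates.

    The posterior p(y | x) ∝ p(y) * prod_j p(x_j | y) is evaluated at the
    observed training targets, and the prediction is its mean. Both p(y)
    and each conditional p(x_j | y) come from Gaussian kernel smoothing
    over the training data.
    """
    grid = y_train  # candidate target values: the training targets themselves
    # p(y): kernel density estimate over the training targets, at each grid point
    w_y = gauss_kernel(grid[:, None] - y_train[None, :], h_y)  # shape (grid, n)
    p_y = w_y.mean(axis=1)
    post = p_y.copy()
    for j in range(X_train.shape[1]):
        # p(x_j | y): weight each training case by its kernel proximity in y,
        # then smooth over attribute j at the query value x_test[j]
        k_x = gauss_kernel(x_test[j] - X_train[:, j], h_x)     # shape (n,)
        p_xj_given_y = (w_y * k_x[None, :]).sum(axis=1) / w_y.sum(axis=1)
        post *= p_xj_given_y
    post /= post.sum()                 # normalise over the grid
    return float((post * grid).sum())  # posterior mean as the prediction
```

For example, on the toy data `X_train = [[0], [1], [2], [3]]` with targets `y = [0, 1, 2, 3]`, querying `x_test = [1.5]` yields a prediction of 1.5 by symmetry. Note that predicting the posterior mean is one choice of loss; the paper's point is that however the posterior is summarised, numeric prediction is more sensitive to distortions of p(y | x)—such as those induced by the independence assumption—than classification, which only needs the correct class to receive maximum probability.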
Cite this article
Frank, E., Trigg, L., Holmes, G. et al. Technical Note: Naive Bayes for Regression. Machine Learning 41, 5–25 (2000). https://doi.org/10.1023/A:1007670802811