ABSTRACT
There has historically been very little concern with extrapolation in Machine Learning, yet extrapolation can be critical to diagnose. Predictor functions are almost always learned on a set of highly correlated data comprising a very small segment of predictor space. Moreover, flexible predictors, by their very nature, are not controlled at points of extrapolation. This becomes a problem for diagnostic tools that require evaluation on a product distribution. It is also an issue when we are trying to optimize a response over some variable in the input space. Finally, it can be a problem in non-static systems in which the underlying predictor distribution gradually drifts with time, or when typographical errors misrecord the values of some predictors.
We present a diagnosis for extrapolation as a statistical test of whether a point originates from the data distribution, as opposed to a uniform distribution under the null hypothesis. This allows us to employ general classification methods for estimating such a test statistic. Further, we observe that CART can be modified to accept an exact distribution as an argument, providing a better classification tool which becomes our extrapolation-detection procedure. We explore some of the advantages of this approach and present examples of its practical application.
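The idea of treating extrapolation detection as a classification problem between observed data and a uniform reference sample can be illustrated with a minimal sketch. This is not the modified CART procedure the abstract describes: a plain k-nearest-neighbour vote stands in for the classifier, the uniform distribution is approximated by a finite sample over the bounding box of the data, and the names `uniform_background` and `data_score` are hypothetical.

```python
import random


def uniform_background(data, n, rng):
    """Draw n points uniformly from the bounding box of the data (assumed reference region)."""
    dims = range(len(data[0]))
    lows = [min(p[d] for p in data) for d in dims]
    highs = [max(p[d] for p in data) for d in dims]
    return [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in range(n)]


def data_score(x, data, background, k=15):
    """Fraction of the k nearest neighbours of x that are real data points.

    A stand-in classifier: values near 1 suggest x lies where the data live,
    values near 0 suggest x is a point of extrapolation.
    """
    labelled = [(p, 1) for p in data] + [(p, 0) for p in background]
    labelled.sort(key=lambda pl: sum((a - b) ** 2 for a, b in zip(pl[0], x)))
    return sum(label for _, label in labelled[:k]) / k


rng = random.Random(0)

# Highly correlated 2-D data: points scattered tightly around the diagonal of the unit square.
data = []
for _ in range(300):
    t = rng.random()
    data.append((t, min(1.0, max(0.0, t + rng.gauss(0, 0.03)))))

background = uniform_background(data, 300, rng)

on_diag = data_score((0.5, 0.5), data, background)   # inside the data cloud
off_diag = data_score((0.1, 0.9), data, background)  # inside the bounding box, far from the data
```

A point such as `(0.1, 0.9)` is within the marginal range of each predictor yet far from any observed combination, which is the situation the abstract flags for diagnostics evaluated on a product distribution. Replacing the finite uniform sample with the exact uniform distribution is the refinement the abstract attributes to the modified CART.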
Index Terms
- Diagnosing extrapolation: tree-based density estimation