18-06-2022 | Regular Paper

Sample-selection-adjusted random forests

Author: Jonathan Cook

Published in: International Journal of Data Science and Analytics


Abstract

A predictive model trained on a non-randomly selected sample can give biased predictions for the population. This paper discusses when non-random selection is a problem and, for the applications in which it is, presents a procedure for adjusting the predictions of a random forest to account for non-random sampling of the training data. This adjustment yields more accurate predictions for the population. The paper also warns against using inverse probability weighting to analyze selected samples.
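The bias the abstract describes is easy to reproduce. The sketch below is an illustration of the problem only, not the paper's adjustment procedure; the data-generating process, the logistic selection mechanism, and all variable names are invented for this example. It simulates a population, observes units with higher outcomes more often, trains a scikit-learn random forest on the selected sample, and shows that the mean prediction over the full population drifts away from the true population mean.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Simulated population: outcome y depends on a single feature x plus noise.
n = 5000
x = rng.normal(size=(n, 1))
y = x[:, 0] + rng.normal(size=n)

# Non-random selection (assumed mechanism for illustration): units with
# larger y are more likely to enter the training sample, so the sample
# over-represents high outcomes.
selected = rng.random(n) < 1.0 / (1.0 + np.exp(-y))

# Train on the selected sample only.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x[selected], y[selected])

# Predictions for the whole population are shifted upward relative to
# the true population mean of y (which is approximately zero here).
print(f"true population mean of y: {y.mean():.3f}")
print(f"mean RF prediction:        {rf.predict(x).mean():.3f}")
```

Because selection here depends on the outcome itself even after conditioning on `x`, the conditional mean the forest learns from the sample differs from the population conditional mean — the situation in which, per the abstract, an adjustment is needed.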
Footnotes
1. The R code files used to generate this example and other examples in the paper are available at https://github.com/JonathanCook2/SARF.
2. These data are available at the University of California at Irvine's Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html.
Metadata
Title
Sample-selection-adjusted random forests
Author
Jonathan Cook
Publication date
18-06-2022
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-022-00337-w
