Abstract
In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones such as hospital referral regions (HRR) or hospital service areas (HSA). Such levels constitute partitions over the underlying individual-level data, and these partitions may not match the groupings that would be obtained by clustering the data on individual-level attributes. Moreover, treating aggregated values as representative of the individuals can result in the ecological fallacy. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this article, we seek better utilization of variably aggregated datasets, which are possibly assembled from different sources. We propose a novel cross-level imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. The imputed values can then be used in subsequent predictive modeling, yielding improved accuracies. Experimental results on a simulated dataset and on the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework.
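The ecological fallacy mentioned in the abstract can be illustrated with a minimal sketch. The toy data below are hypothetical and are not taken from the article or from BRFSS; the names `pearson`, `centers`, `offsets`, and `regions` are introduced here only for illustration. The sketch builds three regions in which the two variables are perfectly negatively related among individuals, while the region-level averages are perfectly positively related, so a conclusion drawn from the aggregates alone would have the wrong sign for the individuals within each region.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three regions with five individuals each.
# Within a region, y falls as x rises; across regions, both averages rise.
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]
regions = [[(cx + d, cy - d) for d in offsets] for cx, cy in centers]

# Aggregated view: one (mean x, mean y) pair per region.
region_means = [
    (mean([x for x, _ in r]), mean([y for _, y in r])) for r in regions
]
aggregate_corr = pearson(
    [m[0] for m in region_means], [m[1] for m in region_means]
)

# Individual view: correlation among individuals inside each region.
within_corrs = [
    pearson([x for x, _ in r], [y for _, y in r]) for r in regions
]

print("correlation of region averages:", aggregate_corr)
print("correlation within each region:", within_corrs)
```

This is only a demonstration of why aggregated values cannot stand in for individual records; it does not implement the cross-level imputation technique proposed in the article.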
Supplemental Material
Supplemental movie, appendix, image, and software files for CUDIA: Probabilistic cross-level imputation using individual auxiliary information