DOI: 10.1145/2023582.2023587 (KDD conference proceedings)
research-article

A generative framework for predictive modeling using variably aggregated, multi-source healthcare data

Published: 21 August 2011

ABSTRACT

Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level, largely due to privacy restrictions, legal issues or reporting norms. Instead, such measures are provided at a higher or more aggregated level, such as state-level or county-level summaries, or averages over health zones (Hospital Referral Regions, HRRs, and Hospital Service Areas, HSAs). Such levels constitute partitionings of the underlying individual-level data into segments that may not match the data clusters that would have been obtained if one analyzed individual-level data. Moreover, different data sources may use different underlying partitions as the bases for their data summarization. How can one run data mining procedures such as clustering or regression on data where different variables are available at different levels of aggregation or granularity? We first examine this problem in a clustering setting given a mix of individual-level and (arbitrarily) aggregated-level data. For this setting, we present an extension of the Latent Dirichlet Allocation model that can use such aggregated information. The model provides reasonable cluster centroids under certain conditions, and is extended to impute masked features at the individual level. The imputed feature values are based on an underlying mixture distribution, and help to improve the performance in subsequent predictive modeling tasks. The model parameters are learned using an approximated Gibbs sampling method, which employs the Metropolis-Hastings algorithm efficiently. Experimental results using data from the Dartmouth Health Atlas, CDC, and the U.S. Census Bureau are provided to illustrate the generality and capabilities of the proposed framework.
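The mismatch between aggregation partitions and underlying data clusters can be seen in a toy example. The sketch below is only an illustration under assumed data, not the paper's model: the records, the region labels "A" and "B", the latent cluster labels, and the helper `group_means` are all hypothetical. It compares the centroids one would get from individual-level cluster labels with the per-region averages that an aggregated data source would publish.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (region, latent_cluster, feature_value).
# Each region mixes individuals from both latent clusters.
records = [
    ("A", 0, 1.0), ("A", 0, 2.0), ("A", 1, 11.0),
    ("B", 0, 3.0), ("B", 1, 12.0), ("B", 1, 13.0),
]

def group_means(rows, key_index):
    """Average the feature value (column 2) within each group given by column key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: mean(values) for key, values in groups.items()}

# Centroids that individual-level data (with known clusters) would reveal.
cluster_centroids = group_means(records, 1)
# Averages that a region-level (aggregated) source would report instead.
region_averages = group_means(records, 0)

print("cluster centroids:", cluster_centroids)
print("region averages:  ", region_averages)
```

In this assumed data set the cluster centroids are 2.0 and 12.0, while both region averages fall strictly between them, so treating the published regional averages as cluster centers would blur the two groups together.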

