ABSTRACT
Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level, largely due to privacy restrictions, legal issues, or reporting norms. Instead, such measures are provided at a higher or more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRRs) and Hospital Service Areas (HSAs). Such levels constitute partitionings of the underlying individual-level data into segments that may not match the data clusters that would have been obtained if one analyzed individual-level data. Moreover, different data sources may use different underlying partitions as the bases for their data summarization. How can one run data mining procedures such as clustering or regression on data where different variables are available at different levels of aggregation or granularity? We first examine this problem in a clustering setting, given a mix of individual-level and (arbitrarily) aggregated-level data. For this setting, we present an extension of the Latent Dirichlet Allocation model that can use such aggregated information. The model provides reasonable cluster centroids under certain conditions, and is extended to impute masked features at the individual level. The imputed feature values are based on an underlying mixture distribution, and help to improve performance in subsequent predictive modeling tasks. The model parameters are learned using an approximate Gibbs sampling method that makes efficient use of the Metropolis-Hastings algorithm. Experimental results using data from the Dartmouth Health Atlas, the CDC, and the U.S. Census Bureau are provided to illustrate the generality and capabilities of the proposed framework.
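The mismatch between aggregation partitions and latent clusters can be illustrated with a small simulation. The sketch below is not the model described in the abstract; it is a minimal, hypothetical example (the function names, the two cluster centers, and the region mixing proportions are all assumptions made for illustration) showing that region-level averages fall between the latent cluster centroids rather than recovering them.

```python
import random
from statistics import mean


def simulate(n_regions=5, per_region=200, seed=0):
    """Draw individuals from two latent clusters and tag each with a region.

    All values here are illustrative assumptions, not data from the paper.
    """
    rng = random.Random(seed)
    centers = (0.0, 10.0)  # hypothetical latent cluster centroids
    individuals = []  # list of (region, cluster, value) tuples
    for region in range(n_regions):
        # Each region mixes the two clusters in a different proportion.
        p_cluster1 = (region + 1) / (n_regions + 1)
        for _ in range(per_region):
            cluster = 1 if rng.random() < p_cluster1 else 0
            value = rng.gauss(centers[cluster], 1.0)
            individuals.append((region, cluster, value))
    return centers, individuals


def region_means(individuals):
    """Aggregate the way a public data source might: one mean per region."""
    by_region = {}
    for region, _cluster, value in individuals:
        by_region.setdefault(region, []).append(value)
    return {r: mean(vals) for r, vals in sorted(by_region.items())}


def cluster_means(individuals):
    """Means by latent cluster, available only with individual-level data."""
    by_cluster = {}
    for _region, cluster, value in individuals:
        by_cluster.setdefault(cluster, []).append(value)
    return {c: mean(vals) for c, vals in sorted(by_cluster.items())}


if __name__ == "__main__":
    centers, people = simulate()
    print("latent cluster means:", cluster_means(people))
    print("region-level means:  ", region_means(people))
```

In this toy setting, each region-level mean is a blend of the two cluster centroids weighted by that region's mixing proportion, so clustering the aggregates directly would not reveal the underlying groups. That gap is the motivation for a model that treats the aggregates as summaries of a latent mixture.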
Index Terms
- A generative framework for predictive modeling using variably aggregated, multi-source healthcare data