research-article

A generative framework for predictive modeling using variably aggregated, multi-source healthcare data

Authors:
Yubin Park, University of Texas at Austin, Austin, TX, USA
Joydeep Ghosh, University of Texas at Austin, Austin, TX, USA
DMMH '11: Proceedings of the 2011 workshop on Data mining for medicine and healthcare, August 2011, Pages 27–32, https://doi.org/10.1145/2023582.2023587

Published: 21 August 2011

ABSTRACT

Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level, largely due to privacy restrictions, legal issues, or reporting norms. Instead, such measures are provided at a higher, more aggregated level, such as state-level or county-level summaries, or averages over health zones such as Hospital Referral Regions (HRRs) and Hospital Service Areas (HSAs). Such levels partition the underlying individual-level data into segments that may not match the clusters one would obtain by analyzing the individual-level data directly. Moreover, different data sources may use different underlying partitions as the bases for their summaries. How can one run data mining procedures such as clustering or regression on data where different variables are available at different levels of aggregation or granularity? We first examine this problem in a clustering setting, given a mix of individual-level and (arbitrarily) aggregated data. For this setting, we present an extension of the Latent Dirichlet Allocation model that can use such aggregated information. The model provides reasonable cluster centroids under certain conditions, and is extended to impute masked features at the individual level. The imputed feature values are based on an underlying mixture distribution and help improve performance in subsequent predictive modeling tasks. The model parameters are learned using an approximate Gibbs sampling method that makes efficient use of the Metropolis-Hastings algorithm. Experimental results using data from the Dartmouth Health Atlas, the CDC, and the U.S. Census Bureau illustrate the generality and capabilities of the proposed framework.
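To make the aggregation problem concrete, the following is a minimal sketch and not the model from the paper. It assumes a toy setting with two latent clusters, one Gaussian feature, and three hypothetical regions, each with its own mixing proportions (loosely analogous to per-document topic proportions in Latent Dirichlet Allocation). All names, centroids, proportions, and sample sizes are illustrative assumptions. It shows that a published region-level average is roughly a weighted blend of the cluster centroids, so the aggregates alone do not directly reveal the centroids.

```python
import random
from statistics import fmean


def simulate_region(theta, centroids, n, rng, noise_sd=1.0):
    """Draw n individual values for one region and return them with their mean.

    theta     -- the region's mixing proportions over latent clusters (assumed)
    centroids -- per-cluster mean of the single feature (assumed)
    """
    values = []
    for _ in range(n):
        # Pick a latent cluster for this individual from the region's proportions.
        k = rng.choices(range(len(centroids)), weights=theta)[0]
        # Draw the individual's feature value around that cluster's centroid.
        values.append(rng.gauss(centroids[k], noise_sd))
    return values, fmean(values)


def expected_mean(theta, centroids):
    """Region mean implied by the mixture: sum over k of theta_k * centroid_k."""
    return sum(t * c for t, c in zip(theta, centroids))


rng = random.Random(0)  # fixed seed so the run is repeatable
centroids = [0.0, 10.0]  # hypothetical cluster centroids
region_thetas = {"A": [0.9, 0.1], "B": [0.5, 0.5], "C": [0.2, 0.8]}

report = {}
for name, theta in region_thetas.items():
    # Individual-level values are discarded, mimicking a source that
    # publishes only the region-level average.
    _, observed = simulate_region(theta, centroids, 5000, rng)
    report[name] = (observed, expected_mean(theta, centroids))
    print(f"region {name}: published mean = {observed:.2f}, "
          f"mixture mean = {report[name][1]:.2f}")
```

In this toy setup no region's published mean equals either centroid, because every region mixes both clusters. Recovering the centroids therefore requires reasoning about the mixing proportions as well, which is the kind of inference a generative mixture model is designed to perform.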

Index Terms

A generative framework for predictive modeling using variably aggregated, multi-source healthcare data
1. Applied computing
  1. Life and medical sciences
    1. Consumer health
    2. Health informatics
2. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic algorithms
    2. Probabilistic reasoning algorithms
      1. Markov-chain Monte Carlo methods
      2. Sequential Monte Carlo methods

Published in
DMMH '11: Proceedings of the 2011 workshop on Data mining for medicine and healthcare
August 2011
86 pages
ISBN: 9781450308434
DOI: 10.1145/2023582
Program Chairs:
Nitesh Chawla (University of Notre Dame)
Rayid Ghani (Accenture Technology Labs)
Jianying Hu (IBM T.J. Watson Research Center)
Balaji Krishnapuram (Siemens Medical Solutions)
Mohit Kumar (Accenture Technology Labs)
David Madigan (Columbia University)
Jonathan Silverstein (NorthShore University HealthSystem)
Jimeng Sun (IBM T.J. Watson Research Center)
K. P. Unnikrishnan
Ramasamy Uthurusamy (General Motors)
Fei Wang (IBM T.J. Watson Research Center)
John Younger (University of Michigan)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
dartmouth health atlas
lda
multi-source health metrics
privacy preserving datamining