research-article

LUDIA: an aggregate-constrained low-rank reconstruction algorithm to leverage publicly released health data

Authors:
Yubin Park, The University of Texas at Austin, Austin, USA
Joydeep Ghosh, The University of Texas at Austin, Austin, USA
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2014, Pages 55–64, https://doi.org/10.1145/2623330.2623659

Published: 24 August 2014

ABSTRACT

In the past few years, the government and other agencies have publicly released a prodigious amount of data that can potentially be mined to benefit society at large. However, data such as health records are typically provided only at aggregated levels (e.g., per state or per Hospital Referral Region) to protect privacy. Unfortunately, aggregation can severely diminish the utility of such data when modeling or analysis is desired on a per-individual basis. Not surprisingly, despite the increasing abundance of aggregate data, there have been very few successful attempts to exploit them for individual-level analyses. This paper introduces LUDIA, a novel low-rank approximation algorithm that uses aggregation constraints together with auxiliary information to estimate, or "reconstruct", the original individual-level values from aggregate data. If the reconstructed data are statistically similar to the original individual-level data, off-the-shelf individual-level models can be readily and reliably applied for subsequent predictive or descriptive analytics. LUDIA is more robust to nonlinear estimates and random effects than other reconstruction algorithms. It solves a Sylvester equation and efficiently leverages multi-level (also known as hierarchical or mixed-effect) modeling approaches. A novel graphical model is also introduced to provide a probabilistic viewpoint of LUDIA. Experimental results using a Texas inpatient dataset show that individual-level data can be reasonably reconstructed from county-, hospital-, and zip code-level aggregate data. Several factors affecting the reconstruction quality are discussed, along with the implications of this work for current aggregation guidelines.
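
The abstract does not give LUDIA's objective, so the following is only a minimal sketch of how an aggregation constraint and a Sylvester equation can arise together. It assumes a hypothetical toy objective, ||A X - Z||^2 + lam * ||X B - Y||^2, where A is an averaging matrix, Z holds the released aggregates, and Y = X B + noise is invented auxiliary individual-level information. All names, sizes, and the objective itself are illustrative assumptions, not the paper's formulation. Setting the gradient to zero gives (A^T A) X + X (lam * B B^T) = A^T Z + lam * Y B^T, which has the Sylvester form P X + X Q = C.

```python
import numpy as np


def solve_sylvester_kron(P, Q, C):
    """Solve P X + X Q = C by vectorization (only sensible for small problems)."""
    n, d = C.shape
    # With column-major vec: vec(P X) = (I_d kron P) vec(X),
    #                        vec(X Q) = (Q^T kron I_n) vec(X).
    K = np.kron(np.eye(d), P) + np.kron(Q.T, np.eye(n))
    x = np.linalg.solve(K, C.reshape(-1, order="F"))
    return x.reshape((n, d), order="F")


rng = np.random.default_rng(0)
n, d, g, k = 12, 3, 4, 5  # individuals, features, groups, auxiliary columns

# Hypothetical rank-2 individual-level data (unknown to the analyst).
X_true = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))

# Averaging aggregation: each group releases the mean of its 3 members.
group_of = np.repeat(np.arange(g), n // g)
A = np.zeros((g, n))
A[group_of, np.arange(n)] = 1.0 / (n // g)
Z = A @ X_true  # the only view of X_true that is publicly released

# Hypothetical auxiliary individual-level information (an assumption).
B = rng.normal(size=(d, k))
Y = X_true @ B + 0.1 * rng.normal(size=(n, k))

# Stationarity condition of the toy objective, in Sylvester form.
lam = 1.0
P = A.T @ A
Q = lam * (B @ B.T)
C = A.T @ Z + lam * (Y @ B.T)

X_hat = solve_sylvester_kron(P, Q, C)
residual = np.linalg.norm(P @ X_hat + X_hat @ Q - C)
print("Sylvester residual:", residual)
print("reconstruction error:", np.linalg.norm(X_hat - X_true))
```

The Kronecker formulation builds an (n*d) x (n*d) system, so it is shown here only because it makes the structure of the equation explicit. For larger problems, a dedicated solver such as `scipy.linalg.solve_sylvester` would be the more practical choice.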

Published in
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
2028 pages
ISBN:9781450329569
DOI:10.1145/2623330
General Chairs: Sofus Macskassy (Facebook), Claudia Perlich (Dstillery)
Program Chairs: Jure Leskovec (Stanford University), Wei Wang (UCLA), Rayid Ghani (University of Chicago)
Copyright © 2014 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: data aggregation, low rank approximation, multi-level model
Acceptance Rates
KDD '14 Paper Acceptance Rate: 151 of 1,036 submissions, 15%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%