ABSTRACT
In the past few years, the government and other agencies have publicly released a prodigious amount of data that can potentially be mined to benefit society at large. However, data such as health records are typically provided only at aggregated levels (e.g., per state or per Hospital Referral Region) to protect privacy. Unfortunately, aggregation can severely diminish the utility of such data when modeling or analysis is desired on a per-individual basis. Not surprisingly, despite the increasing abundance of aggregate data, there have been very few successful attempts to exploit them for individual-level analyses. This paper introduces LUDIA, a novel low-rank approximation algorithm that uses aggregation constraints in addition to auxiliary information to estimate, or "reconstruct," the original individual-level values from aggregate data. If the reconstructed data are statistically similar to the original individual-level data, off-the-shelf individual-level models can be readily and reliably applied for subsequent predictive or descriptive analytics. LUDIA is more robust to nonlinear estimates and random effects than other reconstruction algorithms. It solves a Sylvester equation and efficiently leverages multi-level (also known as hierarchical or mixed-effect) modeling approaches. A novel graphical model is also introduced to provide a probabilistic viewpoint of LUDIA. Experimental results using a Texas inpatient dataset show that individual-level data can be reasonably reconstructed from county-, hospital-, and zip code-level aggregate data. Several factors affecting the reconstruction quality are discussed, along with the implications of this work for current aggregation guidelines.
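To make the notion of an "aggregation constraint" concrete, the following is a minimal toy sketch and not the LUDIA algorithm itself, whose details the abstract does not give. It assumes we already have rough individual-level guesses and a published mean for each group, and it shifts each group's guesses so that their mean matches the published value (the Euclidean projection onto the constraint set). The function name `project_onto_aggregates` and all data values are hypothetical.

```python
from statistics import mean


def project_onto_aggregates(estimates, groups, group_means):
    """Shift each group's estimates so their mean equals the published aggregate.

    Toy illustration only (not LUDIA): this is the Euclidean projection onto
    the set of vectors whose per-group means match `group_means`.
    """
    # Collect the current estimates belonging to each group.
    by_group = {}
    for value, g in zip(estimates, groups):
        by_group.setdefault(g, []).append(value)

    # Constant offset each group needs in order to hit its published mean.
    offsets = {g: group_means[g] - mean(vals) for g, vals in by_group.items()}

    return [value + offsets[g] for value, g in zip(estimates, groups)]


# Hypothetical data: initial individual-level guesses and two group-level means.
estimates = [1.0, 2.0, 3.0, 10.0, 14.0]
groups = ["A", "A", "A", "B", "B"]
published = {"A": 4.0, "B": 10.0}

reconstructed = project_onto_aggregates(estimates, groups, published)
print(reconstructed)
```

After the adjustment, the values in each group average exactly to the published figure while the within-group differences of the initial guesses are preserved. A method such as the one described in the abstract would additionally use auxiliary information and a low-rank structure to decide those within-group differences.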
Index Terms
- LUDIA: an aggregate-constrained low-rank reconstruction algorithm to leverage publicly released health data