DOI: 10.1145/2623330.2623659
research-article

LUDIA: an aggregate-constrained low-rank reconstruction algorithm to leverage publicly released health data

Published: 24 August 2014

ABSTRACT

In the past few years, the government and other agencies have publicly released a prodigious amount of data that can potentially be mined to benefit society at large. However, data such as health records are typically provided only at aggregated levels (e.g., per state or per Hospital Referral Region) to protect privacy. Unfortunately, aggregation can severely diminish the utility of such data when modeling or analysis is desired on a per-individual basis. Not surprisingly, despite the increasing abundance of aggregate data, there have been very few successful attempts to exploit them for individual-level analyses. This paper introduces LUDIA, a novel low-rank approximation algorithm that uses aggregation constraints in addition to auxiliary information to estimate, or "reconstruct," the original individual-level values from aggregate data. If the reconstructed data are statistically similar to the original individual-level data, off-the-shelf individual-level models can be readily and reliably applied for subsequent predictive or descriptive analytics. LUDIA is more robust to nonlinear estimates and random effects than other reconstruction algorithms. It solves a Sylvester equation and efficiently leverages multi-level (also known as hierarchical or mixed-effect) modeling approaches. A novel graphical model is also introduced to provide a probabilistic viewpoint of LUDIA. Experimental results on a Texas inpatient dataset show that individual-level data can be reasonably reconstructed from county-, hospital-, and zip code-level aggregate data. Several factors affecting the reconstruction quality are discussed, along with the implications of this work for current aggregation guidelines.
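The abstract does not state LUDIA's objective function, so the following is only a minimal sketch of how a Sylvester equation can arise in an aggregate-constrained low-rank fit. Everything in it is an assumption made for illustration and is not taken from the paper: the ridge-regularized objective, the penalty `lam`, the helper name `update_individual_factors`, the group-averaging matrix `A`, and the synthetic data. It uses `scipy.linalg.solve_sylvester`, which solves `P U + U Q = C`.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def update_individual_factors(A, Y, V, lam):
    """Hypothetical helper: one factor update for a model X ~= U @ V.T.

    Solves  min_U ||A @ U @ V.T - Y||_F^2 + lam * ||U||_F^2  with V fixed.
    Setting the gradient to zero gives
        A.T A U (V.T V) + lam U = A.T Y V,
    and right-multiplying by inv(V.T V) yields the Sylvester form
        P U + U Q = C.
    """
    G_inv = np.linalg.inv(V.T @ V)   # (r, r); assumes V has full column rank
    P = A.T @ A                      # (n, n); positive semidefinite
    Q = lam * G_inv                  # (r, r); positive definite when lam > 0
    C = A.T @ Y @ V @ G_inv          # (n, r)
    return solve_sylvester(P, Q, C)


# Synthetic toy data (assumed): n individuals, d features, rank r, g groups.
rng = np.random.default_rng(0)
n, d, r, g = 12, 5, 2, 4
X_true = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))

# A maps individual-level rows to group averages (three individuals per group).
groups = np.repeat(np.arange(g), n // g)
A = np.zeros((g, n))
A[groups, np.arange(n)] = 1.0 / (n // g)
Y = A @ X_true                       # only these aggregates are "released"

V = rng.normal(size=(d, r))          # feature factors, held fixed in this step
lam = 0.1
U = update_individual_factors(A, Y, V, lam)

# Gradient of the objective (up to a factor of 2); it vanishes at the solution.
grad = A.T @ (A @ U @ V.T - Y) @ V + lam * U
```

Because this toy objective uses no auxiliary individual-level information, individuals in the same group receive identical rows of `U`; the aggregates alone cannot tell them apart, which is consistent with the abstract's emphasis on auxiliary information.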

Supplemental Material

p55-sidebyside.mp4 (mp4, 282 MB)

Published in

KDD '14: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2014
2028 pages
ISBN: 9781450329569
DOI: 10.1145/2623330

Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

KDD '14 paper acceptance rate: 151 of 1,036 submissions, 15%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
