research-article

CUDIA: Probabilistic cross-level imputation using individual auxiliary information

Published: 08 October 2013

Abstract

In healthcare-related studies, individual patient or hospital data are often not publicly available due to privacy restrictions, legal issues, or reporting norms. However, such measures may be provided at a higher, more aggregated level, for example as state-level or county-level summaries, or as averages over health zones such as hospital referral regions (HRR) or hospital service areas (HSA). Such levels constitute partitions of the underlying individual-level data, and these partitions may not match the groupings that would have been obtained by clustering the data on individual-level attributes. Moreover, treating aggregated values as representative of the individuals can result in the ecological fallacy. How can one run data mining procedures on data in which different variables are available at different levels of aggregation or granularity? In this article, we seek better utilization of variably aggregated datasets, possibly assembled from different sources. We propose a novel cross-level imputation technique that models the generative process of such datasets using a Bayesian directed graphical model. The imputation is based on the underlying data distribution and is shown to be unbiased. It can be further utilized in subsequent predictive modeling, yielding improved accuracy. Experimental results on a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework.
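
The ecological fallacy mentioned in the abstract can be illustrated with a small, self-contained sketch. This is not the article's method or data: the three regions, their baselines, and the two variables are hypothetical toy values chosen so that the correlation computed from region averages has the opposite sign of the correlation computed from the individuals themselves.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical toy data (an assumption, not from the article):
# three regions with baselines 0, 1, 2. Inside each region, x rises
# exactly as y falls, so the individual-level relationship is negative.
regions = {
    base: [(base + d, base - d) for d in (-2, -1, 0, 1, 2)]
    for base in (0, 1, 2)
}

# Individual-level correlation: pool every (x, y) pair.
individuals = [pair for pairs in regions.values() for pair in pairs]
individual_r = pearson([x for x, _ in individuals],
                       [y for _, y in individuals])

# Aggregate-level correlation: one (mean x, mean y) pair per region,
# which is all an analyst sees when only regional averages are released.
group_means = [
    (sum(x for x, _ in pairs) / len(pairs),
     sum(y for _, y in pairs) / len(pairs))
    for pairs in regions.values()
]
group_r = pearson([x for x, _ in group_means],
                  [y for _, y in group_means])

print(f"individual-level r = {individual_r:.2f}")  # -0.50
print(f"region-level r     = {group_r:.2f}")       # 1.00
```

In this toy setting, an analyst who read the region-level correlation as a statement about individuals would get the sign of the relationship wrong, which is the kind of error that motivates imputing individual-level values rather than substituting aggregates.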




  • Published in

    ACM Transactions on Intelligent Systems and Technology  Volume 4, Issue 4
    Survey papers, special sections on the semantic adaptive social web, intelligent systems for health informatics, regular papers
    September 2013
    452 pages
    ISSN:2157-6904
    EISSN:2157-6912
    DOI:10.1145/2508037

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 8 October 2013
    • Accepted: 1 August 2012
    • Revised: 1 May 2012
    • Received: 1 December 2011

    Qualifiers

    • research-article
    • Research
    • Refereed
