Abstract
To protect public-use microdata, one approach is to deny users direct access to the microdata. Instead, users submit analyses to a remote computer that reports back basic output from the fitted model, such as coefficients and standard errors. To be most useful, this remote server should also provide some way for users to check the fit of their models without disclosing actual data values. This paper discusses regression diagnostics for remote servers. The proposal is to release synthetic diagnostics, i.e., simulated values of residuals and of dependent and independent variables, constructed to mimic the relationships among the real-data residuals and independent variables. Simulations show that the proposed synthetic diagnostics can reveal model inadequacies without substantially increasing the risk of disclosure. This approach can also be used to develop remote server diagnostics for generalized linear models.
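The idea of a synthetic residual plot can be illustrated with a minimal sketch. The abstract does not specify how the synthetic values are constructed, so everything below is an assumption made for illustration and not the paper's algorithm: the function name `synthetic_diagnostics`, the single-predictor setting, the smoothed-bootstrap draw of synthetic predictor values, the Gaussian kernel smoother for the local mean and spread of the residuals, and the normal-reference bandwidth are all hypothetical choices.

```python
import numpy as np


def synthetic_diagnostics(x, y, n_syn=None, bandwidth=None, seed=0):
    """Return (x_syn, r_syn, beta): synthetic predictor values, synthetic
    residuals, and the OLS coefficients from the confidential data.

    Illustrative sketch only; not the method of the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    n_syn = n if n_syn is None else n_syn

    # Fit y = b0 + b1 * x by ordinary least squares on the real data.
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Assumed bandwidth: normal-reference rule of thumb.
    if bandwidth is None:
        bandwidth = 1.06 * x.std(ddof=1) * n ** (-1 / 5)

    # Synthetic predictor values: resample real x and add kernel noise,
    # so no real value is released exactly (smoothed bootstrap).
    x_syn = rng.choice(x, size=n_syn, replace=True)
    x_syn = x_syn + rng.normal(0.0, bandwidth, size=n_syn)

    # Gaussian kernel weights of each real point for each synthetic point.
    z = (x_syn[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * z ** 2)
    w /= w.sum(axis=1, keepdims=True)

    # Local mean and local standard deviation of the real residuals.
    mean_r = w @ resid
    var_r = w @ resid ** 2 - mean_r ** 2
    sd_r = np.sqrt(np.maximum(var_r, 0.0))

    # Synthetic residuals mimic the local trend and spread, with fresh noise.
    r_syn = mean_r + sd_r * rng.normal(0.0, 1.0, size=n_syn)
    return x_syn, r_syn, beta
```

Under these assumptions, a server would return only `beta`, `x_syn`, and `r_syn`. Plotting `r_syn` against `x_syn` should show curvature when a linear model is fitted to data with a quadratic trend, while none of the plotted points is an actual record. Whether such a scheme keeps disclosure risk acceptably low depends on the bandwidth and the data, and this sketch does not assess it.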
Cite this article
Reiter, J.P. Model Diagnostics for Remote Access Regression Servers. Statistics and Computing 13, 371–380 (2003). https://doi.org/10.1023/A:1025623108012