Statistical Disclosure Control for Microdata

Methods and Applications in R

Author: Matthias Templ

Publisher: Springer International Publishing

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book on statistical disclosure control presents the theory, applications and software implementation of the traditional approach to (micro)data anonymization, including data perturbation methods, disclosure risk, data utility, information loss and methods for simulating synthetic data. Introducing readers to the R packages sdcMicro and simPop, the book also features numerous examples and exercises with solutions, as well as case studies with real-world data, accompanied by the underlying R code to allow readers to reproduce all results.

The demand for and volume of data from surveys, registers or other sources containing sensitive information on persons or enterprises have increased significantly over the last several years. At the same time, privacy protection principles and regulations have imposed restrictions on the access and use of individual data. Proper and secure microdata dissemination calls for the application of statistical disclosure control methods to the data before release.

This book is intended for practitioners at statistical agencies and other national and international organizations that deal with confidential data. It will also be interesting for researchers working in statistical disclosure control and the health sciences.

Table of contents

Frontmatter

Chapter 1. Software

Abstract

The methods used in this book are exclusively available in the software environment R (R Development Core Team 2014). A very brief introduction to some functionalities of R is given. This introduction does not replace a general introduction to R but covers the points needed to understand the examples and the R code in the book. The package sdcMicro (Templ et al. 2015) forms the basis for this book and includes all presented SDC methods. It is free, open-source and available from the Comprehensive R Archive Network (CRAN). This package implements popular statistical disclosure methods for risk estimation such as the suda2 algorithm, the individual risk approach and risk measurement using log-linear models. In addition, perturbation methods such as global recoding, local suppression, post-randomization, microaggregation, adding correlated noise, shuffling and various other methods are integrated. With the package sdcMicro, statistical disclosure control methods can be applied in an exploratory, interactive and user-friendly manner. All results are saved in a structured manner and are updated automatically as soon as a method is applied. Print and summary methods summarize the status of disclosure risk and data utility, and reports can be generated automatically. In addition, most anonymization tasks can be carried out without knowledge of the software environment R, either with the point-and-click graphical user interface (GUI) sdcMicroGUI (Kowarik et al. 2013) or with its newer version, an app that is available within the package sdcMicro as the function sdcApp. The new version runs in a browser and is based on shiny (Chang et al. 2016). A software package with a similar concept to sdcMicro, the simPop package (Templ et al. 2017), is used to generate synthetic data sets.

Matthias Templ

Chapter 2. Basic Concepts

Abstract

This chapter introduces the basic concepts related to statistical disclosure. It presents definitions for certain groups of variables such as sensitive variables or key variables. These definitions are crucial for all other chapters, since SDC methods differ depending on the variables chosen. In addition, basic intruder scenarios are described, such as identity, attribute and inferential disclosure. The chapter ends with a discussion of the trade-off between disclosure risk and information loss: the more the disclosure risk is reduced, the higher the information loss and the lower the data utility. The concept of risk-utility maps, which report this trade-off, is explained based on real data.

Matthias Templ

Chapter 3. Disclosure Risk

Abstract

One of the key tasks in SDC is to estimate the disclosure risk of individuals and also a global risk for the whole data set. A very basic idea is to calculate frequency counts of the categorical key variables. The concepts of uniqueness, k-anonymity and l-diversity are important and are outlined first. SUDA extends the concept of k-anonymity: it also searches for uniqueness in subsets of key variables. For surveys from complex designs, the estimation of frequency counts in the population and sample is of central interest. Mainly two approaches are used: the individual risk approach and the estimation of the global risk by log-linear models. For continuous key variables, other concepts are used to estimate the disclosure risk. They are based on distances rather than on counts. The risk estimation concepts presented here evaluate original data sets or data sets that are modified through traditional (perturbative) anonymization methods.
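
The frequency-count idea behind k-anonymity can be illustrated with a minimal Python sketch. It is not the sdcMicro implementation; the function names, the toy records and the choice of key variables are assumptions made only for illustration. A record violates k-anonymity when its combination of key-variable values occurs fewer than k times in the sample.

```python
from collections import Counter


def key_frequencies(records, key_vars):
    """Count how often each combination of key-variable values occurs."""
    return Counter(tuple(r[v] for v in key_vars) for r in records)


def violates_k_anonymity(records, key_vars, k):
    """Return the records whose key combination occurs fewer than k times."""
    counts = key_frequencies(records, key_vars)
    return [r for r in records if counts[tuple(r[v] for v in key_vars)] < k]


# Hypothetical toy sample; the variables are illustrative, not from the book.
records = [
    {"age": "30-39", "sex": "f", "region": "north"},
    {"age": "30-39", "sex": "f", "region": "north"},
    {"age": "40-49", "sex": "m", "region": "south"},
    {"age": "40-49", "sex": "m", "region": "south"},
    {"age": "60-69", "sex": "f", "region": "south"},
]
key_vars = ["age", "sex", "region"]

# With k=2, every record that is unique on the key variables is flagged.
at_risk = violates_k_anonymity(records, key_vars, k=2)
print(len(at_risk), "record(s) violate 2-anonymity")
```

This sketch only counts sample frequencies. It does not use sampling weights and does not estimate population frequencies, which the chapter treats separately for complex survey designs.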

Matthias Templ

Chapter 4. Methods for Data Perturbation

Abstract

Methods for perturbation of data differ for categorical and continuous variables. The risk for categorical key variables depends on the frequency counts of keys, and keys with only a few observations are problematic. Categories of categorical key variables with low frequency counts are therefore often recoded and combined with other categories. However, the disclosure risk may still be too high for some individuals. Local suppression is one method to further reduce the disclosure risk. In order to find a well-balanced, suitable solution, global recoding is usually applied in an exploratory manner to see which (reasonable) recodings achieve the best effect in terms of reducing the disclosure risk while providing high data utility. Especially with a large number of key variables, swapping methods such as post-randomization (PRAM) are good alternatives. Methods for continuous variables either combine values (microaggregation) or add noise to the values. Advanced methods such as shuffling make it possible to preserve certain statistics.
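
As a rough illustration of microaggregation, the following Python sketch handles a single continuous variable. It is a simplified univariate variant written for this purpose, not the method implemented in sdcMicro; the function name and the toy income values are hypothetical. Values are sorted, split into consecutive groups of at least k observations, and each value is replaced by its group mean.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k sorted neighbours."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the observations, ordered by their value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Consecutive groups of size k along the sorted order.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # A short tail group is folded into the previous group so that
    # every group keeps at least k members.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [0.0] * len(values)
    for group in groups:
        group_mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = group_mean
    return result


# Hypothetical income values, for illustration only.
incomes = [1200, 5000, 1300, 4800, 1250, 5100, 9000]
masked = microaggregate(incomes, k=3)
print(masked)
```

Because each group is replaced by its own mean, the overall mean of the variable is preserved, while every released value is shared by at least k records.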

Matthias Templ

Chapter 5. Data Utility and Information Loss

Abstract

Once SDC methods have been applied to modify the original data set and to lower the disclosure risk, it is critical to measure the resulting information loss and data utility. Two complementary approaches exist to assess information loss: (i) directly measuring distances/frequencies between the original data and the perturbed data, and (ii) comparing statistics computed on the original and perturbed data. The first concept is common but often of limited use. The second concept is closer to the users and data sets, since its aim is to measure the differences for the most important indicators/estimates.
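
The two approaches can be contrasted in a small Python sketch. The measures below are simple stand-ins chosen for illustration, not the specific measures used in the book; the function names and the toy values are assumptions. The first function measures record-level distances, the second compares a statistic computed on both versions of the data.

```python
from statistics import mean, median


def mean_abs_relative_distance(original, perturbed):
    """Approach (i): average relative distance between paired values.

    Assumes the original values are non-zero.
    """
    return mean(abs(o - p) / abs(o) for o, p in zip(original, perturbed))


def relative_change(stat, original, perturbed):
    """Approach (ii): relative change of one statistic after perturbation.

    Assumes the statistic on the original data is non-zero.
    """
    s0 = stat(original)
    return abs(stat(perturbed) - s0) / abs(s0)


# Hypothetical original values and a perturbed (microaggregated) version.
original = [1200, 5000, 1300, 4800, 1250, 5100, 9000]
perturbed = [1250, 5975, 1250, 5975, 1250, 5975, 5975]

print("record-level distance:", mean_abs_relative_distance(original, perturbed))
print("change in mean:", relative_change(mean, original, perturbed))
print("change in median:", relative_change(median, original, perturbed))
```

In this toy example the mean is unchanged while the median shifts noticeably, which shows why the choice of indicators matters when judging data utility.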

Matthias Templ

Chapter 6. Synthetic Data

Abstract

The generation of synthetic data sets serves as a statistical disclosure control solution to generate public use files out of confidential/protected data. In addition, it is also a tool to create “augmented data sets” which serve as input for micro-simulation models or as data sets for remote execution. Multiple approaches and tools have been developed to generate synthetic data. These approaches can be categorized into three main groups: synthetic reconstruction, combinatorial optimization, and model-based generation. In this chapter, the most promising and important method, model-based simulation, is described in detail. It is also the reason why whole populations are simulated rather than only surveys. For other approaches, we refer to Drechsler (2011), Synthetic data sets for statistical disclosure control (Springer, New York), and other references below.

Matthias Templ

Chapter 7. Practical Guidelines

Abstract

This chapter offers some guidelines on how to implement traditional SDC methods in practice. A rough workflow is presented and described. In addition, a brief discussion of the selection of key variables, the acceptable risk of disclosure and the choice of SDC methods should guide the user in finding the best methodology.

Matthias Templ

Chapter 8. Case Studies

Abstract

In this chapter, we show how to apply the concepts and methods introduced in the previous chapters using sdcMicro. Anonymized data are produced for the Family Income and Expenditure Survey (FIES), the Structural Earnings Survey (SES), the International Income Distribution Database (I2D2), the Global Purchasing Power Parities and Real Expenditures data (P4), the so-called SHIP data, and the European Union Statistics on Income and Living Conditions (EU-SILC) data.

Matthias Templ

Backmatter

Title: Statistical Disclosure Control for Microdata
Author: Matthias Templ
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-50272-4
Print ISBN: 978-3-319-50270-0
DOI: https://doi.org/10.1007/978-3-319-50272-4