
About this book

Statistical disclosure control is the discipline concerned with producing statistical data that are safe enough to be released to external researchers. This book concentrates on the methodology of the area. It deals with both microdata (individual data) and tabular (aggregated) data. The book attempts to develop the theory from what can be called the paradigm of statistical confidentiality: to modify unsafe data in such a way that safe (enough) data emerge, with minimum information loss. This book discusses what safe data are, how information loss can be measured, and how to modify the data in a (near) optimal way. Once it has been decided how to measure safety and information loss, the production of safe data from unsafe data is often a matter of solving an optimization problem. Several such problems are discussed in the book, and most of them turn out to be hard problems that can be solved only approximately. The authors present new results that have not been published before. The book does not describe a closed area; on the contrary, the area still has many spots waiting to be explored more fully. Some of these are indicated in the book.

The book will be useful for official, social and medical statisticians and others who are involved in releasing personal or business data for statistical use. Operations researchers may be interested in the optimization problems involved, particularly for the challenges they present.

Leon Willenborg has worked at the Department of Statistical Methods at Statistics Netherlands since 1983, first as a researcher and since 1989 as a senior researcher. Since 1989 his main field of research and consultancy has been statistical disclosure control. From 1996 to 1998 he was the project coordinator of the EU co-funded SDC project.

Table of contents

Frontmatter

1. Overview of the Area

Abstract
Organizations conducting surveys and other forms of data collection may release the results of these exercises to third-party users as “statistical products” in a variety of formats. For example, they may release tables to the public through published reports or release microdata files to academics for secondary data analysis. The problem addressed in statistical disclosure control (SDC) is that a person who is given access to one of these statistical products may, through inappropriate use of the data, be able to disclose confidential information about the individual units that originally provided the data. These units might, for example, be respondents to a survey or persons completing forms for administrative purposes.
Leon Willenborg, Ton de Waal

2. Disclosure Risks for Microdata

Abstract
In this chapter we consider the potential disclosure risk arising from the release of microdata. We suppose that the data consist of a standard rectangular data file containing values of variables which at this stage have undergone no SDC treatment. We first consider possible scenarios by which an intruder might attempt to achieve disclosure. This enables us to specify a framework within which disclosure risk may be defined in terms of an intruder’s predictive probability distribution for values of confidential variables. Following a discussion of this predictive approach to measuring disclosure risk, we present arguments for preferring instead to measure risk in terms of the probability of re-identification. The estimation of re-identification risk is discussed in general and for the important special case of discrete variables.
Leon Willenborg, Ton de Waal

3. Data Analytic Impact of SDC Techniques on Microdata

Abstract
The aim of this chapter is to discuss the impact of SDC techniques on the data analytic potential of microdata. There is no single correct way to define “analytic potential”, since different users might analyze a given set of microdata in different, unforeseen ways. We shall begin by assuming that the purpose of the analysis is to estimate a specified set of population parameters. These might be descriptive parameters, such as means or proportions, or analytic parameters, such as the coefficients of a regression model. We consider the impact of SDC techniques on the estimation of these parameters and, specifically, the impact of the SDC techniques discussed in Chapter 1.
Leon Willenborg, Ton de Waal

4. Application of Non-Perturbative SDC Techniques for Microdata

Abstract
The aim of the present chapter is to consider the problem of producing a safe microdata set by applying global recodings and local suppressions, as discussed in Chapter 1. In our discussion we assume disclosure scenarios of the type discussed in Chapter 2. These scenarios have in common that an intruder is supposed to use a number of low-dimensional combinations of key variables in an attempt to disclose private information. Global recoding and local suppression should be applied in such a way that this type of disclosure is prevented, or at least sufficiently hampered. This can be achieved by making sure that unsafe combinations, i.e. combinations with frequencies below certain assumed threshold values, do not occur. This is precisely the case when the global recodings or local suppressions have yielded a safe microdata set (assuming the disclosure scenario adopted does apply), ideally with minimum information loss (using an information loss measure as discussed in Chapter 3). Clearly, obtaining such a safe microdata set by modifying the original unsafe microdata set requires an optimization problem to be solved.
Leon Willenborg, Ton de Waal
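
The frequency check mentioned in the Chapter 4 abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `unsafe_combinations`, the `max_dim` cut-off for "low-dimensional" combinations, the single shared threshold, and the toy records are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def unsafe_combinations(records, key_vars, threshold, max_dim=2):
    """Return {(variables, values): frequency} for every combination of
    key-variable values that occurs in the data fewer than `threshold` times.

    Assumption: one threshold is used for all combinations, and only
    combinations of up to `max_dim` key variables are inspected.
    """
    unsafe = {}
    for dim in range(1, max_dim + 1):
        for variables in combinations(key_vars, dim):
            # Count how often each value combination occurs in the file.
            counts = Counter(tuple(rec[v] for v in variables) for rec in records)
            for values, freq in counts.items():
                if freq < threshold:
                    unsafe[(variables, values)] = freq
    return unsafe


# Hypothetical toy microdata set with three key variables.
records = [
    {"region": "North", "sex": "F", "occupation": "baker"},
    {"region": "North", "sex": "F", "occupation": "baker"},
    {"region": "North", "sex": "M", "occupation": "baker"},
    {"region": "South", "sex": "M", "occupation": "mayor"},
    {"region": "South", "sex": "M", "occupation": "baker"},
]

result = unsafe_combinations(records, ["region", "sex", "occupation"], threshold=2)
for (variables, values), freq in result.items():
    print(variables, values, freq)
```

In this toy file the value "mayor" occurs only once, so it would be a candidate for global recoding (merging occupation categories) or local suppression (blanking the value in that record). Choosing between such modifications at minimum information loss is the optimization problem the abstract refers to; the sketch only identifies where a modification is needed.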

5. Application of Perturbative SDC Techniques for Microdata

Abstract
In this chapter we consider the application of some perturbative techniques to produce safe microdata. As discussed in Chapter 2, the risk of disclosure is conceived of as arising from the possibility that an intruder matches the values of key variables in the microdata to corresponding values in prior information. The approach in this chapter is to perturb the values of potential key variables in the microdata so that they cannot be matched to external data sources so easily.
Leon Willenborg, Ton de Waal

6. Disclosure Risk for Tabular Data

Abstract
In this chapter we consider the assessment of disclosure risk for tabular data. Disclosure risk may be defined either for the whole table or separately for each cell into which the table is organized. We shall sometimes use the term sensitivity as an alternative term for the disclosure risk of a table or cell. We suppose that a threshold may be specified as the maximum value of the disclosure risk that is deemed acceptable. Disclosure risk exceeding the threshold will call for the use of some form of SDC technique. For a measure of disclosure risk defined at the table level, we say that the table is sensitive if the disclosure risk of the table exceeds the given threshold. For a measure of disclosure risk defined at the cell level, we similarly say that a cell is sensitive if its disclosure risk is greater than the given threshold. In this book we restrict ourselves to measures of disclosure risk defined at the cell level. The objective of disclosure risk assessment will then be to determine which cells of a table are sensitive. We assume that a table containing sensitive cells may not be published. Having identified which cells are sensitive, the next step will be to treat these cells with an SDC technique such as cell suppression. This will be discussed in Chapters 8 and 9.
Leon Willenborg, Ton de Waal
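
The cell-level rule in the Chapter 6 abstract (a cell is sensitive if its disclosure risk is greater than the threshold) can be sketched as follows. The abstract does not say which risk measure is used, so the measure here, the share of the cell total held by the largest contributor, is an assumption chosen for illustration; `largest_share`, `sensitive_cells`, and the example table are hypothetical.

```python
def largest_share(contributions):
    """Assumed risk measure: share of the cell total due to the largest
    single contributor (0.0 for an empty or all-zero cell)."""
    total = sum(contributions)
    return max(contributions) / total if total > 0 else 0.0


def sensitive_cells(table, risk, threshold):
    """Return the cells whose risk is strictly greater than the threshold."""
    return [cell for cell, contribs in table.items() if risk(contribs) > threshold]


# Hypothetical table: each cell lists the individual contributions to it.
table = {
    ("North", "retail"): [40, 35, 25],
    ("North", "mining"): [90, 10],
    ("South", "retail"): [50, 50],
    ("South", "mining"): [100],
}

flagged = sensitive_cells(table, largest_share, threshold=0.75)
print(flagged)  # [('North', 'mining'), ('South', 'mining')]
```

Passing the risk measure as a function keeps the threshold comparison separate from the choice of measure, so another cell-level measure can be swapped in without changing `sensitive_cells`.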

7. Information Loss in Tabular Data

Abstract
In the present chapter we discuss the impact of SDC techniques on the statistical quality of tables. This impact on the quality is subsumed under the heading “information loss”.
Leon Willenborg, Ton de Waal

8. Application of Non-Perturbative Techniques for Tabular Data

Abstract
In the present chapter we discuss the application of two techniques to protect a table or a (hierarchical or linked) set of tables against disclosure, namely table redesign and secondary cell suppression. We start our discussion by considering a single table with its marginals, and then consider the more general case of hierarchical and linked tables. In fact the single table case is the one that has been given most attention in the literature, as it is the more basic one. Increasingly, however, attention is being paid to the hierarchical and linked table case. Since this generalization has not been settled at the time of writing we do not dwell very extensively on this subject.
Leon Willenborg, Ton de Waal
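
Why secondary cell suppression is needed for a single table published with its marginals can be shown with a small sketch. It is an illustration under stated assumptions, not the method of Chapter 8: the function `recoverable_cells`, the use of `None` to mark a suppressed cell, and the example figures are all hypothetical, and the check covers only exact recovery by subtraction from row and column totals.

```python
def recoverable_cells(cells, row_totals, col_totals):
    """Return {(row, col): value} for suppressed cells (marked None) that can
    be recomputed exactly from the published marginals by repeated subtraction."""
    grid = [row[:] for row in cells]  # work on a copy
    recovered = {}
    changed = True
    while changed:
        changed = False
        # A row with exactly one suppressed cell reveals that cell.
        for i, row in enumerate(grid):
            missing = [j for j, v in enumerate(row) if v is None]
            if len(missing) == 1:
                j = missing[0]
                grid[i][j] = row_totals[i] - sum(v for v in row if v is not None)
                recovered[(i, j)] = grid[i][j]
                changed = True
        # The same argument applies to columns.
        for j in range(len(col_totals)):
            col = [grid[i][j] for i in range(len(grid))]
            missing = [i for i, v in enumerate(col) if v is None]
            if len(missing) == 1:
                i = missing[0]
                grid[i][j] = col_totals[j] - sum(v for v in col if v is not None)
                recovered[(i, j)] = grid[i][j]
                changed = True
    return recovered


row_totals = [53, 50, 55]
col_totals = [35, 58, 65]

# Only the sensitive cell (0, 1) is suppressed: it is recovered at once.
primary_only = [[20, None, 30], [10, 15, 25], [5, 40, 10]]
print(recoverable_cells(primary_only, row_totals, col_totals))  # {(0, 1): 3}

# Three additional (secondary) suppressions block exact recovery.
with_secondary = [[20, None, None], [10, 15, 25], [5, None, None]]
print(recoverable_cells(with_secondary, row_totals, col_totals))  # {}
```

An empty result here means only that no suppressed cell can be recomputed exactly by subtraction. An intruder may still be able to derive bounds on the suppressed values from the marginals, which this sketch does not examine.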

9. Application of Perturbative Techniques for Tabular Data

Abstract
In this final chapter we consider the SDC techniques of adding noise, rounding, and source data perturbation (SDP). Adding noise perturbs the cell values in a table by adding random values to them. The random values are generated according to a prescribed probability distribution. “Rounding” in fact refers not to a particular SDC technique but rather to a class of SDC techniques. Each of these SDC techniques has its own advantages and disadvantages.
Leon Willenborg, Ton de Waal
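
Two of the techniques named in the Chapter 9 abstract, adding noise and rounding, can be sketched in a few lines. The specifics are assumptions, not taken from the book: the noise distribution (uniform integers in a fixed range), the rounding base of 5, and the two rounding variants shown (conventional rounding and a random rounding that rounds up with probability proportional to the remainder) are chosen only to illustrate that "rounding" covers more than one technique. All function names are hypothetical.

```python
import random


def add_noise(values, max_shift, rng):
    """Add integer noise drawn uniformly from [-max_shift, max_shift]
    (an assumed distribution) to each cell value."""
    return [v + rng.randint(-max_shift, max_shift) for v in values]


def conventional_round(value, base):
    """Round a non-negative integer to the nearest multiple of `base`
    (ties round up)."""
    return base * ((value + base // 2) // base)


def random_round(value, base, rng):
    """Round a non-negative integer to an adjacent multiple of `base`,
    rounding up with probability remainder / base, so that the expected
    value of the result equals the original value."""
    lower = base * (value // base)
    remainder = value - lower
    if remainder == 0:
        return value
    return lower + base if rng.random() < remainder / base else lower


cells = [3, 12, 27, 40]
rng = random.Random(0)  # seeded only to make the sketch repeatable

noisy = add_noise(cells, 2, rng)
rounded = [conventional_round(v, 5) for v in cells]
randomly_rounded = [random_round(v, 5, rng) for v in cells]

print(noisy)
print(rounded)  # [5, 10, 25, 40]
print(randomly_rounded)
```

Conventional rounding is deterministic, whereas the random variant gives different published values on different runs. Note also that additive noise as sketched here can push small counts below zero, so a practical scheme would need to handle that case.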

Backmatter
