
2006 | Book

Privacy in Statistical Databases

CENEX-SDC Project International Conference, PSD 2006, Rome, Italy, December 13-15, 2006. Proceedings

Editors: Josep Domingo-Ferrer, Luisa Franconi

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

Privacy in statistical databases is a discipline whose purpose is to provide solutions to the conflict between the increasing social, political and economic demand for accurate information, and the legal and ethical obligation to protect the privacy of the individuals and enterprises to which statistical data refer. Beyond law and ethics, there are also practical reasons for statistical agencies and data collectors to invest in this topic: if individual and corporate respondents feel their privacy is guaranteed, they are likely to provide more accurate responses. There are at least two traditions in statistical database privacy: one stems from official statistics, where the discipline is also known as statistical disclosure control (SDC), and the other originates from computer science and database technology. Both started in the 1970s, but the 1980s and the early 1990s saw little privacy activity on the computer science side. The Internet era has strengthened the interest of both statisticians and computer scientists in this area. Along with the traditional topics of tabular and microdata protection, some research lines have revived and/or appeared, such as privacy in queryable databases and protocols for private data computation.

Table of Contents

Frontmatter

Methods for Tabular Protection

A Method for Preserving Statistical Distributions Subject to Controlled Tabular Adjustment

Controlled tabular adjustment preserves confidentiality and tabular structure. Quality-preserving controlled tabular adjustment in addition preserves parameters of the distribution of the original (unadjusted) data. Both methods are based on mathematical programming. We introduce a method for preserving the original distribution itself, and a fortiori the distributional parameters. The accuracy of the approximation is measured by minimum discrimination information (MDI). MDI is computed using an optimal statistical algorithm, iterative proportional fitting.
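
The iterative proportional fitting (IPF) step mentioned above can be illustrated with a minimal Python sketch; the function name, the toy table and the target margins are illustrative assumptions, not material from the paper.

    import numpy as np

    def ipf(table, row_targets, col_targets, iters=100, tol=1e-9):
        """Iterative proportional fitting: rescale `table` so its row and
        column sums match the target margins (illustrative sketch)."""
        t = table.astype(float).copy()
        for _ in range(iters):
            # scale rows to match the target row sums
            t *= (row_targets / t.sum(axis=1))[:, None]
            # scale columns to match the target column sums
            t *= (col_targets / t.sum(axis=0))[None, :]
            if (np.abs(t.sum(axis=1) - row_targets).max() < tol and
                    np.abs(t.sum(axis=0) - col_targets).max() < tol):
                break
        return t

    # toy example: adjust a seed table to new margins
    seed = np.array([[40.0, 30.0], [35.0, 45.0]])
    adjusted = ipf(seed, row_targets=np.array([80.0, 70.0]),
                   col_targets=np.array([90.0, 60.0]))
    print(adjusted.round(2))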

Lawrence H. Cox, Jean G. Orelien, Babubhai V. Shah
Automatic Structure Detection in Constraints of Tabular Data

Methods for the protection of statistical tabular data, such as controlled tabular adjustment, cell suppression, or controlled rounding, need to solve several linear programming subproblems. For large multidimensional linked and hierarchical tables, such subproblems turn out to be computationally challenging. One of the techniques used to reduce the solution time of mathematical programming problems is to exploit the constraint structure using a specialized algorithm. Two of the most usual structures are block-angular matrices with either linking rows (primal block-angular structure) or linking columns (dual block-angular structure). Although the constraints associated with tabular data are intrinsically highly structured, current software for tabular data protection neither details nor exploits this structure, and simply provides a single matrix, or at most a set of smaller submatrices. In this work we provide an efficient tool for the automatic detection of primal or dual block-angular structure in constraint matrices. We test it on some of the complex CSPLIB instances, showing that when the number of linking rows or columns is small, the computational savings are significant.

Jordi Castro, Daniel Baena
A New Approach to Round Tabular Data

Controlled Rounding is a technique to replace each cell value in a table with a multiple of a base number such that the new table satisfies the same equations as the original table. Statistical agencies prefer a solution where cell values that are already multiples of the base number remain unchanged, while the others are replaced by one of the two closest multiples of the base number (i.e., rounded up or rounded down). This solution is called zero-restricted rounding. Finding such a solution is a very complicated problem, and for some tables it may not exist. This paper presents a mathematical model and an algorithm to find a good-enough near-feasible solution for tables where a zero-restricted rounding is complicated. It also presents computational results showing the behavior of the proposal in practice.
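
As a hedged illustration of the zero-restricted rule described above (not the paper's mathematical model or algorithm), the small Python sketch below enumerates, for a toy one-equation table, roundings in which multiples of the base stay fixed and every other cell moves to an adjacent multiple.

    import itertools

    def zero_restricted_candidates(value, base):
        """A cell already a multiple of `base` stays fixed; otherwise it may be
        rounded down or up to one of the two adjacent multiples."""
        if value % base == 0:
            return [value]
        lower = (value // base) * base
        return [lower, lower + base]

    def round_table(cells, total, base):
        """Brute-force search (toy-sized only) for a zero-restricted rounding of
        `cells` whose sum equals a zero-restricted rounding of `total`."""
        for rounded in itertools.product(*(zero_restricted_candidates(c, base) for c in cells)):
            if sum(rounded) in zero_restricted_candidates(total, base):
                return rounded
        return None  # no zero-restricted solution exists for this table

    cells = [7, 12, 4, 9]
    print(round_table(cells, total=sum(cells), base=5))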

Juan José Salazar González
Harmonizing Table Protection: Results of a Study

The paper reports results of a study aimed at the development of recommendations for harmonization of table protection in the German statistical system. We compare the performance of a selection of algorithms for secondary cell suppression under four different models for co-ordination of cell suppressions across agencies of a distributed system for official statistics, like the German or the European statistical system. For the special case of decentralized across-agency co-ordination as used in the European Statistical System, the paper also suggests a strategy to protect the data on the top level of the regional breakdown by perturbative methods rather than cell suppression.

Sarah Giessing, Stefan Dittrich

Utility and Risk in Tabular Protection

Effects of Rounding on the Quality and Confidentiality of Statistical Data

Statistical data may be rounded to integer values for statistical disclosure limitation. The principal issues in evaluating a disclosure limitation method are: (1) Is the method effective for limiting disclosure? and (2) Are the effects of the method on data quality acceptable? We examine the first question in terms of the posterior probability distribution of original data given rounded data and the second by computing expected increase in total mean square error and expected difference between pre- and post-rounding distributions, as measured by a conditional chi-square statistic, for four rounding methods.

Lawrence H. Cox, Jay J. Kim
Disclosure Analysis for Two-Way Contingency Tables

Disclosure analysis in two-way contingency tables is important in categorical data analysis. Disclosure analysis concerns whether a data snooper can infer any protected cell values, which contain privacy-sensitive information, from the available marginal totals (i.e., row sums and column sums) of a two-way contingency table. Previous research has approached this problem from various perspectives. However, there is a lack of systematic definitions of the disclosure of cell values, and no previous study has focused on the distribution of the cells that are subject to various types of disclosure. In this paper, we define four types of possible disclosure based on the exact upper bound and/or lower bound of each cell that can be computed from the marginal totals. For each type of disclosure, we discover the distribution pattern of the cells subject to disclosure. Based on the distribution patterns discovered, we can speed up the search for all cells subject to disclosure.
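
The exact cell bounds computable from the margins of a two-way table are the classical Fréchet bounds; the short Python sketch below (illustrative data, not the authors' code) computes them and flags cells whose lower and upper bounds coincide, i.e. cells that are exactly disclosed.

    import numpy as np

    def frechet_bounds(row_sums, col_sums):
        """Exact lower/upper bounds for each cell of a two-way contingency table
        given only its margins: max(0, r_i + c_j - N) <= n_ij <= min(r_i, c_j)."""
        r = np.asarray(row_sums)[:, None]
        c = np.asarray(col_sums)[None, :]
        n = r.sum()
        lower = np.maximum(0, r + c - n)
        upper = np.minimum(r, c)
        return lower, upper

    rows, cols = [6, 0, 4], [7, 3]
    lo, up = frechet_bounds(rows, cols)
    print("lower bounds:\n", lo)
    print("upper bounds:\n", up)
    print("exactly disclosed cells:\n", lo == up)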

Haibing Lu, Yingjiu Li, Xintao Wu
Statistical Disclosure Control Methods Through a Risk-Utility Framework

This paper discusses a disclosure risk – data utility framework for assessing statistical disclosure control (SDC) methods on statistical data. Disclosure risk is defined in terms of identifying individuals in small cells in the data which then leads to attribute disclosure of other sensitive variables. Information Loss measures are defined for assessing the impact of the SDC method on the utility of the data and its effects when carrying out standard statistical analysis tools. The quantitative disclosure risk and information loss measures can be plotted onto an R-U confidentiality map for determining optimal SDC methods. A user-friendly software application has been developed and implemented at the UK Office for National Statistics (ONS) to enable data suppliers to compare original and disclosure controlled statistical data and to make informed decisions on best methods for protecting their statistical data.

Natalie Shlomo, Caroline Young
A Generalized Negative Binomial Smoothing Model for Sample Disclosure Risk Estimation

We deal with the issue of risk estimation in a sample frequency table to be released by an agency. Risk arises from non-empty sample cells which represent small population cells and from population uniques in particular. Therefore risk estimation requires assessing which of the relevant population cells are indeed small. Various methods have been proposed for this task, and we present a new method in which estimation of a population cell frequency is based on smoothing using a local neighborhood of this cell, that is, cells having similar or close values in all attributes.

The statistical model we use is a generalized Negative Binomial model, which subsumes the Poisson and Negative Binomial models. We provide some preliminary results and experiments with this method.

Comparisons of the new approach are made to a method based on a Poisson regression log-linear hierarchical model, in which inference on a given cell is based on classical models of contingency tables. Such models connect each cell to a ‘neighborhood’ of cells with one or several common attributes, but some other attributes may differ significantly. We also compare to the Argus Negative Binomial method, in which inference on a given cell is based only on sampling weights, without learning from any type of ‘neighborhood’ of the given cell and without making use of the structure of the table.

Yosef Rinott, Natalie Shlomo
Entry Uniqueness in Margined Tables

We consider a problem in secure disclosure of multiway table margins. If the value of an entry in all tables having the same margins as those released from a source table in a database is unique, then the value of that entry can be exposed and disclosure is insecure. We settle the computational complexity of detecting whether this situation occurs. In particular, for multiway tables where one category is significantly richer than the others, that is, when each sample point can take many values in one category and only few values in the other categories, we provide, for the first time, a polynomial time algorithm for checking uniqueness, allowing disclosing agencies to check entry uniqueness and make informed decisions on secure disclosure. Our proofs use our recent results on universality of 3-way tables and on n-fold integer programming, which we survey on the way.

Shmuel Onn

Methods for Microdata Protection

Combinations of SDC Methods for Microdata Protection

A number of methods have been proposed in the literature for masking (protecting) microdata. Nearly all of these methods may be implemented with different degrees of intensity, by setting the value of an appropriate parameter. However, even parameter variation may not be sufficient to realize appropriate levels of disclosure risk and data utility. In this paper we propose a new approach to protection of numerical microdata: applying multiple stages of masking to the data in a way that increases utility but controls disclosure risk.

Anna Oganian, Alan F. Karr
A Fixed Structure Learning Automaton Micro-aggregation Technique for Secure Statistical Databases

We consider the problem of securing statistical databases and, more specifically, the micro-aggregation technique (MAT), which coalesces the individual records in the micro-data file into groups or classes and, on being queried, reports, for all individual values, the aggregated means of the corresponding group. This problem is known to be NP-hard and has been tackled using many heuristic solutions. In this paper we present the first reported Learning Automaton (LA) based solution to the MAT. The LA modifies a fixed-structure solution to the Equi-Partitioning Problem (EPP) to solve the micro-aggregation problem. The scheme has been implemented, rigorously tested and evaluated for different real and simulated data sets. The results clearly demonstrate the applicability of LA to the micro-aggregation problem, and show that it yields a solution with lower information loss than the best available heuristic methods for micro-aggregation.
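
To make the input/output behaviour of a micro-aggregation technique concrete, here is a deliberately naive Python sketch (a fixed-size grouping of sorted values, not the learning-automaton scheme of the paper): records are grouped and each value is replaced by its group mean.

    import numpy as np

    def naive_microaggregate(values, k=3):
        """Toy univariate micro-aggregation: sort the records, cut them into
        consecutive groups of at least k, and report each group's mean."""
        order = np.argsort(values)
        masked = np.empty_like(values, dtype=float)
        n = len(values)
        for start in range(0, n, k):
            group = order[start:start + k]
            if n - start < 2 * k:           # last group absorbs the remainder
                group = order[start:]
            masked[group] = values[group].mean()
            if n - start < 2 * k:
                break
        return masked

    incomes = np.array([21.0, 80.0, 23.0, 22.0, 85.0, 78.0, 40.0])
    print(naive_microaggregate(incomes, k=3))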

Ebaa Fayyoumi, B. John Oommen
Optimal Multivariate 2-Microaggregation for Microdata Protection: A 2-Approximation

Microaggregation is a special clustering problem where the goal is to cluster a set of points into groups of at least k points in such a way that groups are as homogeneous as possible. Microaggregation arises in connection with anonymization of statistical databases for privacy protection (k-anonymity), where points are assimilated to database records. A usual group homogeneity criterion is minimization of the within-groups sum of squares SSE. For multivariate points, optimal microaggregation, i.e. with minimum SSE, has been shown to be NP-hard. Recently, a polynomial-time O(k^3)-approximation heuristic has been proposed (previous heuristics in the literature offered no approximation bounds). The special case k = 2 (2-microaggregation) is interesting in privacy protection scenarios with neither internal intruders nor outliers, because information loss is lower: smaller groups imply smaller information loss. For 2-microaggregation the existing general approximation can only guarantee a 54-approximation. We give here a new polynomial-time heuristic whose SSE is at most twice the minimum SSE (a 2-approximation).

Josep Domingo-Ferrer, Francesc Sebé
Using the Jackknife Method to Produce Safe Plots of Microdata

We discuss several methods for producing plots of uni- and bivariate distributions of confidential numeric microdata so that no single value is disclosed even in the presence of detailed additional knowledge, using the jackknife method of confidentiality protection. For histograms (as for frequency tables) this is similar to adding white noise of constant amplitude to all frequencies. Decreasing the bin size and smoothing, leading to kernel density estimation in the limit, gives more informative plots which need less noise for protection. Detail can be increased by choosing the bandwidth locally. Smoothing also the noise (i.e. using correlated noise) gives more visual improvement. Additional protection comes from robustifying the kernel density estimator or plotting only classified densities as in contour plots.

Jobst Heitzig
Combining Blanking and Noise Addition as a Data Disclosure Limitation Method

Statistical disclosure limitation is widely used by data collecting institutions to provide safe individual data. In this paper, we propose to combine two separate disclosure limitation techniques, blanking and the addition of independent noise, in order to protect the original data. The proposed approach yields a decrease in the probability of re-identifying/disclosing the individual information, and can be applied to linear as well as nonlinear regression models.

We show how to combine the blanking method and the measurement error method, and how to estimate the model by the combination of the Simulation-Extrapolation (SIMEX) approach proposed by [4] and the Inverse Probability Weighting (IPW) approach going back to [8]. We produce Monte-Carlo evidence on how the reduction of data quality can be minimized by this masking procedure.
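
A minimal Python sketch of the combined masking step (noise addition plus blanking) is given below; the parameters are illustrative and the SIMEX/IPW estimation procedure of the paper is not reproduced.

    import numpy as np

    rng = np.random.default_rng(42)

    def blank_and_add_noise(x, blank_prob=0.1, noise_sd=0.5):
        """Mask a numeric column by (1) adding independent Gaussian noise and
        (2) blanking (setting to NaN) each record with probability blank_prob."""
        noisy = x + rng.normal(0.0, noise_sd, size=x.shape)
        blanked = rng.random(x.shape) < blank_prob
        noisy[blanked] = np.nan
        return noisy, blanked

    income = rng.normal(50.0, 10.0, size=20)
    masked, blank_flags = blank_and_add_noise(income)
    print(masked.round(2))
    print("share blanked:", blank_flags.mean())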

Anton Flossmann, Sandra Lechner
Why Swap When You Can Shuffle? A Comparison of the Proximity Swap and Data Shuffle for Numeric Data

The rank based proximity swap has been suggested as a data masking mechanism for numerical data. Recently, more sophisticated procedures for masking numerical data that are based on the concept of “shuffling” the data have been proposed. In this study, we compare and contrast the performance of the swapping and shuffling procedures. The results indicate that the shuffling procedures perform better than data swapping both in terms of data utility and disclosure risk.

Krish Muralidhar, Rathindra Sarathy, Ramesh Dandekar
Adjusting Survey Weights When Altering Identifying Design Variables Via Synthetic Data

Statistical agencies alter values of identifiers to protect respondents’ confidentiality. When these identifiers are survey design variables, leaving the original survey weights on the file can be a disclosure risk. Additionally, the original weights may not correspond to the altered values, which impacts the quality of design-based (weighted) inferences. In this paper, we discuss some strategies for altering survey weights when altering design variables. We do so in the context of simulating identifiers from probability distributions, i.e. partially synthetic data. Using simulation studies, we illustrate aspects of the quality of inferences based on the different strategies.

Robin Mitra, Jerome P. Reiter

Utility and Risk in Microdata Protection

Risk, Utility and PRAM

PRAM (Post Randomization Method) is a disclosure control method for microdata, introduced in 1997. Unfortunately, PRAM has not yet been applied extensively by statistical agencies in protecting their microdata. This is partly due to the fact that little knowledge is available on the effect of PRAM on disclosure control as well as on the loss of information it induces.

In this paper, we will try to make up for this lack of knowledge by supplying some empirical information on the behaviour of PRAM. To achieve this, some basic measures for loss of information and disclosure risk will be introduced. PRAM will be applied to one specific microdata file of over 6 million records, using several models in applying the procedure.
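
For readers new to PRAM, the following Python sketch (with an illustrative diagonal-heavy transition matrix) shows the core mechanism: each categorical value is re-drawn according to the corresponding row of a prescribed transition matrix, so some records keep their category and others are randomly changed.

    import numpy as np

    rng = np.random.default_rng(7)

    def pram(column, categories, transition):
        """Post Randomization Method (toy version): replace each observed
        category by a draw from the corresponding row of `transition`."""
        cat_index = {c: i for i, c in enumerate(categories)}
        out = []
        for value in column:
            row = transition[cat_index[value]]
            out.append(str(rng.choice(categories, p=row)))
        return out

    categories = ["employed", "unemployed", "inactive"]
    # diagonal-heavy matrix: each category is kept with probability 0.8
    P = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
    data = ["employed"] * 6 + ["unemployed"] * 3 + ["inactive"] * 3
    print(pram(data, categories, P))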

Peter-Paul de Wolf
Distance Based Re-identification for Time Series, Analysis of Distances

Record linkage is a technique for linking records from different files or databases that correspond to the same entity. Standard record linkage methods need the files to have some variables in common. Typically, variables are either numerical or categorical. These variables are the basis for permitting such linkage.

In this paper we study the problem that arises when the files to be linked consist of numerical time series instead of numerical variables. We study some extensions of distance-based record linkage that take advantage of this kind of data.

Jordi Nin, Vicenç Torra
Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk

An important issue any organization or individual has to face when managing data containing sensitive information is the risk that can be incurred when releasing such data. Even though data may be sanitized before being released, it is still possible for an adversary to reconstruct the original data by using additional information that may be available, for example, from other data sources. To date, however, no comprehensive approach exists to quantify such risks. In this paper we develop a framework, based on statistical decision theory, to assess the relationship between the disclosed data and the resulting privacy risk. We relate our framework to the k-anonymity disclosure method; we make the assumptions behind k-anonymity explicit, quantify them, and extend them in several natural directions.
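
As background for the framework discussed above, the short Python sketch below (illustrative record and attribute names) checks k-anonymity in its basic form: every combination of quasi-identifier values that occurs must occur in at least k records.

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        """A table is k-anonymous if every combination of quasi-identifier
        values that occurs, occurs in at least k records."""
        classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in classes.values()), classes

    records = [
        {"age": "30-39", "zip": "081**", "disease": "flu"},
        {"age": "30-39", "zip": "081**", "disease": "asthma"},
        {"age": "40-49", "zip": "082**", "disease": "flu"},
    ]
    ok, classes = is_k_anonymous(records, quasi_identifiers=["age", "zip"], k=2)
    print(ok, dict(classes))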

Guy Lebanon, Monica Scannapieco, Mohamed R. Fouad, Elisa Bertino
Using Mahalanobis Distance-Based Record Linkage for Disclosure Risk Assessment

Distance-based record linkage (DBRL) is a common approach to empirically assessing the disclosure risk in SDC-protected microdata. Usually, the Euclidean distance is used. In this paper, we explore the potential advantages of using the Mahalanobis distance for DBRL. We illustrate our point for partially synthetic microdata and show that, in some cases, Mahalanobis DBRL can yield a very high re-identification percentage, far superior to the one offered by other record linkage methods.
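
A hedged Python sketch of Mahalanobis distance-based record linkage follows (synthetic data and a naive pooled-covariance estimate, not the authors' experimental setup): each protected record is linked to the nearest original record, and the share of correct links serves as a re-identification rate.

    import numpy as np

    def mahalanobis_link(original, protected):
        """Link every protected record to its nearest original record under the
        Mahalanobis distance computed from the pooled covariance (toy sketch)."""
        cov = np.cov(np.vstack([original, protected]), rowvar=False)
        cov_inv = np.linalg.inv(cov)
        links = []
        for p in protected:
            diffs = original - p
            d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
            links.append(int(np.argmin(d2)))
        return np.array(links)

    rng = np.random.default_rng(0)
    original = rng.normal(size=(50, 3))
    protected = original + rng.normal(scale=0.1, size=original.shape)  # mild masking
    links = mahalanobis_link(original, protected)
    print("re-identification rate:", (links == np.arange(len(original))).mean())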

Vicenç Torra, John M. Abowd, Josep Domingo-Ferrer
Improving Individual Risk Estimators

The release of survey microdata files requires a preliminary assessment of the disclosure risk of the data. Record-level risk measures can be useful for “local” protection (e.g. partially synthetic data [21], or local suppression [25]), and are also used in [22] and [16] to produce global risk measures [13] useful for assessing data release. Whereas different proposals for estimating such risk measures are available in the literature, so far only a few attempts have targeted the evaluation of the statistical properties of these estimators. In this paper we pursue a simulation study that aims to evaluate the statistical properties of risk estimators. Besides presenting results about the Benedetti-Franconi individual risk estimator (see [11]), we also propose a strategy to produce improved risk estimates, and assess the latter by simulation.

The problem of estimating per-record re-identification risk shares many similarities with that of small area estimation (see [19]): we propose to introduce external information, arising from a previous census, into risk estimation. To achieve this we consider a simple strategy, namely Structure Preserving Estimation (SPREE) of Purcell and Kish [18], and show by simulation that this procedure provides better estimates of the individual risk of re-identification disclosure, especially for records whose risk is high.

Loredana Di Consiglio, Silvia Polettini

Protocols for Private Computation

Single-Database Private Information Retrieval Schemes: Overview, Performance Study, and Usage with Statistical Databases

This paper presents an overview of current single-database private information retrieval (PIR) schemes and proposes to explore the usage of these protocols with statistical databases. The proximity of this research field to that of Oblivious Transfer, and the different performance measures used over the last few years, have resulted in re-discoveries and contradictory performance comparisons in different publications. The contribution of this paper is twofold. First, we present the different schemes through the innovations they have brought to this field of research, which gives a global view of its evolution since the first of these schemes was presented by Kushilevitz and Ostrovsky in 1997. We know of no other survey of the current PIR protocols. We also compare the most representative of these schemes with a single set of communication performance measures. Compared to the use of global communication cost as a single measure, we assert that this set simplifies the evaluation of the cost of using PIR and reveals the scheme best adapted to each situation. We conclude this overview and performance study by introducing some important issues resulting from PIR usage with statistical databases and highlighting some directions for further research.

Carlos Aguilar Melchor, Yves Deswarte
Privacy-Preserving Data Set Union

This paper describes a cryptographic protocol for merging two or more data sets without divulging their identifying records; technically, the protocol computes a blind set-theoretic union. Applications for this protocol arise, for example, in data analysis for biomedical application areas, where identifying fields (e.g., patient names) are protected by governmental privacy regulations or by institutional research board policies.

Alberto Maria Segre, Andrew Wildenberg, Veronica Vieland, Ying Zhang
“Secure” Log-Linear and Logistic Regression Analysis of Distributed Databases

The machine learning community has focused on confidentiality problems associated with statistical analyses that “integrate” data stored in multiple, distributed databases where there are barriers to simply integrating the databases. This paper discusses various techniques which can be used to perform statistical analysis for categorical data, especially in the form of log-linear analysis and logistic regression over partitioned databases, while limiting confidentiality concerns. We show how ideas from the current literature that focus on “secure” summations and secure regression analysis can be adapted or generalized to the categorical data setting.
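
The secure-summation building block referred to above can be sketched in a few lines of Python (a toy ring-based version with an illustrative modulus, not the paper's protocol): a random mask hides the initiator's contribution so that only the total, and no single party's count, is revealed.

    import random

    def secure_sum(private_values, modulus=2**61 - 1):
        """Ring-based secure summation (toy version): the initiator adds a
        random mask, each party adds its value modulo `modulus`, and the
        initiator removes the mask from the final total."""
        mask = random.randrange(modulus)
        running = mask
        for v in private_values:               # value passed around the ring
            running = (running + v) % modulus
        return (running - mask) % modulus      # only the initiator can unmask

    cell_counts = [12, 7, 31]                  # e.g., one count per agency
    print(secure_sum(cell_counts))             # 50, no single count disclosed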

Stephen E. Fienberg, William J. Fulp, Aleksandra B. Slavkovic, Tracey A. Wrobel

Case Studies

Measuring the Impact of Data Protection Techniques on Data Utility: Evidence from the Survey of Consumer Finances

Despite the fact that much empirical economic research is based on public-use data files, the debate on the impact of disclosure protection on data quality has largely been conducted among statisticians and computer scientists. Remarkably, economists have shown very little interest in this subject, which has potentially profound implications for research. Without input from such subject-matter experts, statistical agencies may make decisions that unnecessarily obstruct analysis. This paper examines the impact of the application of disclosure protection techniques on a survey that is heavily used by both economists and policy-makers: the Survey of Consumer Finances. It evaluates the ability of different approaches to convey information about changes in data utility to subject matter experts.

Arthur Kennickell, Julia Lane
Protecting the Confidentiality of Survey Tabular Data by Adding Noise to the Underlying Microdata: Application to the Commodity Flow Survey

The Commodity Flow Survey (CFS) produces data on the movement of goods in the United States. The data from the CFS are used by analysts for transportation modeling, planning and decision-making. Cell suppression has been used over the years to protect responding companies’ values in CFS data. Data users, especially transportation modelers, would like to have access to data tables that do not have missing data due to suppression. To meet this need, we are testing the application of a noise protection method (Evans et al [3]) that involves adding noise to the underlying CFS microdata prior to tabulation to protect sensitive cells in CFS tables released to the public. Initial findings of this research have been positive. This paper describes detailed analyses that may be performed to evaluate the effectiveness of the noise protection.
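
The general idea of adding noise to microdata before tabulation can be sketched as follows in Python; the multiplicative factors and cell labels are illustrative and do not reproduce the Evans et al. noise method.

    import numpy as np

    rng = np.random.default_rng(1)

    def noisy_tabulation(values, cell_ids, noise=0.1):
        """Multiply each microdata value by a factor of roughly 1 +/- noise,
        then sum the noisy values within each table cell."""
        factors = rng.choice([1.0 - noise, 1.0 + noise], size=len(values))
        noisy = values * factors
        cells = {}
        for cell, v in zip(cell_ids, noisy):
            cells[cell] = cells.get(cell, 0.0) + v
        return cells

    shipments = np.array([120.0, 80.0, 300.0, 45.0, 60.0])
    cells = ["NY->PA", "NY->PA", "CA->TX", "CA->TX", "NY->PA"]
    print(noisy_tabulation(shipments, cells))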

Paul Massell, Laura Zayatz, Jeremy Funk
Italian Household Expenditure Survey: A Proposal for Data Dissemination

In this paper we define a proposal for an alternative data dissemination strategy for the Italian Household Expenditure Survey (HES). The proposal starts from partitioning the set of users into different groups, homogeneous in terms of needs, type of statistical analyses and access to external information. Such a partition allows the release of different data products that are hierarchical in information content and that may be protected using different data disclosure limitation methods. A new masking procedure that combines microaggregation and data swapping is proposed to preserve sampling weights.

Mario Trottini, Luisa Franconi, Silvia Polettini

Software

The ARGUS Software in CENEX

In this paper we give an overview of the CENEX project and concentrate on the current state of affairs with respect to the ARGUS software twins. The CENEX (Centre of Excellence) is a new initiative by Eurostat. The main idea behind the CENEX concept is to join the forces of the national statistical institutes (NSIs) and together bring their skills to a higher level. The CENEX on Statistical Disclosure Control is a first pilot CENEX project, aiming both at testing the feasibility of the CENEX idea and at working on SDC. This project will make a start on writing a handbook on SDC after an inventory, and will extend the ARGUS software with an emphasis on issues of practical use. Within this CENEX we will organise the transfer of technology via courses, a website and this conference. Finally, a roadmap for future work will hopefully lead to a follow-up CENEX.

In this paper we summarise this CENEX project and give a short overview of the current versions of ARGUS.

Anco Hundepool
Software Development for SDC in R

The production of scientific-use files from economic microdata is a major problem. Many common methods change the data in a way which leaves the univariate distribution of each variable almost unchanged with respect to the distribution in the original data; the multivariate structure of the data, however, is often ruined.

Which methods are suitable depends strongly on the underlying data. A program system is needed with which one can apply different methods and evaluate and compare results from different algorithms in a flexible way. Using methods for protecting microdata as an exploratory data analysis tool requires a powerful program system, able to present the results in a number of easy-to-grasp graphics. For this purpose some of the most popular procedures for anonymising microdata are provided in a flexible R package. The R system supports flexible data import/export facilities and advanced development tools for building such software for disclosure control.

In addition to algorithms existing in other software (MDAV algorithm for microaggregation, ...), some new algorithms for anonymising microdata are implemented, e.g. a fast algorithm for microaggregation based on a projection pursuit approach. This algorithm outperforms other existing algorithms on most real data.

For all these algorithms/methods, print, summary and plot methods as well as methods for validation are implemented.

In the field of economics, suppression of cells in marginal tables is probably the most popular method for statistical agencies to protect tables. The use of linear programming for cell suppression seems to be the best way of protecting tables and hierarchical tables.

Several R packages for various fields of disclosure control are currently being developed. It is easy to learn to apply disclosure control even with little previous knowledge, thanks to the integrated online help with examples ready to be executed.

M. Templ
On Secure e-Health Systems

This paper is devoted to e-healthcare security systems based on modern security mechanisms and Public Key Infrastructure (PKI) systems. We argued that only a general and multi-layered security infrastructure could cope with possible attacks on e-healthcare systems. We evaluated security mechanisms on the application, transport and network layers of the ISO/OSI reference model. These mechanisms include confidentiality protection based on symmetric cryptographic algorithms and digital signature technology based on asymmetric algorithms for authentication, integrity protection and non-repudiation. Strong user authentication procedures based on smart cards, digital certificates and PKI systems are especially emphasized. We gave a brief description of smart cards, HSMs and the main components of PKI systems, emphasizing the Certification Authority and its role in establishing cryptographically unique identities of valid system users based on X.509 digital certificates. Emerging e-healthcare systems and possible appropriate security mechanisms based on the proposed Generic CA model are analyzed.

Milan Marković
IPUMS-International High Precision Population Census Microdata Samples: Balancing the Privacy-Quality Tradeoff by Means of Restricted Access Extracts

A breakthrough in the tradeoff between privacy and data quality has been achieved for restricted access to population census microdata samples. The IPUMS-International website, as of June 2006, offers integrated microdata for 47 censuses, totaling more than 140 million person records, with 13 countries represented. Over the next four years, the global collaboratory led by the Minnesota Population Center, with major funding by the United States National Science Foundation and the National Institutes of Health, will disseminate samples for more than 100 additional censuses. The statistical authorities of more than 50 countries have already entrusted microdata to the project under a uniform memorandum of understanding which permits researchers to obtain custom extracts without charge and to analyze the microdata using their own hardware and software. This paper describes the disclosure control methods used by the IPUMS initiative to protect privacy and to provide access to high precision census microdata samples.

Robert McCaa, Steven Ruggles, Michael Davern, Tami Swenson, Krishna Mohan Palipudi
Backmatter
Metadata
Title
Privacy in Statistical Databases
Editors
Josep Domingo-Ferrer
Luisa Franconi
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-49332-7
Print ISBN
978-3-540-49330-3
DOI
https://doi.org/10.1007/11930242
