
2024 | Book

Privacy in Statistical Databases

International Conference, PSD 2024, Antibes Juan-les-Pins, France, September 25–27, 2024, Proceedings


About this book

This book constitutes the refereed proceedings of the International Conference on Privacy in Statistical Databases, PSD 2024, held in Antibes Juan-les-Pins, France, during September 25–27, 2024.

The 28 papers presented in this volume were carefully reviewed and selected from 46 submissions. They were organized in topical sections as follows: Privacy models and concepts; Microdata protection; Statistical table protection; Synthetic data generation methods; Synthetic data generation software; Disclosure risk assessment; Spatial and georeferenced data; Machine learning and privacy; and Case studies.

Table of Contents

Frontmatter

Privacy Models and Concepts

Frontmatter
From Isolation to Identification
Abstract
We present a mathematical framework for understanding when successfully distinguishing a person from all other persons in a data set—a phenomenon which we call isolation—may enable identification, a notion which is central to deciding whether a release based on the data set is subject to data protection regulation. We show that a baseline degree of isolation is unavoidable, in the sense that isolation can typically happen with high probability even before any release is made from the data set, and hence does not by itself enable identification. We then describe settings where isolation resulting from a data release may enable identification.
Giuseppe D’Acquisto, Aloni Cohen, Maurizio Naldi, Kobbi Nissim
Differentially Private Quantile Regression
Abstract
Quantile regression (QR) is a powerful and robust statistical modeling method broadly used in many fields such as economics, ecology, and healthcare. However, it has not been well-explored in differential privacy (DP) since its loss function lacks strong convexity and twice differentiability, often required by many DP mechanisms. We implement the smoothed QR loss via convolution within the K-Norm Gradient mechanism (KNG) and prove the resulting estimate converges to the non-private one asymptotically. Additionally, our work is the first to extensively investigate the empirical performance of DP smoothing QR under pure-, approximate- and concentrated-DP and four mechanisms, and cases commonly encountered in practice such as heavy-tailed and heteroscedastic data. We find that the Objective Perturbation Mechanism and KNG are the top performers across the simulated settings.
Tran Tran, Matthew Reimherr, Aleksandra Slavkovic
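For context, the quantile ("pinball") loss at level \(\tau\) is piecewise linear, hence neither strongly convex nor twice differentiable; convolution smoothing with a kernel restores both properties. A sketch (the kernel \(K\) and bandwidth \(h\) below are generic placeholders, not the paper's specific choices):

\[
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
\qquad
\rho_\tau^{h}(u) = (\rho_\tau \ast K_h)(u) = \int \rho_\tau(u - t)\,\tfrac{1}{h}K\!\left(\tfrac{t}{h}\right)dt ,
\]

which is twice differentiable for smooth \(K\) and \(h > 0\), making the loss compatible with mechanisms such as KNG.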
Utility Analysis of Differentially Private Anonymized Data Based on Random Sampling
Abstract
It is possible to produce differentially private k-anonymized data by random sampling followed by full-domain generalization for k-anonymization. We previously evaluated the performance of that method, which is implemented as the SafePub algorithm in the ARX anonymization tool. However, since the SafePub algorithm uses the maximum sampling rate that satisfies the requirements for differential privacy, we observed paradoxical results where data utility diminishes as the privacy budget for differential privacy increases.
In this paper, we therefore conduct preliminary experiments to explore the parameter space of privacy budget and sampling rate by setting the sampling rates explicitly through modifications to the implementation of ARX. Our initial results show that the utility of anonymized data can be improved by setting the sampling rate below its maximum value.
Takumi Sugiyama, Hiroto Oosugi, Io Yamanaka, Kazuhiro Minami
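A minimal sketch of why a lower sampling rate leaves room to maneuver: by the standard amplification-by-subsampling bound (the generic bound, not SafePub's exact accounting), running an \(\epsilon\)-DP mechanism on a Bernoulli(\(\beta\)) sample of the data is \(\ln(1 + \beta(e^{\epsilon} - 1))\)-DP overall, so shrinking \(\beta\) shrinks the effective per-record budget.

```python
import math

def amplified_epsilon(eps: float, beta: float) -> float:
    """Standard privacy amplification by subsampling: an eps-DP mechanism
    applied to a Bernoulli(beta) sample satisfies
    ln(1 + beta*(e^eps - 1))-DP with respect to the full data set."""
    return math.log(1.0 + beta * math.expm1(eps))

# Sweeping the sampling rate below its maximum, as the paper proposes,
# trades sampling noise against the effective per-record privacy budget:
for beta in (1.0, 0.5, 0.1, 0.01):
    print(f"beta={beta:5.2f} -> effective eps = {amplified_epsilon(1.0, beta):.4f}")
```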

Microdata Protection

Frontmatter
Asymptotic Utility of Spectral Anonymization
Abstract
In the contemporary data landscape characterized by multi-source data collection and third-party sharing, ensuring individual privacy stands as a critical concern. While various anonymization methods exist, their utility preservation and privacy guarantees remain challenging to quantify. In this work, we address this gap by studying the utility and privacy of the spectral anonymization (SA) algorithm, particularly in an asymptotic framework. Unlike conventional anonymization methods that directly modify the original data, SA operates by perturbing the data in a spectral basis and subsequently reverting them to their original basis. Alongside the original version \(\mathcal {P}\)-SA, employing random permutation transformation, we introduce two novel SA variants: \(\mathcal {J}\)-spectral anonymization and \(\mathcal {O}\)-spectral anonymization, which employ sign-change and orthogonal matrix transformations, respectively. We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data. Our results reveal, in particular, that the asymptotic efficiency of all three SA algorithms in covariance estimation is exactly 50% when compared to the original data. To assess the applicability of these asymptotic results in practice, we conduct a simulation study with finite data and also evaluate the privacy protection offered by these algorithms using distance-based record linkage. Our research reveals that while no method exhibits clear superiority in finite-sample utility, \(\mathcal {O}\)-SA distinguishes itself for its exceptional privacy preservation, never producing identical records, albeit with increased computational complexity. Conversely, \(\mathcal {P}\)-SA emerges as a computationally efficient alternative, demonstrating unmatched efficiency in mean estimation.
Katariina Perkonoja, Joni Virta
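A minimal sketch of the \(\mathcal{P}\)-SA idea for intuition (the centering step and the independent per-column permutation are our simplifying assumptions, not the authors' exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_sa(X: np.ndarray) -> np.ndarray:
    """Permutation-based spectral anonymization (P-SA), sketched:
    perturb the data in its spectral basis, then map back to the
    original basis."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    # Permute each spectral coordinate independently; column means and
    # norms (hence the second moments) are approximately preserved.
    for j in range(U.shape[1]):
        U[:, j] = rng.permutation(U[:, j])
    return U @ np.diag(s) @ Vt + mu

X = rng.normal(size=(500, 3))
Xa = p_sa(X)
print(np.round(np.cov(X, rowvar=False), 2))   # original covariance
print(np.round(np.cov(Xa, rowvar=False), 2))  # approximately preserved
```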
Robin Hood: A De-identification Method to Preserve Minority Representation for Disparities Research
Abstract
Data stewards often turn to de-identification to make data available for research while complying with privacy law. A primary challenge to de-identification is balancing the privacy-utility tradeoff, but optimizing the tradeoff with respect to a complete dataset has been shown to create both privacy risk and data utility disparities between subgroups of individuals represented in the dataset. Notably, the minority populations incur the greatest utility loss and privacy risks. Recent studies have shown that utility inequalities can mask disparities and bias algorithms trained on such data. Yet achieving equal privacy and utility is inherently constrained by the fact that each subgroup has a different privacy-utility tradeoff, differences that are exacerbated by the deterministic transformations that standard de-identification models typically employ. To address this problem, we introduce Robin Hood, a de-identification method that leverages non-deterministic transformations to more equally distribute risk and utility in a de-identified dataset. It does so by transforming majority groups’ records in a way that gives minorities privacy. We show how Robin Hood can provide equal privacy protections to all records in a dataset at expectation while supporting more accurate and consistent disparity estimation than standard k-anonymity methods in simulated and real-world Census data.
James Thomas Brown, Ellen W. Clayton, Michael Matheny, Murat Kantarcioglu, Yevgeniy Vorobeychik, Bradley A. Malin

Statistical Table Protection

Frontmatter
Secondary Cell Suppression by Gaussian Elimination: An Algorithm Suitable for Handling Issues with Zeros and Singletons
Abstract
To protect tabular data through cell suppression, efficient algorithms are essential. Gaussian elimination can be used for secondary cell suppression to prevent exact disclosure. A beneficial feature of this method is that all tables created from the same microdata can be handled simultaneously. This paper presents a solution to the issue where suppressed zeros in frequency tables cannot protect each other. In magnitude tables, it outlines how the algorithm can be tailored to provide protection against singleton contributors using their own data for disclosure.
Øyvind Langsrud
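For intuition, exact disclosure of a suppressed cell is a linear-algebra question: the cell is revealed if and only if its unit vector lies in the row space of the table equations restricted to the suppressed cells. The rank test below is a simplified stand-in for the paper's Gaussian-elimination machinery:

```python
import numpy as np

def exactly_disclosed(A_s: np.ndarray, i: int) -> bool:
    """A_s: constraint matrix of the additive table equations, restricted
    to the suppressed cells (one row per relation). Suppressed cell i is
    exactly disclosed iff the unit vector e_i lies in the row space of A_s,
    i.e. every feasible assignment of the suppressed cells agrees on i."""
    e_i = np.zeros((1, A_s.shape[1]))
    e_i[0, i] = 1.0
    return np.linalg.matrix_rank(np.vstack([A_s, e_i])) == np.linalg.matrix_rank(A_s)

# Two suppressed cells x0, x1 in a row with published total: x0 + x1 = 10.
A = np.array([[1.0, 1.0]])
print(exactly_disclosed(A, 0))   # False: x1 protects x0
# A second relation pinning x1 (e.g. its column total minus published cells)
# makes x0 exactly recoverable as 10 - x1:
A2 = np.array([[1.0, 1.0], [0.0, 1.0]])
print(exactly_disclosed(A2, 0))  # True: secondary suppression failed
```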
Obtaining \((\epsilon , \delta )\)-Differential Privacy Guarantees When Using a Poisson Mechanism to Synthesize Contingency Tables
Abstract
We show that differential privacy type guarantees can be obtained when using a Poisson synthesis mechanism to protect counts in contingency tables. Specifically, we show how to obtain \((\epsilon , \delta )\)-probabilistic differential privacy guarantees via the Poisson distribution’s cumulative distribution function. We demonstrate this empirically with the synthesis of an administrative-type confidential database.
James Jackson, Robin Mitra, Brian Francis, Iain Dove
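A rough illustration of the kind of computation the abstract describes: for Poisson synthesis of a count, the privacy-loss ratio between neighbouring counts \(n\) and \(n+1\) can be bounded except on a tail event whose mass, computable from the Poisson CDF, plays the role of \(\delta\). The bookkeeping below is our own simplification, not the paper's exact derivation:

```python
import math
from scipy.stats import poisson

def delta_for_count(n: int, eps: float) -> float:
    """Probability, over k ~ Poisson(n), that the log-likelihood ratio
    between Poisson(n) and Poisson(n+1) exceeds eps in absolute value.
    The ratio is 1 - k*ln((n+1)/n), so the bad event splits into a lower
    and an upper tail of the Poisson distribution."""
    c = math.log((n + 1) / n)
    k_lo = math.ceil((1 - eps) / c) - 1    # largest k with 1 - k*c > eps
    k_hi = math.floor((1 + eps) / c) + 1   # smallest k with k*c - 1 > eps
    return poisson.cdf(k_lo, n) + poisson.sf(k_hi - 1, n)

for n in (5, 50, 500):
    print(n, round(delta_for_count(n, eps=1.0), 6))  # delta shrinks as n grows
```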

Synthetic Data Generation Methods

Frontmatter
Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator
Abstract
In recent years, more and more synthetic data generators (SDGs) based on various modeling strategies have been implemented as Python libraries or R packages. With this proliferation of ready-made SDGs comes a widely held perception that generating synthetic data is easy. We show that generating synthetic data is a complicated process that requires one to understand both the original dataset as well as the synthetic data generator. We make two contributions to the literature in this topic area. First, we show that it is just as important to pre-process or clean the data as it is to tune the SDG in order to create synthetic data with high levels of utility. Second, we illustrate that it is critical to understand the methodological details of the SDG to be aware of potential pitfalls and to understand for which types of analysis tasks one can expect high levels of analytical validity.
Jonathan Latner, Marcel Neunhoeffer, Jörg Drechsler
Evaluating the Pseudo Likelihood Approach for Synthesizing Surveys Under Informative Sampling
Abstract
In recent years, national statistical organizations have increasingly relied on synthetic data when releasing microdata containing sensitive personal or establishment information. This paper deals with the challenges of using synthetic data to protect the privacy of survey respondents. For this type of data it is often important to consider the survey design information when creating the synthesis models.
The paper discusses two techniques that can be used for generating survey microdata under informative sampling. Specifically, it examines an approach that combines design-based and model-based methods through the use of the pseudo-likelihood approach within the sequential regression framework. As far as we are aware, the pseudo-likelihood method has not been used in the context of sequential regression synthesis before.
This method is compared with another approach in which design variables are included as predictors in the regression models. In the latter approach, the survey weights have to be synthesized and included in the final data product, while the former generates synthetic simple random samples that are representative of the original population without weights.
Anna Oganian, Jörg Drechsler, Mehtab Iqbal
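A toy sketch of the pseudo-likelihood idea outside the sequential-regression setting (the sampling design and weights below are invented for illustration): weighting each sampled unit's likelihood contribution by its survey weight recovers population-level model parameters under informative sampling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy survey with informative sampling: units with y = 1 are over-sampled,
# so an unweighted model fitted to the sample is biased for the population.
N = 100_000
x = rng.normal(size=N)
y = (x + rng.normal(size=N) > 0).astype(int)
p_incl = np.where(y == 1, 0.02, 0.005)      # inclusion probability depends on y
take = rng.random(N) < p_incl
xs, ys, w = x[take].reshape(-1, 1), y[take], 1.0 / p_incl[take]

# Pseudo-likelihood: weight each unit's log-likelihood by its survey weight.
weighted = LogisticRegression().fit(xs, ys, sample_weight=w)
naive = LogisticRegression().fit(xs, ys)
print("weighted intercept:", weighted.intercept_)  # near 0, as in the population
print("naive intercept:   ", naive.intercept_)     # shifted by the design
```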
The Production of Bespoke Synthetic Teaching Datasets Without Access to the Original Data
Abstract
Teaching datasets are a pivotal component of the data discovery pipeline. These datasets often serve as the initial point of interaction for data users, allowing them to explore the contents of a dataset and assess its relevance to their needs. However, there are instances where their viability is limited, particularly where source data is only accessible within restricted settings, such as trusted research environments (TREs). In response to this challenge, this paper proposes the production of synthetic datasets tailored for specific teaching purposes by utilising already cleared (and published) analyses as the basis for the synthesis. Unlike generic synthetic datasets, the datasets created are designed to solely reproduce the specific analyses. Crucially, the datasets can be generated without access to the original data. Two experiments with census data demonstrate the viability of the method and a live use case is described. Issues arising such as marginal disclosure risk are then discussed.
Mark Elliot, Claire Little, Richard Allmendinger
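One way to make this concrete (our illustration, not the authors' algorithm): given only published regression output (coefficients, covariate summaries, residual scale, and sample size), one can fabricate a dataset whose fitted model reproduces the published coefficients exactly, with no access to the original microdata.

```python
import numpy as np

rng = np.random.default_rng(2)

def teaching_data(beta, mu, cov, sigma, n):
    """Fabricate (X, y) so that the OLS fit reproduces the published
    coefficient vector exactly: draw covariates from published moments,
    then project the simulated residuals to be orthogonal to the design."""
    X = rng.multivariate_normal(mu, cov, size=n)
    Xd = np.column_stack([np.ones(n), X])
    e = rng.normal(0.0, sigma, size=n)
    e -= Xd @ np.linalg.lstsq(Xd, e, rcond=None)[0]  # residuals orthogonal to X
    y = Xd @ beta + e
    return X, y

beta = np.array([1.0, 2.0, -0.5])              # published: intercept + 2 slopes
X, y = teaching_data(beta, mu=[0, 0], cov=np.eye(2), sigma=1.0, n=200)
Xd = np.column_stack([np.ones(len(y)), X])
print(np.linalg.lstsq(Xd, y, rcond=None)[0])   # recovers beta exactly
```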

Synthetic Data Generation Software

Frontmatter
A Comparison of SynDiffix Multi-table Versus Single-table Synthetic Data
Abstract
SynDiffix is a new open-source tool for structured data synthesis. It has anonymization features that allow it to generate multiple synthetic tables while maintaining strong anonymity. Compared to the more common single-table approach, multi-table leads to more accurate data, since only the features of interest for a given analysis need be synthesized. This paper compares SynDiffix with 15 other commercial and academic synthetic data techniques using the SDNIST analysis framework, modified by us to accommodate multi-table synthetic data. The results show that SynDiffix is many times more accurate than other approaches for low-dimension tables, but somewhat worse than the best single-table techniques for high-dimension tables.
Paul Francis
An Evaluation of Synthetic Data Generators Implemented in the Python Library Synthcity
Abstract
Generating synthetic data has never been so easy. With the increasing popularity of the approach, more and more R packages and Python libraries offer ready-made synthesizers that promise to generate synthetic data with almost no effort. These synthetic data generators rely on various modeling strategies, such as generative adversarial networks, Bayesian networks, or variational autoencoders. Given the plethora of methods, users new to the approach have an increasingly hard time deciding where to even start when exploring the possibilities of synthetic data.
This paper aims to offer some guidance by empirically evaluating the analytical validity of 12 different synthesizers available in the Python library synthcity. While this comparison study offers only a small glimpse into the world of synthetic data (many more synthetic data generators exist, and we rely only on the default settings when training the various models), we still hope the evaluations offer some useful insights regarding the performance of the different synthesis strategies.
Emma Fössing, Jörg Drechsler
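For readers who want to reproduce this kind of evaluation, a minimal synthcity usage sketch (plugin name and API as in recent library versions; check the synthcity documentation for your release):

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader

# Any tabular DataFrame works; a scikit-learn demo set keeps this self-contained.
X, y = load_diabetes(return_X_y=True, as_frame=True)
df = pd.concat([X, y.rename("target")], axis=1)

loader = GenericDataLoader(df, target_column="target")
model = Plugins().get("ctgan")       # one of the evaluated synthesizer plugins
model.fit(loader)                    # default settings, as in the paper
synthetic = model.generate(count=len(df)).dataframe()
print(synthetic.head())
```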
Evaluation of Synthetic Data Generators on Complex Tabular Data
Abstract
Synthetic data generators are widely utilized to produce synthetic data, serving as a complement or replacement for real data. However, the utility of synthetic data is often limited by the complexity of the original data. The aim of this paper is to assess the performance of such generators on a complex data set that includes cluster structures and complex relationships. We compare different synthesizers such as synthpop, Synthetic Data Vault, simPop, Mostly AI, Gretel, Realtabformer, and arf, taking into account their different methodologies with (mostly) default settings, on two properties: syntactical accuracy and statistical accuracy. As a complex and popular data set, we used the European Statistics on Income and Living Conditions data set. Almost all synthesizers resulted in low data utility and low syntactical accuracy.
The results indicated that for such complex data, simPop, a computational and methodological framework for simulating complex data based on conditional modeling, emerged as the most effective approach for static tabular data and is superior to other conditional or joint modelling approaches.
Oscar Thees, Jiří Novák, Matthias Templ

Disclosure Risk Assessment

Frontmatter
An Examination of the Alleged Privacy Threats of Confidence-Ranked Reconstruction of Census Microdata
Abstract
The threat of reconstruction attacks has led the U.S. Census Bureau (USCB) to replace, in the 2020 Decennial Census, the traditional statistical disclosure limitation based on rank swapping with one based on differential privacy (DP), leading to substantial accuracy loss in the released statistics. Yet, it has been argued that, if many different reconstructions are compatible with the released statistics, most of them do not correspond to actual original data, which protects against respondent reidentification. Recently, a new attack has been proposed, which incorporates the confidence that a reconstructed record was in the original data. The alleged risk of disclosure entailed by such confidence-ranked reconstruction has renewed the interest of the USCB in DP-based solutions. To forestall a potential accuracy loss in future releases, we show that the proposed reconstruction is neither effective as a reconstruction method nor conducive to disclosure as claimed by its authors. Specifically, we report empirical results showing that the proposed ranking can guide neither reidentification nor attribute disclosure attacks, and hence fails to warrant the utility sacrifice entailed by the use of DP to release census statistical data.
David Sánchez, Najeeb Jebreel, Krishnamurty Muralidhar, Josep Domingo-Ferrer, Alberto Blanco-Justicia
Synthetic Data: Comparing Utility and Risk in Microdata and Tables
Abstract
Synthetic data has begun to show potential as an alternative to traditional SDC methods in specific use cases. This development and the increasing research efforts further hint at an emerging role in future privacy protection. However, since data synthesis predominantly happens at the microdata level, the development of utility and risk metrics is also focused on this domain. Statistical agencies, on the other hand, limit data publication mostly to aggregates, by selecting various subsets of variables for cross tabulation. We analyze the correlations between microdata and tabular data metrics for assessing utility and risk. Using a large real-life data set as an example for data synthesis, we show that certain global metrics may disproportionately represent small subsets of variables, making them an inappropriate estimator for the quality of aggregates. On the other hand, we show strong similarities between certain microdata-level risk metrics and risks of group disclosure in aggregated data.
Simon Xi Ning Kolb, Jui Andreas Tang, Sarah Giessing
Synthetic Data Outliers: Navigating Identity Disclosure
Abstract
Multiple synthetic data generation models have emerged, among which deep learning models have become the vanguard due to their ability to capture the underlying characteristics of the original data. However, the resemblance of the synthetic to the original data raises important questions about the protection of individuals’ privacy. As synthetic data is perceived as a means to fully protect personal information, most current related work disregards the impact of re-identification risk. In particular, limited attention has been given to exploring outliers, despite their privacy relevance. In this work, we analyze the privacy of synthetic data with respect to outliers. Our main findings suggest that re-identification of outliers via a linkage attack is feasible and easily achieved. Furthermore, additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of data utility.
Carolina Trindade, Luís Antunes, Tânia Carvalho, Nuno Moniz
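A toy distance-based linkage attack showing why outliers are exposed (the "generator" here is plain noise addition, a stand-in for a real synthesizer):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)

original = rng.normal(size=(1000, 4))
original[0] = 8.0                                  # one clear outlier record
# Stand-in "synthesizer": perturbed copies of the original records.
synthetic = original + rng.normal(scale=0.3, size=original.shape)

# Attacker links each synthetic record to its nearest original record.
nn = NearestNeighbors(n_neighbors=1).fit(original)
dist, idx = nn.kneighbors(synthetic)
print(idx[0, 0], round(dist[0, 0], 3))             # outlier links straight back to record 0
print(np.mean(idx[:, 0] == np.arange(len(original))))  # overall correct-linkage rate
```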
Privacy Risk from Synthetic Data: Practical Proposals
Abstract
This paper proposes and compares measures of identity and attribute disclosure risk for synthetic data. Data custodians can use the methods proposed here to inform the decision as to whether to release synthetic versions of confidential data. Different measures are evaluated on two data sets. Insight into the measures is obtained by examining the details of the records identified as posing a disclosure risk. This leads to methods to identify, and possibly exclude, apparently risky records where the identification or attribution would be expected by someone with background knowledge of the data. The methods described are available as part of the synthpop package for R.
Gillian M. Raab
Attribute Disclosure Risk in Smart Meter Data
Abstract
This paper studies attribute disclosure risk in aggregated smart meter data. Smart meter data is commonly aggregated to preserve the privacy of individual contributions. The published data shows aggregated consumption, preventing the revelation of individual consumption patterns. There is, however, a potential risk associated with aggregated data. We analyze several smart meter consumption datasets to show the potential risk of attribute disclosure. We observe that, even when data is aggregated with the most favorable aggregation approach, it still presents this attribute disclosure risk.
Guillermo Navarro-Arribas, Vicenç Torra
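A minimal numeric illustration of the effect (our toy example, not the paper's data): aggregation hides individual load curves, yet an attribute shared by all group members can still leak through the aggregate itself.

```python
import numpy as np

rng = np.random.default_rng(4)

# 5 households, hourly kWh. If all members are away during hours 0-5,
# the *aggregate* is zero there, disclosing that attribute for everyone.
group = rng.uniform(0.2, 0.8, size=(5, 24))
group[:, 0:6] = 0.0                    # every household idle in hours 0-5
aggregate = group.sum(axis=0)          # this is what gets published
print(aggregate[:6])                   # zeros: attribute disclosed for all members
```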
The statbarn: A New Model for Output Statistical Disclosure Control
Abstract
A major success for research this century has been the growth of secure facilities allowing research access to detailed sensitive personal data. This has also raised awareness of the problem of output disclosure risk, where statistics may inadvertently breach the confidentiality of data subjects, a risk that grows with the detail in the data.
Managing this risk is a concern for these secure facilities. While there is a well-established literature on the protection of frequency tables and linear aggregates, researchers in secure facilities produce a wide range of statistical outputs. The theory covering non-tabular outputs is small, fractured, and has grown ad hoc. This is also reflected in the guidance available to data service staff, which typically consists of a long list of outputs and some rules to be applied to them.
This paper describes a significant new concept in output statistical disclosure control: the statistical barn or ‘statbarn’. This is a framework to classify all statistical terms by their disclosure characteristics, including risk, exceptions and mitigation measures. This statbarn massively reduces the dimensionality of the disclosure checking problem, as well as providing improved clarity. It also creates a feasible basis for automatic disclosure control checking.
Elizabeth Green, Felix Ritchie, Paul White

Spatial and Georeferenced Data

Frontmatter
Masking Georeferenced Health Data - An Analysis Taking the Example of Partially Synthetic Data on Sleep Disorder
Abstract
Spatial health data is becoming increasingly important in health research. However, the desired information often cannot be extracted despite the inherent analytical content, because access to personal georeferenced data sets is severely restricted as they are subject to legal data protection. The method of donut masking attempts to mask original data by displacing it in such a way that data protection is guaranteed without strongly reducing the analytical validity of the data. In this article donut masking is applied to partially synthetic data on sleep disorders. The degree of anonymity of the masked data set is measured by spatial k-anonymity, taking into account additional knowledge available to a potential data attacker. In addition to assessing the spatial similarity of the original and masked data sets, an attempt is also made to assess the suitability of such data for analysis purposes.
Simon Cremer, Lydia Jehmlich, Rainer Lenz
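A minimal donut-masking sketch for intuition (planar coordinates and uniform-in-annulus sampling are simplifying assumptions; production implementations work on projected coordinates and may adapt the radii to local population density):

```python
import numpy as np

rng = np.random.default_rng(5)

def donut_mask(x, y, r_min, r_max):
    """Displace each point by a uniformly random angle and a distance in
    [r_min, r_max]: the inner radius guarantees a minimum displacement
    (privacy), the outer radius bounds the utility loss."""
    theta = rng.uniform(0, 2 * np.pi, size=len(x))
    # Uniform over the annulus *area*, not over the radius:
    r = np.sqrt(rng.uniform(r_min**2, r_max**2, size=len(x)))
    return x + r * np.cos(theta), y + r * np.sin(theta)

x, y = rng.uniform(0, 1000, size=(2, 100))     # metres in a projected CRS
mx, my = donut_mask(x, y, r_min=50, r_max=250)
print(np.hypot(mx - x, my - y).min())          # every point moved at least 50 m
```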
Privacy and Disclosure Risks in Spatial Dynamic Microsimulations
Abstract
Microsimulations are used as a tool for evidence-based policy, to better understand the impact of policies on society. However, the quality of the outcome depends considerably on the quality of the data in use. In order to provide a rich set of variables at a granular level, synthetic data may be involved. This becomes even more important in dynamic microsimulation, where projection into the future is simulated, or when providing an open data environment for the research community with a vast number of variables including geocoded information.
The present paper discusses opportunities and challenges of such synthetic but realistic data generation in a microsimulation data lab, which includes two steps of disclosure control.
First, the base information used for the synthetic data generation has to be reviewed in terms of disclosure risks. Second, it must be ensured that the data generating process does not systematically reproduce rare events for (synthetic) individuals or replicate original input data. Otherwise, due to the large amount of additional information, this may lead to a cumulative effect on the individual re-identification risks. In order to make these output data of dynamic spatial microsimulations available in the sense of open and reproducible research, such statistical disclosure risks must be excluded a priori.
This study examines whether disclosure risks may occur when synthetic data is generated from anonymized data from official statistics sources, and how these risks can be avoided in principle. Furthermore, we discuss methods within spatial dynamic microsimulation frameworks that automatically ensure the standards of statistical disclosure control, as well as those of official statistics data providers, during simulation runs.
Hanna Brenzel, Martin Palm, Jan Weymeirsch, Ralf Münnich

Machine Learning and Privacy

Frontmatter
Combinations of AI Models and XAI Metrics Vulnerable to Record Reconstruction Risk
Abstract
Explainable AI (XAI) metrics have gained attention because of a need to ensure fairness and transparency in machine learning models by providing users with some understanding of the models’ internal processes. Many services, including Amazon Web Services, the Google Cloud Platform, and Microsoft Azure run machine-learning-as-a-service platforms, which provide several indices, including Shapley values, that explain the relationship between the output of the black-box model and its private input features. However, in 2022, it was demonstrated that a Shapley-value-based explanation could lead to the reconstruction of private attributes, posing a privacy risk of information leakage from the model. It was shown that the leaked value would depend on the AI black-box model used. However, it was not clear which combinations of black-box model and XAI metric would be vulnerable to a reconstruction attack. The present study shows, both theoretically and experimentally, that Shapley values are indeed vulnerable to a reconstruction attack. We prove that Shapley values for a linear model can lead to a perfect reconstruction of records, that is, they can enable an accurate estimation of private values. In addition, we investigate the impact of various optimization algorithms used in attack models on the reconstruction risk.
Ryotaro Toma, Hiroaki Kikuchi
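To see why a linear model admits perfect reconstruction, recall that under the common interventional formulation with independent features, the Shapley value of feature \(j\) at input \(x\) is linear in \(x_j\) and therefore invertible whenever the weight is nonzero (the paper's precise setting may differ):

\[
f(x) = b + \sum_j w_j x_j
\;\Longrightarrow\;
\phi_j(x) = w_j\bigl(x_j - \mathbb{E}[x_j]\bigr)
\;\Longrightarrow\;
x_j = \frac{\phi_j(x)}{w_j} + \mathbb{E}[x_j] \quad (w_j \neq 0),
\]

so an attacker who observes the explanations \(\phi_j(x)\) and knows (or estimates) \(w_j\) and \(\mathbb{E}[x_j]\) recovers the private input exactly.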
DISCOLEAF: Personalized DIScretization of COntinuous Attributes for LEArning with Federated Decision Trees
Abstract
Federated learning is a distributed machine learning framework in which each client participating in the federation trains a machine learning model on its data and shares the trained model information with a central server, which aggregates it and sends the aggregated information back to the distributed clients. We choose to work with decision trees due to their simplicity and interpretability. On that note, we propose a full-fledged federated pipeline, which includes discretization and learning with decision trees for horizontally partitioned data. Our federated discretization approach can be plugged in as a preprocessing step before any other federated learning algorithm. During discretization, each client creates its own number of discrete bins according to its data and preferences; hence, our approach is both federated and personalized. After discretization, we propose to apply the post-randomization method (PRAM) to protect the discretized data with differential privacy guarantees. After protecting its database, each client trains a decision tree classifier on its protected database locally and shares the nodes, containing the split attribute and the split value, with the central server. The central server selects the most frequently occurring split attribute and combines the split values. This process continues until all the nodes to be merged are leaf nodes. The central server then shares the merged tree with the distributed clients. Hence, our proposed framework performs personalized, privacy-preserving federated learning with decision trees by discretizing continuous attributes and then masking them prior to the training stage. We call our proposed framework DISCOLEAF.
Saloni Kwatra, Vicenç Torra
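A sketch of PRAM over discretized bins via \(k\)-ary randomized response, one standard transition matrix that yields \(\epsilon\)-local differential privacy (not necessarily the exact matrix DISCOLEAF uses):

```python
import numpy as np

rng = np.random.default_rng(6)

def pram_randomized_response(bins: np.ndarray, k: int, eps: float) -> np.ndarray:
    """PRAM over k discrete bins via k-ary randomized response: keep the
    true bin with probability e^eps / (e^eps + k - 1), otherwise flip to a
    uniformly random *other* bin. The resulting transition matrix satisfies
    eps-local differential privacy."""
    p_keep = np.exp(eps) / (np.exp(eps) + k - 1)
    keep = rng.random(len(bins)) < p_keep
    other = (bins + rng.integers(1, k, size=len(bins))) % k  # any other bin
    return np.where(keep, bins, other)

bins = rng.integers(0, 5, size=10_000)           # a client's discretized column
protected = pram_randomized_response(bins, k=5, eps=1.0)
print(np.mean(protected == bins))                # empirical keep rate
```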
Node Injection Link Stealing Attack
Abstract
We present a stealthy privacy attack that exposes links in Graph Neural Networks (GNNs). Focusing on dynamic GNNs, we propose to inject new nodes and attach them to a particular target node to infer its private edge information. Our approach significantly enhances the \(F_1\) score of the attack compared to the current state-of-the-art benchmarks. Specifically, for the Twitch dataset, our method improves the \(F_1\) score by 23.75%, and for the Flickr dataset, remarkably, it is more than three times better than the state-of-the-art. We also propose and evaluate defense strategies based on differentially private (DP) mechanisms relying on a newly defined DP notion. These solutions, on average, reduce the effectiveness of the attack by 71.9% while only incurring a minimal utility loss of about 3.2%.
Oualid Zari, Javier Parra-Arnau, Ayşe Ünsal, Melek Önen
Assessing the Potentials of LLMs and GANs as State-of-the-Art Tabular Synthetic Data Generation Methods
Abstract
The abundance of tabular microdata constitutes a valuable resource for research, policymaking, and innovation. However, due to stringent privacy regulations, a significant portion of this data remains inaccessible. To address this, synthetic data generation methods have emerged as a promising solution. Here, we assess the potential of two state-of-the-art GAN and LLM tabular synthetic data generators using different utility and risk measures, and propose a robust risk estimation for individual records based on shared nearest neighbors. LLMs outperform CTGAN by generating synthetic data that more closely matches real data distributions, as evidenced by lower Wasserstein distances. LLMs also generally provide better predictive performance than CTGAN, with higher \({F}_{1}\) and \({R}^{2}\) scores. Interestingly, this does not necessarily mean that LLMs better capture correlations. Our proposed risk measure, Shared Neighbor Identifiability (SNI), proves effective in accurately assessing identification risk, offering a robust tool for navigating the risk-utility trade-off. Furthermore, we identify the challenges posed by mixed feature types in distance calculation. Ultimately, the choice between LLMs and GANs depends on factors such as data complexity, computational resources, and the desired level of model interpretability, emphasizing the importance of informed decision-making when selecting a generative model for a specific application.
Marko Miletic, Murat Sariyar
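The marginal-fidelity comparison reported here can be approximated with per-feature 1-d Wasserstein distances; a toy version with stand-in "LLM" and "GAN" outputs (the paper's actual evaluation pipeline may aggregate differently):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)

real = rng.gamma(2.0, 2.0, size=(5000, 3))
close = real + rng.normal(scale=0.1, size=real.shape)  # stand-in: faithful generator
far = rng.normal(4.0, 2.8, size=real.shape)            # stand-in: poor generator

# Lower per-column Wasserstein distance = closer marginal distributions.
for name, syn in (("close", close), ("far", far)):
    d = [wasserstein_distance(real[:, j], syn[:, j]) for j in range(real.shape[1])]
    print(name, np.round(d, 3))
```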

Case Studies

Frontmatter
Escalation of Commitment: A Case Study of the United States Census Bureau Efforts to Implement Differential Privacy for the 2020 Decennial Census
Abstract
In 2017, the United States Census Bureau announced that because of high disclosure risk in the methodology (data swapping) used to produce tabular data for the 2010 census, a different protection mechanism based on differential privacy would be used for the 2020 census. While there have been many studies evaluating the result of this change, there has been no rigorous examination of disclosure risk claims resulting from the released 2010 tabular data. In this study we perform such an evaluation. We show that the procedures used to evaluate disclosure risk are unreliable and resulted in inflated disclosure risk. Demonstration data products released using the new procedure were also shown to have poor utility. However, since the Census Bureau had already committed to a different procedure, they had no option except to escalate their commitment. The result of such escalation is that the 2020 tabular data release offers neither privacy nor accuracy.
Krishnamurty Muralidhar, Steven Ruggles
Relational Or Single: A Comparative Analysis of Data Synthesis Approaches for Privacy and Utility on a Use Case from Statistical Office
Abstract
This paper presents a case study focused on synthesizing relational datasets within Official Statistics for software and technology testing purposes. Specifically, the focus is on generating synthetic data for testing and validating software code. Our study conducts a comprehensive comparative analysis of various synthesis approaches tailored for a multi-table relational database featuring a one-to-one relationship versus a single table. We leverage state-of-the-art single and multi-table synthesis methods to evaluate their potential to maintain the analytical validity of the data, ensure data utility, and mitigate risks associated with disclosure. The evaluation of analytical validity includes assessing how well synthetic data replicates the structure and characteristics of real datasets. First, we compare synthesis methods based on their ability to maintain constraints and conditional dependencies found in real data. Second, we evaluate the utility of synthetic data by training linear regression models on both real and synthetic datasets. Lastly, we measure the privacy risks associated with synthetic data by conducting attribute inference attacks to measure the disclosure risk of sensitive attributes. Our experimental results indicate that the single-table data synthesis method demonstrates superior performance in terms of analytical validity, utility, and privacy preservation compared to the multi-table synthesis method. However, we find promise in the premise of multi-table data synthesis in protecting against attribute disclosure, albeit calling for future exploration to improve the utility of the data.
Manel Slokom, Shruti Agrawal, Nynke C. Krol, Peter-Paul de Wolf
A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics
Abstract
In this paper, we investigate different approaches for generating synthetic microdata from open-source aggregated data. Specifically, we focus on macro-to-micro data synthesis. We explore the potential of the Gaussian copula framework to estimate joint distributions from aggregated data. Our generated synthetic data is intended for educational and software-testing use cases. We propose three scenarios to achieve realistic and high-quality synthetic microdata: (1) zero knowledge, (2) internal knowledge, and (3) external knowledge. The three scenarios involve different levels of knowledge of the underlying properties of the real microdata, i.e., standard deviations and covariances. Our evaluation includes matching tests to evaluate the privacy of the synthetic datasets. Our results indicate that macro-to-micro synthesis achieves better privacy preservation compared to other methods, demonstrating both the potential and the challenges of synthetic data generation in maintaining data privacy while providing useful data for analysis.
Mohamed Aghaddar, Liu Nuo Su, Manel Slokom, Lucas Barnhoorn, Peter-Paul de Wolf
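A minimal Gaussian-copula macro-to-micro sketch (the marginals and the correlation below are invented placeholders for what the "internal/external knowledge" scenarios would supply from published aggregates):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Published aggregates give marginal summaries; dependence is assumed or
# supplied as side knowledge. Sample correlated latent normals, map them to
# uniforms (the copula scale), then through each marginal quantile function.
corr = np.array([[1.0, 0.6], [0.6, 1.0]])            # assumed dependence
z = rng.multivariate_normal(np.zeros(2), corr, size=1000)
u = stats.norm.cdf(z)                                 # uniforms with the copula's dependence

income = stats.lognorm(s=0.5, scale=30_000).ppf(u[:, 0])  # marginal from aggregates
age = stats.norm(40, 12).ppf(u[:, 1])
print(np.corrcoef(income, age)[0, 1])                 # dependence carried into microdata
```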
Backmatter
Metadata
Title
Privacy in Statistical Databases
Editors
Josep Domingo-Ferrer
Melek Önen
Copyright Year
2024
Electronic ISBN
978-3-031-69651-0
Print ISBN
978-3-031-69650-3
DOI
https://doi.org/10.1007/978-3-031-69651-0
