Skip to main content

2024 | OriginalPaper | Buchkapitel

Relational Or Single: A Comparative Analysis of Data Synthesis Approaches for Privacy and Utility on a Use Case from Statistical Office

verfasst von : Manel Slokom, Shruti Agrawal, Nynke C. Krol, Peter-Paul de Wolf

Erschienen in: Privacy in Statistical Databases

Verlag: Springer Nature Switzerland

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper presents a case study focused on synthesizing relational datasets within Official Statistics for software and technology testing purposes. Specifically, the focus is on generating synthetic data for testing and validating software code. Our study conducts a comprehensive comparative analysis of various synthesis approaches tailored for a multi-table relational database featuring a one-to-one relationship versus a single table. We leverage state-of-the-art single and multi-table synthesis methods to evaluate their potential to maintain the analytical validity of the data, ensure data utility, and mitigate risks associated with disclosure. The evaluation of analytical validity includes assessing how well synthetic data replicates the structure and characteristics of real datasets. First, we compare synthesis methods based on their ability to maintain constraints and conditional dependencies found in real data. Second, we evaluate the utility of synthetic data by training linear regression models on both real and synthetic datasets. Lastly, we measure the privacy risks associated with synthetic data by conducting attribute inference attacks to measure the disclosure risk of sensitive attributes. Our experimental results indicate that the single-table data synthesis method demonstrates superior performance in terms of analytical validity, utility, and privacy preservation compared to the multi-table synthesis method. However, we find promise in the premise of multi-table data synthesis in protecting against attribute disclosure, albeit calling for future exploration to improve the utility of the data.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
1.
Zurück zum Zitat Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016) Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
3.
Zurück zum Zitat Domingo-Ferrer, J., Torra, V.: Disclosure risk assessment in statistical data protection. J. Comput. Appl. Math. 164–165, 285–293 (2004). proceedings of the 10th International Congress on Computational and Applied MathematicsMathSciNetCrossRef Domingo-Ferrer, J., Torra, V.: Disclosure risk assessment in statistical data protection. J. Comput. Appl. Math. 164–165, 285–293 (2004). proceedings of the 10th International Congress on Computational and Applied MathematicsMathSciNetCrossRef
4.
Zurück zum Zitat Drechsler, J., Bender, S., Rässler, S.: Comparing fully and partially synthetic datasets for statistical disclosure control in the German IAB establishment panel. Trans. Data Priv. 1(3), 105–130 (2008)MathSciNet Drechsler, J., Bender, S., Rässler, S.: Comparing fully and partially synthetic datasets for statistical disclosure control in the German IAB establishment panel. Trans. Data Priv. 1(3), 105–130 (2008)MathSciNet
5.
Zurück zum Zitat Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011)MathSciNetCrossRef Drechsler, J., Reiter, J.P.: An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. 55(12), 3232–3243 (2011)MathSciNetCrossRef
6.
Zurück zum Zitat Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, vol. 201. Springer, New York (2011) Drechsler, J.: Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, vol. 201. Springer, New York (2011)
7.
Zurück zum Zitat Nations Economic Commission for Europe, U., et al.: Synthetic data for official statistics: a starter guide (2023) Nations Economic Commission for Europe, U., et al.: Synthetic data for official statistics: a starter guide (2023)
8.
Zurück zum Zitat European Parliament and Council of the European Union: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 april 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). OJ 2016 L 119, pp. 1–88 (2016) European Parliament and Council of the European Union: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 april 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). OJ 2016 L 119, pp. 1–88 (2016)
9.
Zurück zum Zitat Fan, J., Chen, J., Liu, T., Shen, Y., Li, G., Du, X.: Relational data synthesis using generative adversarial networks: a design space exploration. Proc. VLDB Endow. 13(12), 1962–1975 (2020)CrossRef Fan, J., Chen, J., Liu, T., Shen, Y., Li, G., Du, X.: Relational data synthesis using generative adversarial networks: a design space exploration. Proc. VLDB Endow. 13(12), 1962–1975 (2020)CrossRef
10.
Zurück zum Zitat Gao, C., Jajodia, S., Pugliese, A., Subrahmanian, V.: FakeDB: generating fake synthetic databases. IEEE Trans. Dependable Secure Comput. 1–12 (2024) Gao, C., Jajodia, S., Pugliese, A., Subrahmanian, V.: FakeDB: generating fake synthetic databases. IEEE Trans. Dependable Secure Comput. 1–12 (2024)
11.
Zurück zum Zitat Hittmeir, M., Mayer, R., Ekelhart, A.: A baseline for attribute disclosure risk in synthetic data. In: Proceedings of the 10th ACM Conference on Data and Application Security and Privacy, pp. 133–143 (2020) Hittmeir, M., Mayer, R., Ekelhart, A.: A baseline for attribute disclosure risk in synthetic data. In: Proceedings of the 10th ACM Conference on Data and Application Security and Privacy, pp. 133–143 (2020)
12.
Zurück zum Zitat Hundepool, A., et al.: Statistical Disclosure Control. Wiley, Hoboken (2012)CrossRef Hundepool, A., et al.: Statistical Disclosure Control. Wiley, Hoboken (2012)CrossRef
14.
Zurück zum Zitat Liew, C.K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM Trans. Database Syst. 10(3), 395–411 (1985)CrossRef Liew, C.K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM Trans. Database Syst. 10(3), 395–411 (1985)CrossRef
16.
Zurück zum Zitat Nowok, B., Raab, G.M., Dibben, C.: synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016)CrossRef Nowok, B., Raab, G.M., Dibben, C.: synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74(11), 1–26 (2016)CrossRef
17.
Zurück zum Zitat Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on Generative Adversarial Networks. In: Proceedings of the 44th International Conference on Very Large Data Bases (VLDB Endowment), vol. 11, no. 10, pp. 1071–1083 (2018) Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on Generative Adversarial Networks. In: Proceedings of the 44th International Conference on Very Large Data Bases (VLDB Endowment), vol. 11, no. 10, pp. 1071–1083 (2018)
18.
Zurück zum Zitat Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: IEEE International Conference on Data Science and Advanced Analytics, pp. 399–410 (2016) Patki, N., Wedge, R., Veeramachaneni, K.: The synthetic data vault. In: IEEE International Conference on Data Science and Advanced Analytics, pp. 399–410 (2016)
19.
Zurück zum Zitat Rubin, D.B.: Discussion statistical disclosure limitation. J. Official Stat. 9(2), 461–468 (1993) Rubin, D.B.: Discussion statistical disclosure limitation. J. Official Stat. 9(2), 461–468 (1993)
20.
Zurück zum Zitat Salter, C., Saydjari, O.S., Schneier, B., Wallner, J.: Toward a secure system engineering methodology. In: Proceedings of the Workshop on New Security Paradigms, pp. 2–10. NSPW (1998) Salter, C., Saydjari, O.S., Schneier, B., Wallner, J.: Toward a secure system engineering methodology. In: Proceedings of the Workshop on New Security Paradigms, pp. 2–10. NSPW (1998)
21.
Zurück zum Zitat Schlomo, N.: How to measure disclosure risk in microdata? Surv. Stat. 86, 13–21 (2022) Schlomo, N.: How to measure disclosure risk in microdata? Surv. Stat. 86, 13–21 (2022)
22.
Zurück zum Zitat Slokom, M., de Wolf, P.P., Larson, M.: When machine learning models leak: an exploration of synthetic training data. In: Domingo-Ferrer, J., Laurent, M. (eds.) Proceedings of the International Conference on Privacy in Statistical Databases: Corrected and updated version on arXiv at:https://arxiv.org/abs/2310.08775 (2022) Slokom, M., de Wolf, P.P., Larson, M.: When machine learning models leak: an exploration of synthetic training data. In: Domingo-Ferrer, J., Laurent, M. (eds.) Proceedings of the International Conference on Privacy in Statistical Databases: Corrected and updated version on arXiv at:https://​arxiv.​org/​abs/​2310.​08775 (2022)
25.
Zurück zum Zitat Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 7335–7345 (2019) Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 7335–7345 (2019)
Metadaten
Titel
Relational Or Single: A Comparative Analysis of Data Synthesis Approaches for Privacy and Utility on a Use Case from Statistical Office
verfasst von
Manel Slokom
Shruti Agrawal
Nynke C. Krol
Peter-Paul de Wolf
Copyright-Jahr
2024
DOI
https://doi.org/10.1007/978-3-031-69651-0_27