2024 | OriginalPaper | Book chapter

Evaluation of Synthetic Data Generators on Complex Tabular Data

Written by: Oscar Thees, Jiří Novák, Matthias Templ

Published in: Privacy in Statistical Databases

Publisher: Springer Nature Switzerland

Abstract

Synthetic data generators are widely used to produce synthetic data that complements or replaces real data. However, the utility of synthetic data is often limited by the complexity of the data being modeled. This paper evaluates how well such generators perform on a complex data set that includes cluster structures and complex relationships. We compare different synthesizers such as synthpop, Synthetic Data Vault, simPop, Mostly AI, Gretel, Realtabformer, and arf, which follow different methodologies, using (mostly) default settings and two properties: syntactical accuracy and statistical accuracy. As a complex and popular data set, we use the European Statistics on Income and Living Conditions data set. Almost all synthesizers produced data with low utility and low syntactical accuracy. The results indicate that for such complex data, simPop, a computational and methodological framework for simulating complex data based on conditional modeling, is the most effective approach for static tabular data and outperforms the other conditional or joint modelling approaches.

Footnotes
1
In addition to logistic regression, other methods, such as decision trees or random forests, can be used to calculate propensity scores.
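
To make the footnote concrete, the following is a minimal sketch of a propensity-score utility measure of the kind often called pMSE: real and synthetic records are stacked, a classifier estimates each record's probability of being synthetic, and the mean squared deviation of these scores from the synthetic share c is reported. It is an assumption that this matches the measure used in the chapter; the function names (`fit_logistic`, `pmse`), the toy Gaussian data, and the hand-written gradient-descent logistic regression are all illustrative and not taken from the source.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(w, x):
    # Intercept w[0] plus the dot product of the remaining weights with x.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic regression by batch gradient descent (illustrative only)."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(linear(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def pmse(real, synth):
    """Mean squared deviation of propensity scores from the synthetic share c."""
    X = real + synth
    y = [0] * len(real) + [1] * len(synth)  # 1 marks a synthetic record
    w = fit_logistic(X, y)
    c = len(synth) / len(X)
    scores = [sigmoid(linear(w, xi)) for xi in X]
    return sum((p - c) ** 2 for p in scores) / len(X)


# Toy data (hypothetical): two numeric columns drawn from normal distributions.
rng = random.Random(0)


def sample(mean, n=200):
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]


real = sample(0.0)
good = pmse(real, sample(0.0))  # synthetic data from the same distribution
bad = pmse(real, sample(3.0))   # synthetic data with shifted means
print(f"pMSE similar: {good:.4f}, pMSE shifted: {bad:.4f}")
```

Values near zero mean the classifier cannot tell real from synthetic records, while larger values indicate lower utility. As the footnote notes, the logistic regression could be swapped for a decision tree or random forest; only the step that produces the scores would change.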
 
Metadata
Title
Evaluation of Synthetic Data Generators on Complex Tabular Data
Written by
Oscar Thees
Jiří Novák
Matthias Templ
Copyright year
2024
DOI
https://doi.org/10.1007/978-3-031-69651-0_13