2024 | OriginalPaper | Chapter

An Evaluation of Synthetic Data Generators Implemented in the Python Library Synthcity

Authors: Emma Fössing, Jörg Drechsler

Published in: Privacy in Statistical Databases

Publisher: Springer Nature Switzerland

Abstract

Generating synthetic data has never been easier. As the approach grows in popularity, more and more R packages and Python libraries offer ready-made synthesizers that promise to generate synthetic data with almost no effort. These synthetic data generators rely on various modeling strategies, such as generative adversarial networks, Bayesian networks, or variational autoencoders. Given the plethora of methods, users new to the approach find it increasingly hard to decide where to even start when exploring the possibilities of synthetic data.
This paper aims to offer some guidance by empirically evaluating the analytical validity of 12 different synthesizers available in the Python library synthcity. This comparison study offers only a small glimpse into the world of synthetic data: many more synthetic data generators exist, and we rely only on the default settings when training the various models. We still hope the evaluations offer some useful insights into the performance of the different synthesis strategies.
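
To make the idea of such a comparison concrete, the following is a minimal sketch of how several generators could be run under fixed default settings and ranked by how well a simple analysis on the synthetic data reproduces the original result. Everything in it is an assumption for illustration: the two toy generators (`bootstrap_generator`, `gaussian_generator`) are hypothetical stand-ins rather than synthcity synthesizers, and the gap in the estimated mean is only one possible notion of analytical validity, not necessarily the measure used in the chapter.

```python
import random
import statistics


def bootstrap_generator(data, n, rng):
    """Hypothetical toy generator: resample original records with replacement."""
    return [rng.choice(data) for _ in range(n)]


def gaussian_generator(data, n, rng):
    """Hypothetical toy generator: draw from a normal fitted to the original data."""
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def compare_generators(original, generators, seed=0):
    """Return {name: |mean(synthetic) - mean(original)|} for each generator."""
    target = statistics.fmean(original)
    results = {}
    for name, generate in generators.items():
        # Fresh, identically seeded RNG per generator keeps runs reproducible.
        rng = random.Random(seed)
        synthetic = generate(original, len(original), rng)
        results[name] = abs(statistics.fmean(synthetic) - target)
    return results


# Simulated "original" data; a real study would load confidential microdata here.
data_rng = random.Random(42)
original = [data_rng.gauss(50.0, 10.0) for _ in range(2000)]

generators = {"bootstrap": bootstrap_generator, "gaussian": gaussian_generator}
results = compare_generators(original, generators)

# Smaller gaps mean the synthetic data reproduce this particular analysis better.
for name, gap in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: absolute gap in mean = {gap:.3f}")
```

In a real study, each entry in `generators` would wrap a trained synthesizer, and the single mean would be replaced by the analyses of interest, such as regression coefficients or cross-tabulations.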

Metadata
Title
An Evaluation of Synthetic Data Generators Implemented in the Python Library Synthcity
Authors
Emma Fössing
Jörg Drechsler
Copyright Year
2024
DOI
https://doi.org/10.1007/978-3-031-69651-0_12
