2024 | OriginalPaper | Book chapter

Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator

Written by: Jonathan Latner, Marcel Neunhoeffer, Jörg Drechsler

Published in: Privacy in Statistical Databases

Publisher: Springer Nature Switzerland

Abstract

In recent years, a growing number of synthetic data generators (SDGs) based on various modeling strategies have been implemented as Python libraries or R packages. With this proliferation of ready-made SDGs comes a widely held perception that generating synthetic data is easy. We show that generating synthetic data is a complicated process that requires an understanding of both the original dataset and the synthetic data generator. We make two contributions to the literature. First, we show that pre-processing or cleaning the data is just as important as tuning the SDG for creating synthetic data with high levels of utility. Second, we illustrate that it is critical to understand the methodological details of the SDG, both to be aware of potential pitfalls and to know for which types of analysis tasks one can expect high levels of analytical validity.

Footnotes
1
Throughout the paper, when we refer to data, we are referring to classical microdata (i.e., one observation per individual unit), as opposed to summary tables or images.
 
2
Note that there are different philosophies about how to define the original data and how much pre-processing (e.g., dealing with missing values or outliers) one should apply to it before synthesis. The answer depends on the synthesis goal: replacing the original data, or providing a tool for preparing to work with the original data in a safe environment. In Sect. 3 we describe the data and any pre-processing steps in detail.
 
5
Early versions of CTGAN could not be used on data with missing values (https://github.com/sdv-dev/CTGAN/issues/39).
 
6
Default means that CART models are used for synthesis with complexity parameter = 0.001 (smaller values will grow larger trees), and minbucket = 5 (the minimum number of observations in any terminal node).
 
7
The record with a BMI of 450 has height (cm) = 149 and weight (kg) = NA. If we calculate weight from BMI and height, then weight equals 999 kg, or roughly one metric ton.
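The arithmetic in this footnote can be checked with a minimal sketch. It assumes the standard definition BMI = weight (kg) / height (m)^2; the function name `weight_from_bmi` is hypothetical and not taken from the paper.

```python
def weight_from_bmi(bmi: float, height_cm: float) -> float:
    """Invert BMI = weight_kg / height_m**2 to recover weight in kg."""
    height_m = height_cm / 100.0  # convert centimetres to metres
    return bmi * height_m ** 2


# Values from the footnote: BMI = 450, height = 149 cm.
weight_kg = weight_from_bmi(bmi=450, height_cm=149)
print(round(weight_kg))  # 999
```

An implied weight of this size is a plausibility check that flags the record as a data error to resolve before synthesis.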
 
8
We note that the DataSynthesizer paper [12] states, “when invoked in correlated attribute mode, DataDescriber samples attribute values in appropriate order from the Bayesian network.” However, in the code, it seems that data are created by uniform sampling within a bin (https://github.com/DataResponsibly/DataSynthesizer/blob/90722857e7f6ed736aaa25068ecf9e77f34f896a/DataSynthesizer/datatypes/AbstractAttribute.py#L125). This illustrates the challenge in understanding the methodological details of a given SDG.
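To make "uniform sampling within a bin" concrete, here is a minimal sketch of the general technique. It is an illustration only and not DataSynthesizer's code: the function name, the bin edges, and the probabilities are all hypothetical. A bin is first drawn according to its probability, and a value is then drawn uniformly between that bin's edges.

```python
import random


def sample_uniform_within_bins(bin_edges, bin_probs, n, seed=0):
    """Draw n values: choose a bin by probability, then sample uniformly inside it.

    bin_edges has one more element than bin_probs; bin i spans
    bin_edges[i] to bin_edges[i + 1]. All names here are illustrative.
    """
    if len(bin_edges) != len(bin_probs) + 1:
        raise ValueError("bin_edges must have exactly one more entry than bin_probs")
    rng = random.Random(seed)
    # Step 1: pick a bin index for every draw, weighted by the bin probabilities.
    indices = rng.choices(range(len(bin_probs)), weights=bin_probs, k=n)
    # Step 2: place each draw uniformly between the edges of its bin.
    return [rng.uniform(bin_edges[i], bin_edges[i + 1]) for i in indices]


# Hypothetical example: two equally likely bins covering 0-10 and 10-20.
samples = sample_uniform_within_bins([0, 10, 20], [0.5, 0.5], n=1000)
```

Under this scheme, any structure inside a bin (for example, heaping at round numbers) is flattened, because every value within a bin is equally likely.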
 
9
In terms of computing power, SDGs were run on a 2022 MacBook Air with 16 GB of RAM and an M2 chip with an 8-core CPU, 8-core GPU, and 16-core Neural Engine. All SDGs were run one at a time to minimize computational resource problems from parallelization.
 
References
1.
Dankar, F.K., Ibrahim, M.: Fake it till you make it: guidelines for effective synthetic data generation. Appl. Sci. 11(5), 21–58 (2021)
3.
Drechsler, J., Reiter, J.: Disclosure risk and data utility for partially synthetic data: an empirical study using the German IAB establishment survey. J. Official Stat. 25(4), 589–603 (2009)
4.
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
5.
Jordon, J., et al.: Synthetic data – what, why and how? (2022)
6.
Liew, C.K., Choi, U.J., Liew, C.J.: A data distortion by probability distribution. ACM Trans. Database Syst. (TODS) 10(3), 395–411 (1985)
8.
Little, R.J., et al.: Statistical analysis of masked data. J. Official Stat. 9, 407–407 (1993)
9.
Nowok, B., Raab, G.M., Dibben, C.: synthpop: bespoke creation of synthetic data in R. J. Stat. Softw. 74, 1–26 (2016)
10.
Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384 (2018)
12.
Ping, H., Stoyanovich, J., Howe, B.: Datasynthesizer: privacy-preserving synthetic datasets. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pp. 1–5 (2017)
13.
14.
Rubin, D.B.: Statistical disclosure limitation. J. Official Stat. 9(2), 461–468 (1993)
15.
Snoke, J., Raab, G.M., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. R. Stat. Soc. Ser. A Stat. Soc. 181(3), 663–688 (2018)
16.
Woo, M.J., Reiter, J.P., Oganian, A., Karr, A.F.: Global measures of data utility for microdata masked for disclosure limitation. J. Priv. Confidentiality 1(1) (2009)
17.
Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems (2019)
18.
Young, J., Graham, P., Penny, R.: Using Bayesian networks to create synthetic data. J. Official Stat. 25(4), 549–567 (2009)
19.
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 1–41 (2017)
Metadata
Title
Generating Synthetic Data is Complicated: Know Your Data and Know Your Generator
Written by
Jonathan Latner
Marcel Neunhoeffer
Jörg Drechsler
Copyright year
2024
DOI
https://doi.org/10.1007/978-3-031-69651-0_8