
2024 | OriginalPaper | Chapter

Synthetic Data: Comparing Utility and Risk in Microdata and Tables

Authors: Simon Xi Ning Kolb, Jui Andreas Tang, Sarah Giessing

Published in: Privacy in Statistical Databases

Publisher: Springer Nature Switzerland


Abstract

Synthetic data has begun to show potential as an alternative to traditional statistical disclosure control (SDC) methods in specific use cases. This development, together with increasing research efforts, hints at an emerging role in future privacy protection. However, because data synthesis predominantly happens at the microdata level, the development of utility and risk metrics has also focused on this domain. Statistical agencies, on the other hand, mostly limit data publication to aggregates, selecting various subsets of variables for cross-tabulation. We analyze the correlations between microdata and tabular data metrics for assessing utility and risk. Using a large real-life data set as an example for data synthesis, we show that certain global metrics may disproportionately represent small subsets of variables, making them inappropriate estimators of the quality of aggregates. On the other hand, we show strong similarities between certain microdata-level risk metrics and the risks of group disclosure in aggregated data.
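
To make the distinction between a global microdata metric and table-level quality concrete, here is a minimal sketch in Python. It is not the metric set used in the chapter: the total variation distance, the function names, and the toy records are all assumptions chosen for illustration. It compares one distance computed on the full joint distribution with the same distance computed on every two-way cross-tabulation.

```python
from collections import Counter
from itertools import combinations


def crosstab(records, variables):
    """Count records per combination of the selected variables."""
    return Counter(tuple(r[v] for v in variables) for r in records)


def total_variation(original, synthetic, variables):
    """Total variation distance between two relative frequency tables."""
    t_orig = crosstab(original, variables)
    t_syn = crosstab(synthetic, variables)
    cells = set(t_orig) | set(t_syn)
    # Counter returns 0 for cells that are missing from one of the tables.
    return 0.5 * sum(
        abs(t_orig[c] / len(original) - t_syn[c] / len(synthetic))
        for c in cells
    )


def table_distances(original, synthetic, variables, order=2):
    """Distance for every cross-tabulation of `order` variables."""
    return {
        subset: total_variation(original, synthetic, subset)
        for subset in combinations(variables, order)
    }


# Hypothetical toy records (not from the chapter's data set).
original = [
    {"sex": "f", "region": "north", "employed": "yes"},
    {"sex": "f", "region": "south", "employed": "no"},
    {"sex": "m", "region": "north", "employed": "no"},
    {"sex": "m", "region": "south", "employed": "yes"},
]
synthetic = [
    {"sex": "f", "region": "north", "employed": "no"},
    {"sex": "f", "region": "south", "employed": "yes"},
    {"sex": "m", "region": "north", "employed": "yes"},
    {"sex": "m", "region": "south", "employed": "no"},
]
variables = ["sex", "region", "employed"]

# Global view: distance on the full joint distribution of all variables.
global_distance = total_variation(original, synthetic, variables)
# Tabular view: distance on each two-way table an agency might publish.
per_table = table_distances(original, synthetic, variables)

print("global:", global_distance)
for subset, distance in per_table.items():
    print(subset, distance)
```

In this toy case the full joint distributions do not overlap at all, yet every two-way table is reproduced exactly, so a single global score would say little about the quality of the published aggregates. Real data would sit between these extremes, which is why the relationship between the two kinds of metric has to be checked empirically.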

Metadata

Title: Synthetic Data: Comparing Utility and Risk in Microdata and Tables
Authors: Simon Xi Ning Kolb, Jui Andreas Tang, Sarah Giessing
Copyright Year: 2024
DOI: https://doi.org/10.1007/978-3-031-69651-0_15
