Skip to main content

2024 | OriginalPaper | Buchkapitel

Privacy and Utility Evaluation of Synthetic Tabular Data for Machine Learning

verfasst von : Felix Hermsen, Avikarsha Mandal

Erschienen in: Privacy and Identity Management. Sharing in a Digital World

Verlag: Springer Nature Switzerland

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Synthetic data generation approaches have attracted a lot of attention as a potential substitute for classical anonymization methods. However, synthetic data still pose a wide range of privacy risks, for example, dataset containing data points close to real data points, thus, increasing risks of linkage attacks. While differentially private generative models are generally considered immune to privacy attacks, it is not immediately evident how these models maintain privacy with reasonable utility. In this study, we evaluate the privacy and utility trade-offs in synthetic data generated by the state-of-the-art generative model CTGAN and its differentially private variant DPCTGAN for mixed tabular data domain. We conduct experiments using widely recognized benchmark datasets to highlight the importance of selecting optimal hyperparameters such that the model converges during training and produces synthetic data with satisfactory utility. Our experiments show that synthetic data generators, which were trained with differential privacy, may experience collapse during the training phase. While the addition of a smaller noise allows the training to converge, still could limit risks against privacy attacks such as membership inference and linkage.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318 (2016) Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318 (2016)
2.
Zurück zum Zitat Abufadda, M., Mansour, K.: A survey of synthetic data generation for machine learning. In: 2021 22nd International Arab Conference on Information Technology (ACIT), pp. 1–7 (2021) Abufadda, M., Mansour, K.: A survey of synthetic data generation for machine learning. In: 2021 22nd International Arab Conference on Information Technology (ACIT), pp. 1–7 (2021)
3.
Zurück zum Zitat Asuncion, A., Newman, D.: UCI machine learning repository (2007) Asuncion, A., Newman, D.: UCI machine learning repository (2007)
4.
Zurück zum Zitat Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., Kasneci, G.: Language models are realistic tabular data generators (2022) Borisov, V., Seßler, K., Leemann, T., Pawelczyk, M., Kasneci, G.: Language models are realistic tabular data generators (2022)
5.
Zurück zum Zitat Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W.F., Sun, J.: Generating multi-label discrete patient records using generative adversarial networks. In: Machine Learning for Healthcare Conference, pp. 286–305. PMLR (2017) Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W.F., Sun, J.: Generating multi-label discrete patient records using generative adversarial networks. In: Machine Learning for Healthcare Conference, pp. 286–305. PMLR (2017)
6.
Zurück zum Zitat Dankar, F.K., Ibrahim, M., Castelli, M.: Fake it till you make it: guidelines for effective synthetic data generation. Appl. Sci. (2076-3417) 11(5), 2158 (2021)CrossRef Dankar, F.K., Ibrahim, M., Castelli, M.: Fake it till you make it: guidelines for effective synthetic data generation. Appl. Sci. (2076-3417) 11(5), 2158 (2021)CrossRef
9.
Zurück zum Zitat Ford, N.: List of data breaches and cyber attacks in 2023 (2023) Ford, N.: List of data breaches and cyber attacks in 2023 (2023)
10.
Zurück zum Zitat Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. (CSUR) 42(4), 1–53 (2010)CrossRef Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. (CSUR) 42(4), 1–53 (2010)CrossRef
11.
Zurück zum Zitat Giomi, M., Boenisch, F., Wehmeyer, C., Tasnádi, B.: A unified framework for quantifying privacy risk in synthetic data (2023) Giomi, M., Boenisch, F., Wehmeyer, C., Tasnádi, B.: A unified framework for quantifying privacy risk in synthetic data (2023)
12.
Zurück zum Zitat Goswami, P., Madan, S.: Privacy preserving data publishing and data anonymization approaches: a review. In: 2017 International Conference on Computing, Communication and Automation (ICCCA), pp. 139–142. IEEE (2017) Goswami, P., Madan, S.: Privacy preserving data publishing and data anonymization approaches: a review. In: 2017 International Conference on Computing, Communication and Automation (ICCCA), pp. 139–142. IEEE (2017)
13.
Zurück zum Zitat Hernandez, M., Epelde, G., Alberdi, A., Cilla, R., Rankin, D.: Synthetic data generation for tabular health records: a systematic review. Neurocomputing 493, 28–45 (2022)CrossRef Hernandez, M., Epelde, G., Alberdi, A., Cilla, R., Rankin, D.: Synthetic data generation for tabular health records: a systematic review. Neurocomputing 493, 28–45 (2022)CrossRef
14.
Zurück zum Zitat Hilprecht, B., Härterich, M., Bernau, D.: Monte Carlo and reconstruction membership inference attacks against generative models. Proc. Priv. Enhanc. Technol. 2019(4), 232–249 (2019) Hilprecht, B., Härterich, M., Bernau, D.: Monte Carlo and reconstruction membership inference attacks against generative models. Proc. Priv. Enhanc. Technol. 2019(4), 232–249 (2019)
15.
16.
Zurück zum Zitat Li, Z., Zhao, Y., Botta, N., Ionescu, C., Hu, X.: COPOD: copula-based outlier detection. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 1118–1123. IEEE (2020) Li, Z., Zhao, Y., Botta, N., Ionescu, C., Hu, X.: COPOD: copula-based outlier detection. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 1118–1123. IEEE (2020)
18.
Zurück zum Zitat Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 111–125. IEEE (2008) Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 111–125. IEEE (2008)
19.
Zurück zum Zitat Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets: a decade later. May 21, 2019 (2019) Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets: a decade later. May 21, 2019 (2019)
21.
Zurück zum Zitat Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018)CrossRef Park, N., Mohammadi, M., Gorde, K., Jajodia, S., Park, H., Kim, Y.: Data synthesis based on generative adversarial networks. Proc. VLDB Endow. 11(10), 1071–1083 (2018)CrossRef
22.
Zurück zum Zitat Ponomareva, N., et al.: How to DP-fy ML: a practical guide to machine learning with differential privacy. J. Artif. Intell. Res. 77, 1113–1201 (2023)MathSciNetCrossRef Ponomareva, N., et al.: How to DP-fy ML: a practical guide to machine learning with differential privacy. J. Artif. Intell. Res. 77, 1113–1201 (2023)MathSciNetCrossRef
23.
Zurück zum Zitat Rigaki, M., Garcia, S.: A survey of privacy attacks in machine learning. ACM Comput. Surv. 56(4), 1–34 (2023)CrossRef Rigaki, M., Garcia, S.: A survey of privacy attacks in machine learning. ACM Comput. Surv. 56(4), 1–34 (2023)CrossRef
24.
Zurück zum Zitat Rosenblatt, L., Liu, X., Pouyanfar, S., de Leon, W., Desai, A., Allen, J.: Differentially private synthetic data: applied evaluations and enhancements. arXiv preprint arXiv:2011.05537 (2020) Rosenblatt, L., Liu, X., Pouyanfar, S., de Leon, W., Desai, A., Allen, J.: Differentially private synthetic data: applied evaluations and enhancements. arXiv preprint arXiv:​2011.​05537 (2020)
25.
Zurück zum Zitat Solatorio, A.V., Dupriez, O.: REaLTabFormer: generating realistic relational and tabular data using transformers (2023) Solatorio, A.V., Dupriez, O.: REaLTabFormer: generating realistic relational and tabular data using transformers (2023)
26.
Zurück zum Zitat Stadler, T., Oprisanu, B., Troncoso, C.: Synthetic data–anonymisation groundhog day. In: 31st USENIX Security Symposium (USENIX Security 2022), pp. 1451–1468 (2022) Stadler, T., Oprisanu, B., Troncoso, C.: Synthetic data–anonymisation groundhog day. In: 31st USENIX Security Symposium (USENIX Security 2022), pp. 1451–1468 (2022)
29.
Zurück zum Zitat Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems, vol. 32 (2019) Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
30.
Zurück zum Zitat Yoon, J., Drumright, L.N., Van Der Schaar, M.: Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE J. Biomed. Health Inform. 24(8), 2378–2388 (2020)CrossRef Yoon, J., Drumright, L.N., Van Der Schaar, M.: Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE J. Biomed. Health Inform. 24(8), 2378–2388 (2020)CrossRef
31.
Zurück zum Zitat Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Asian Conference on Machine Learning, pp. 97–112. PMLR (2021) Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Asian Conference on Machine Learning, pp. 97–112. PMLR (2021)
32.
Zurück zum Zitat Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Balasubramanian, V.N., Tsang, I. (eds.) Proceedings of the 13th Asian Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 157, pp. 97–112. PMLR (2021) Zhao, Z., Kunar, A., Birke, R., Chen, L.Y.: CTAB-GAN: effective table data synthesizing. In: Balasubramanian, V.N., Tsang, I. (eds.) Proceedings of the 13th Asian Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 157, pp. 97–112. PMLR (2021)
Metadaten
Titel
Privacy and Utility Evaluation of Synthetic Tabular Data for Machine Learning
verfasst von
Felix Hermsen
Avikarsha Mandal
Copyright-Jahr
2024
DOI
https://doi.org/10.1007/978-3-031-57978-3_17