
2024 | OriginalPaper | Book chapter

A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics

Authors: Mohamed Aghaddar, Liu Nuo Su, Manel Slokom, Lucas Barnhoorn, Peter-Paul de Wolf

Published in: Privacy in Statistical Databases

Publisher: Springer Nature Switzerland


Abstract

In this paper, we investigate different approaches for generating synthetic microdata from open-source aggregated data. Specifically, we focus on macro-to-micro data synthesis. We explore the potential of the Gaussian copula framework to estimate joint distributions from aggregated data. Our generated synthetic data is intended for educational and software-testing use cases. We propose three scenarios for achieving realistic, high-quality synthetic microdata: (1) zero knowledge, (2) internal knowledge, and (3) external knowledge. The three scenarios assume different levels of knowledge about the underlying properties of the real microdata, i.e., standard deviation and covariate. Our evaluation includes matching tests to assess the privacy of the synthetic datasets. Our results indicate that macro-to-micro synthesis preserves privacy better than the other methods, demonstrating both the potential and the challenges of synthetic data generation in maintaining data privacy while providing useful data for analysis.
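To make the Gaussian copula idea concrete, the following is a minimal sketch of how records could be sampled when only aggregates are available. It is not the authors' implementation. The function name `sample_gaussian_copula`, the two example variables, and all input values (mean, standard deviation, category proportions, and the latent correlation `rho`) are hypothetical and chosen purely for illustration. The sketch assumes one numeric variable with a normal marginal and one categorical variable described by published proportions.

```python
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, mean, sd, categories, proportions, seed=0):
    """Draw n synthetic (value, category) records with a Gaussian copula.

    All inputs are hypothetical aggregates: a mean and standard deviation for
    the numeric variable, proportions for the categorical variable, and an
    assumed latent correlation rho between the two.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    numeric_marginal = NormalDist(mean, sd)

    # Cumulative proportions act as the inverse CDF of the categorical marginal.
    cumulative = []
    total = 0.0
    for p in proportions:
        total += p
        cumulative.append(total)

    records = []
    for _ in range(n):
        # Step 1: correlated standard normal pair (2x2 Cholesky factor written out).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)

        # Step 2: map each latent value to (0, 1) with the standard normal CDF.
        # The clamp keeps u1 strictly inside (0, 1), as inv_cdf requires.
        u1 = min(max(std_normal.cdf(z1), 1e-12), 1.0 - 1e-12)
        u2 = std_normal.cdf(z2)

        # Step 3: push the uniforms through each marginal's inverse CDF.
        value = numeric_marginal.inv_cdf(u1)
        category = categories[-1]  # fallback guards against rounding in the sums
        for cat, bound in zip(categories, cumulative):
            if u2 <= bound:
                category = cat
                break
        records.append((value, category))
    return records


synthetic = sample_gaussian_copula(
    n=5000, rho=0.6, mean=42.0, sd=12.0,
    categories=["low", "mid", "high"], proportions=[0.5, 0.3, 0.2],
)
```

The marginals come only from aggregate figures, while the dependence between variables is carried entirely by `rho`. In this sketch, the different knowledge scenarios described in the abstract would correspond to how much is known about quantities such as `sd` and `rho`; that mapping is an assumption made for illustration, not a statement of the paper's procedure.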

Metadata
Title
A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics
Authors
Mohamed Aghaddar
Liu Nuo Su
Manel Slokom
Lucas Barnhoorn
Peter-Paul de Wolf
Copyright year
2024
DOI
https://doi.org/10.1007/978-3-031-69651-0_28