2024 | OriginalPaper | Chapter

A Case Study Exploring Data Synthesis Strategies on Tabular vs. Aggregated Data Sources for Official Statistics

Authors: Mohamed Aghaddar, Liu Nuo Su, Manel Slokom, Lucas Barnhoorn, Peter-Paul de Wolf

Published in: Privacy in Statistical Databases

Publisher: Springer Nature Switzerland

Abstract

In this paper, we investigate different approaches for generating synthetic microdata from open-source aggregated data. Specifically, we focus on macro-to-micro data synthesis. We explore the potential of the Gaussian copula framework to estimate joint distributions from aggregated data. Our generated synthetic data are intended for educational and software-testing use cases. We propose three scenarios for obtaining realistic, high-quality synthetic microdata: (1) zero knowledge, (2) internal knowledge, and (3) external knowledge. The scenarios differ in how much is known about the underlying properties of the real microdata, i.e., the standard deviation and the covariate. Our evaluation includes matching tests to assess the privacy of the synthetic datasets. Our results indicate that macro-to-micro synthesis preserves privacy better than the other methods considered, demonstrating both the potential and the challenges of synthetic data generation in maintaining data privacy while providing useful data for analysis.
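As a rough illustration of the macro-to-micro idea, the sketch below draws synthetic records from two published marginal tables using a bivariate Gaussian copula. It is not the authors' implementation: the count tables, the correlation value `rho`, and the function names are hypothetical, and the categories are assumed to be ordinal so that a correlation between them is meaningful.

```python
import random
from bisect import bisect_left
from itertools import accumulate
from statistics import NormalDist


def make_inverse_cdf(counts):
    """Return the inverse CDF of a categorical marginal given as {category: count}."""
    categories = list(counts)
    total = sum(counts.values())
    cumulative = [c / total for c in accumulate(counts.values())]
    cumulative[-1] = 1.0  # guard against floating-point drift

    def inverse_cdf(u):
        # First category whose cumulative share reaches u.
        return categories[bisect_left(cumulative, u)]

    return inverse_cdf


def sample_gaussian_copula(marginal_a, marginal_b, rho, n, seed=0):
    """Draw n synthetic (a, b) records whose marginals follow the given
    count tables and whose dependence is a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    inv_a = make_inverse_cdf(marginal_a)
    inv_b = make_inverse_cdf(marginal_b)
    records = []
    for _ in range(n):
        # Correlated standard normals (2x2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        # Map to uniforms, then through each marginal's inverse CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        records.append((inv_a(u1), inv_b(u2)))
    return records


# Hypothetical aggregated tables (illustrative numbers, not real statistics).
age_counts = {"0-24": 300, "25-64": 500, "65+": 200}
income_counts = {"low": 400, "middle": 400, "high": 200}

# rho is an assumed input here; how much is known about the dependence
# structure is exactly what differs between knowledge scenarios.
sample = sample_gaussian_copula(age_counts, income_counts, rho=0.8, n=10)
for record in sample:
    print(record)
```

With `rho=0.0` the two variables are sampled independently, which is all the marginal tables alone can justify; any non-zero value has to come from additional knowledge about the underlying microdata.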

Metadata
Copyright Year
2024
DOI
https://doi.org/10.1007/978-3-031-69651-0_28
