Abstract
Statistical simulation in survey statistics is usually based on repeatedly drawing samples from population data. Furthermore, population data may be used in courses on survey statistics to explain issues regarding, e.g., sampling designs. Since the availability of real population data is in general very limited, it is necessary to generate synthetic data for such applications. The simulated data need to be as realistic as possible, while at the same time ensuring data confidentiality. This paper proposes a method for generating close-to-reality population data for complex household surveys. The procedure consists of four steps: setting up the household structure, simulating categorical variables, simulating continuous variables, and splitting continuous variables into different components. Not all four steps need to be performed, which makes the framework applicable to a broad class of surveys. In addition, the proposed method is evaluated in an application to the European Union Statistics on Income and Living Conditions (EU-SILC).
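To make the four-step structure concrete, the following is a minimal toy sketch in Python. It is not the authors' implementation: the abstract does not specify the models used in each step, so simple resampling from a made-up sample stands in for them, and sample weights are ignored. The data, the `age_group` helper, the `simulate_population` function, and all variable names are hypothetical.

```python
import random
from collections import defaultdict

# Entirely made-up survey sample: one record per person.
# "shares" are hypothetical income-component shares (e.g. wages, benefits).
SAMPLE = [
    {"hh": 1, "age": 45, "sex": "f", "status": "employed", "income": 30000.0, "shares": (0.9, 0.1)},
    {"hh": 1, "age": 47, "sex": "m", "status": "employed", "income": 28000.0, "shares": (1.0, 0.0)},
    {"hh": 1, "age": 12, "sex": "f", "status": "inactive", "income": 0.0, "shares": (0.0, 0.0)},
    {"hh": 2, "age": 70, "sex": "m", "status": "retired", "income": 18000.0, "shares": (0.0, 1.0)},
    {"hh": 3, "age": 33, "sex": "f", "status": "employed", "income": 35000.0, "shares": (0.8, 0.2)},
    {"hh": 3, "age": 35, "sex": "m", "status": "inactive", "income": 4000.0, "shares": (0.0, 1.0)},
]


def age_group(age):
    """Coarse age classes used as the conditioning variable (an assumption)."""
    return "child" if age < 16 else "working" if age < 65 else "senior"


def simulate_population(sample, n_households, seed=0):
    rng = random.Random(seed)

    # Index the sample: household templates and donor pools.
    households = defaultdict(list)
    status_by_group = defaultdict(list)
    donors_by_status = defaultdict(list)
    for p in sample:
        households[p["hh"]].append(p)
        status_by_group[age_group(p["age"])].append(p["status"])
        donors_by_status[p["status"]].append(p)
    templates = list(households.values())

    population = []
    for hh_id in range(1, n_households + 1):
        # Step 1: household structure (size, age, sex) resampled from the sample.
        template = rng.choice(templates)
        for member in template:
            person = {"hh": hh_id, "age": member["age"], "sex": member["sex"]}
            # Step 2: categorical variable drawn conditional on the age group.
            person["status"] = rng.choice(status_by_group[age_group(person["age"])])
            # Step 3: continuous variable drawn conditional on the category,
            # here a donor value perturbed by up to 10 percent.
            donor = rng.choice(donors_by_status[person["status"]])
            person["income"] = donor["income"] * rng.uniform(0.9, 1.1)
            # Step 4: split the continuous variable using the donor's shares.
            person["components"] = tuple(person["income"] * s for s in donor["shares"])
            population.append(person)
    return population


population = simulate_population(SAMPLE, n_households=50, seed=1)
```

Because each step only conditions on the output of the previous ones, any step can be dropped when a survey does not need it; for example, omitting step 4 still yields a population with household structure, status, and income.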
Additional information
This work was partly funded by the European Union (represented by the European Commission) within the 7th framework programme for research (Theme 8, Socio-Economic Sciences and Humanities, Project AMELI (Advanced Methodology for European Laeken Indicators), Grant Agreement No. 217322). Visit http://ameli.surveystatistics.net for more information on the project.
Cite this article
Alfons, A., Kraft, S., Templ, M. et al. Simulation of close-to-reality population data for household surveys with application to EU-SILC. Stat Methods Appl 20, 383–407 (2011). https://doi.org/10.1007/s10260-011-0163-2