Quality & Quantity

Published: 01-05-2014

An alternative procedure for imputing missing data based on principal components analysis

Author: Giovanni Di Franco

Published in: Quality & Quantity | Issue 3/2014

Abstract

This work tackles the significant problem of missing data, which was solved by identifying a new substitution procedure following an empirical approach based on the analysis of the information contained in the entire set of data collected. This procedure offers a number of advantages compared to other techniques commonly mentioned in the statistical–methodological literature.

In general, it is always advisable to examine the socio-demographic profiles of cases with missing data very carefully. Furthermore, an anomalous incidence of missing information may indicate the emergence of problems during the data collection (difficulty in understanding the questions encountered by the interviewees, inadequate administration of the interviews by the interviewers, etc.).

In Italy this situation is typical of all surveys investigating voting intentions: non-responses nearly always reach, and at times even surpass, 50% of the sample.

The literature dealing with missing data substitution is ample and rapidly evolving. For further details see, among others, Enders (2010), Holenberghs and Kenward (2007), Chantala and Suchindran (2003), Akritas et al. (2002), Little and Rubin (2002), Allison (2001), Huisman et al. (1998), Little (1997) and Little and Schenker (1994).

When, due to limited economic and/or time resources, it is not possible to apply substitution, the so-called ‘mortality’ of the sample is compensated for by recourse to appropriate weighting techniques that restore the original sample size. As these operations lead to serious problems, the researcher must apply them with the utmost caution (for a critique of weighting techniques see Di Franco 2010).

Accordingly, the entire sample is divided into a number of sub-samples on the basis of socio-demographic variables (usually age and educational level) used as stratifying criteria, and each missing value is replaced by the central value of the class it belongs to. Through this procedure, one seeks to take into account the potential differences between the cases showing missing data across the different classes. For instance, it is possible that the central tendency measures of variables constructed on the basis of some opinion questions vary according to the age or educational level of the respondents. Through this expedient, it should be possible to improve the quality of each substituted value and, at the same time, limit the shrinkage of the variance. To avoid the reduction of variance, a random procedure has also been suggested. This may be used with any kind of variable and involves substituting each missing value with a different random value. These random values should be taken from the same socio-demographic class as the case, and should be chosen in such a way that the values are not all equally probable, but have probabilities proportional to their frequencies among the cases which present data for that variable. This procedure leaves unaltered the original central value, the variance, and the distribution of cases both for the complete sample and for the sub-types identified. Clearly, it is not claimed that the value substituted for a single case comes close to that case's actual state on the variable.
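The two class-based substitutions described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the field names (`age`, `score`), the sample records, the use of `None` to mark a missing value, and the choice of the mean as the "central value" are all hypothetical. Drawing uniformly from the list of observed values in a class gives each distinct value a probability proportional to its frequency, which is the property the note asks for.

```python
import random
from collections import defaultdict
from statistics import mean


def observed_by_class(records, class_key, var):
    """Group the observed (non-missing) values of `var` by class."""
    groups = defaultdict(list)
    for rec in records:
        if rec[var] is not None:
            groups[rec[class_key]].append(rec[var])
    return groups


def impute_class_mean(records, class_key, var):
    """Replace each missing value with the mean of its own class."""
    groups = observed_by_class(records, class_key, var)
    filled = []
    for rec in records:
        rec = dict(rec)  # copy, so the input records are left untouched
        donors = groups.get(rec[class_key], [])
        if rec[var] is None and donors:
            rec[var] = mean(donors)
        filled.append(rec)
    return filled


def impute_class_random(records, class_key, var, seed=0):
    """Replace each missing value with a random observed value of its class.

    A uniform draw from the list of observed values selects each distinct
    value with probability proportional to its frequency in the class.
    """
    groups = observed_by_class(records, class_key, var)
    rng = random.Random(seed)
    filled = []
    for rec in records:
        rec = dict(rec)
        donors = groups.get(rec[class_key], [])
        if rec[var] is None and donors:
            rec[var] = rng.choice(donors)
        filled.append(rec)
    return filled


# Hypothetical data: an opinion score stratified by age class.
records = [
    {"age": "18-34", "score": 4.0},
    {"age": "18-34", "score": 6.0},
    {"age": "18-34", "score": None},
    {"age": "65+", "score": 1.0},
    {"age": "65+", "score": 3.0},
    {"age": "65+", "score": None},
]

mean_filled = impute_class_mean(records, "age", "score")
random_filled = impute_class_random(records, "age", "score", seed=1)
```

A class with no observed values is left missing in both functions; in practice one would probably merge such a class with a neighbouring one before imputing.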

In the case of categorical missing data, imputation through regression may be carried out by means of logistic regression analysis (Di Franco 2011a). For nominal variables, substitution through hot-deck or cold-deck imputation is more appropriate.

The N versions of the complete data matrix are analysed using standard statistical techniques and the results combined using simple rules in order to reach single joint estimates, which formally incorporate the intrinsic uncertainty of the missing data. Therefore, the results of the estimates are averages computed on the N matrices of complete data.
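As a rough illustration of how the N sets of results might be combined, the sketch below applies the standard pooling rules usually attributed to Rubin: the joint estimate is the average of the N estimates, and its variance adds the between-imputation variance to the average within-imputation variance. Whether these are exactly the "simple rules" meant here is an assumption, and the function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine N per-matrix estimates into one estimate and its variance.

    estimates: the point estimate obtained from each completed data matrix
    variances: the squared standard error of each of those estimates
    """
    n = len(estimates)
    if n < 2 or n != len(variances):
        raise ValueError("need at least two estimates and one variance for each")
    pooled = mean(estimates)       # joint estimate: average over the N matrices
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (n - 1 denominator)
    total = within + (1 + 1 / n) * between
    return pooled, total


# Hypothetical results from N = 3 completed data matrices.
pooled, total = pool_estimates([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what carries the uncertainty due to the missing data: if all N matrices gave the same estimate, the total variance would reduce to the within-imputation average.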

The data used in the research project were drawn from European and international data banks (Eurostat, The World Bank, ITU, etc.). Some referred to structural characteristics of the countries; others were drawn from surveys regarding the use of computers as well as access to and use of the Internet (Di Franco 2011b).

For further details concerning the method used to construct the index see Di Franco (2011b).

For an in-depth discussion of PCA see Di Franco and Marradi (2003).

Akritas, M.G., Kuha, J., Osgood, D.W.: A nonparametric approach to matched pairs with missing data. Sociol. Methods Res. 30(3), 425–454 (2002)

Allison, P.D.: Missing Data. Quantitative Applications in the Social Sciences. Sage, Thousand Oaks (2001)

Chantala, K., Suchindran, C.: Multiple Imputation for Missing Data. SAS OnlineDocTM, Version 8 (2003)

Di Franco, G.: Tecniche e modelli di analisi multivariata. FrancoAngeli, Milan (2011a)

Di Franco, G.: Appendix: EDDI European digital development index: definition of methodology. Guerrieri e Bentivegna, 220–259 (2011b)

Di Franco, G.: Il campionamento nelle scienze umane. Teoria e pratica. FrancoAngeli, Milan (2010)

Di Franco, G.: EDS: esplorare, descrivere e sintetizzare i dati. Guida pratica all’analisi dei dati nella ricerca sociale. FrancoAngeli, Milan (2001)

Di Franco, G., Marradi, A.: Analisi fattoriale e analisi in componenti principali. Bonanno, Rome/Catania (2003)

Enders, C.K.: Applied Missing Data Analysis. Guilford, London/New York (2010)

Guerrieri, P., Bentivegna, S. (eds.): The Economic Impact of Digital Technologies. Measuring Inclusion and Diffusion in Europe. Edward Elgar, Cheltenham/Northampton (2011)

Holenberghs, G., Kenward, M.G.: Missing Data in Clinical Studies. Wiley, London (2007)

Huisman, M., Van Sondersen, E.: Handling missing data by re-approaching non-respondents. Qual. Quant. 32, 77–91 (1998)

Little, R.J.A.: Biostatistical analysis with missing data. In: Armitage, P., Colton, T. (eds.) Encyclopaedia of Biostatistics. Wiley, London (1997)

Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (2002)

Little, R.J.A., Schenker, N.: Missing data. In: Arminger, G., Clogg, C.C., Sobel, M.E. (eds.) Handbook for Statistical Modeling in the Social and Behavioral Sciences, pp. 39–75. Plenum, New York (1994)

Marradi, A.: Analisi monovariata. FrancoAngeli, Milan (1993)

OECD: Handbook on Constructing Composite Indicators: Methodology and User Guide, ISBN 978-92-64-04345-9, © OECD JRC European Commission (2008)

Title: An alternative procedure for imputing missing data based on principal components analysis
Author: Giovanni Di Franco
Publication date: 01-05-2014
Publisher: Springer Netherlands
Published in: Quality & Quantity / Issue 3/2014
Print ISSN: 0033-5177
Electronic ISSN: 1573-7845
DOI: https://doi.org/10.1007/s11135-013-9826-4
