Published in: Quality & Quantity 3/2014

01-05-2014

An alternative procedure for imputing missing data based on principal components analysis

Author: Giovanni Di Franco

Abstract

This work addresses the significant problem of missing data, which was solved by identifying a new substitution procedure that follows an empirical approach based on analysing the information contained in the entire set of data collected. This procedure offers a number of advantages over other techniques commonly mentioned in the statistical–methodological literature.

Footnotes
1
In general, it is always advisable to examine the socio-demographic profiles of cases with missing data very carefully. Furthermore, an anomalous incidence of missing information may indicate the emergence of problems during the data collection (difficulty in understanding the questions encountered by the interviewees, inadequate administration of the interviews by the interviewers, etc.).
 
2
In Italy this situation is typical of all surveys investigating voting intentions: non-responses nearly always reach, and at times even surpass, 50 % of the sample.
 
3
The literature dealing with missing data substitution is ample and rapidly evolving. For further details see, among others, Enders (2010), Holenberghs and Kenward (2007), Chantala and Suchindran (2003), Akritas et al. (2002), Little and Rubin (2002), Allison (2001), Huisman et al. (1998), Little (1997) and Little and Schenker (1994).
 
4
When, due to limited economic and/or time resources, it is not possible to apply substitution, the so-called ‘mortality’ of the sample is compensated for by recourse to appropriate weighting techniques used to restore the original sample size. As these operations lead to serious problems, the researcher must apply them with the utmost caution (for a critique of weighting techniques see Di Franco 2010).
 
5
Accordingly, the entire sample is divided into a number of sub-samples on the basis of socio-demographic variables (usually age and educational level) used as stratifying criteria, and each missing value is replaced by the central value of the class it belongs to. Through this procedure, one seeks to take into account the potential differences between the cases showing missing data across the different classes. For instance, it is possible that the central tendency measures of variables constructed on the basis of some opinion questions vary according to the age or educational level of the respondents. Through this expedient, it should be possible to improve the quality of each of the substituted data and, at the same time, limit the shrinkage of variance. To avoid the reduction of variance, a random procedure has also been suggested. This may be used with any kind of variable and involves substituting each missing value with a different random value. These random values should be taken from the same socio-demographic class as the case, and should be chosen in such a way that the values are not equally probable but have probabilities proportional to the frequencies of the cases which present data for that variable. This procedure leaves unaltered the original central value, the variance, and the distribution of cases both for the complete sample and for the sub-types identified. Clearly, it is not claimed that the value substituted for a single case comes close to that case's actual state on the variable.
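The two class-based substitutions described in this footnote can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the procedure proposed in the article: the function name `impute_by_class`, the record layout (a list of dictionaries with `None` marking a missing value), the use of the median as the "central value", and the toy data are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import median


def impute_by_class(records, class_key, var, rng=None):
    """Fill missing `var` values (None) class by class.

    With rng=None each gap gets the class median (assumed "central value").
    With an rng each gap gets a random draw from the observed values of the
    same class, so a value is picked in proportion to its frequency.
    """
    # Collect the observed values of `var` within each class.
    observed = defaultdict(list)
    for r in records:
        if r[var] is not None:
            observed[r[class_key]].append(r[var])

    filled = []
    for r in records:
        r = dict(r)  # copy so the input records are left untouched
        if r[var] is None:
            pool = observed.get(r[class_key])
            if pool:  # a class with no observed value stays missing
                r[var] = median(pool) if rng is None else rng.choice(pool)
        filled.append(r)
    return filled


# Hypothetical toy sample: one opinion score, stratified by age class.
sample = [
    {"age": "18-34", "score": 2}, {"age": "18-34", "score": 4},
    {"age": "18-34", "score": 4}, {"age": "18-34", "score": None},
    {"age": "65+", "score": 7}, {"age": "65+", "score": 9},
    {"age": "65+", "score": None},
]
by_median = impute_by_class(sample, "age", "score")
by_draw = impute_by_class(sample, "age", "score", rng=random.Random(0))
```

Drawing uniformly from the list of observed values, repeats included, is what makes each value's probability proportional to its frequency within the class.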
 
6
In the case of categorical data, missing-value imputation through regression may be carried out by means of logistic regression analysis (Di Franco 2011a). For nominal variables, substitution through hot-deck or cold-deck imputation is more appropriate.
 
7
The N versions of the complete data matrix are analysed using standard statistical techniques and the results combined using simple rules in order to reach single joint estimates, which formally incorporate the intrinsic uncertainty of the missing data. Therefore, the results of the estimates are averages computed on the N matrices of complete data.
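As a minimal sketch of the combination step, the Python fragment below pools one estimate per completed matrix. The footnote only says that "simple rules" are used and that the joint estimate is an average; the variance formula here follows the combining rules commonly attributed to Rubin and is an assumption, as are the function name `pool_estimates` and the example numbers.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine one estimate and its sampling variance per completed matrix.

    Requires at least two completed matrices.
    """
    n = len(estimates)
    pooled = mean(estimates)        # joint point estimate: plain average
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (n - 1 denominator)
    total = within + (1 + 1 / n) * between  # assumed rule: adds missing-data uncertainty
    return pooled, total


# Hypothetical results from N = 3 completed data matrices.
pooled, total = pool_estimates([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what lets the pooled variance reflect the uncertainty due to the missing data, rather than treating the imputed values as if they had been observed.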
 
8
The data used in the research project were drawn from European and international data banks (Eurostat, The World Bank, ITU, etc.). Some referred to structural characteristics of the countries; others were drawn from surveys regarding the use of computers as well as access to and use of the Internet (Di Franco 2011b).
 
9
For further details concerning the method used to construct the index see Di Franco (2011b).
 
10
For an in-depth discussion of PCA, see Di Franco and Marradi (2003).
 
Literature
Akritas, M.G., Kuha, J., Osgood, D.W.: A nonparametric approach to matched pairs with missing data. Sociol. Methods Res. 30(3), 425–454 (2002)
Allison, P.D.: Missing Data. Quantitative Applications in the Social Sciences. Sage, Thousand Oaks (2001)
Chantala, K., Suchindran, C.: Multiple Imputation for Missing Data. SAS OnlineDocTM, Version 8 (2003)
Di Franco, G.: Tecniche e modelli di analisi multivariata. FrancoAngeli, Milan (2011a)
Di Franco, G.: Appendix: EDDI European digital development index: definition of methodology. Guerrieri and Bentivegna, 220–259 (2011b)
Di Franco, G.: Il campionamento nelle scienze umane. Teoria e pratica. FrancoAngeli, Milan (2010)
Di Franco, G.: EDS: esplorare, descrivere e sintetizzare i dati. Guida pratica all’analisi dei dati nella ricerca sociale. FrancoAngeli, Milan (2001)
Di Franco, G., Marradi, A.: Analisi fattoriale e analisi in componenti principali. Bonanno, Rome/Catania (2003)
Enders, C.K.: Applied Missing Data Analysis. Guilford, London/New York (2010)
Guerrieri, P., Bentivegna, S. (eds.): The Economic Impact of Digital Technologies. Measuring Inclusion and Diffusion in Europe. Edward Elgar, Cheltenham/Northampton (2011)
Holenberghs, G., Kenward, M.G.: Missing Data in Clinical Studies. Wiley, London (2007)
Huisman, M., Van Sondersen, E.: Handling missing data by re-approaching non-respondents. Qual. Quant. 32, 77–91 (1998)
Little, R.J.A.: Biostatistical analysis with missing data. In: Armitage, P., Colton, T. (eds.) Encyclopaedia of Biostatistics. Wiley, London (1997)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, Hoboken (2002)
Little, R.J.A., Schenker, N.: Missing data. In: Arminger, G., Clogg, C.C., Sobel, M.E. (eds.) Handbook for Statistical Modeling in the Social and Behavioral Sciences, pp. 39–75. Plenum, New York (1994)
Marradi, A.: Analisi monovariata. FrancoAngeli, Milan (1993)
OECD: Handbook on Constructing Composite Indicators: Methodology and User Guide, ISBN 978-92-64-04345-9, © OECD JRC European Commission (2008)
Metadata
Title
An alternative procedure for imputing missing data based on principal components analysis
Author
Giovanni Di Franco
Publication date
01-05-2014
Publisher
Springer Netherlands
Published in
Quality & Quantity / Issue 3/2014
Print ISSN: 0033-5177
Electronic ISSN: 1573-7845
DOI
https://doi.org/10.1007/s11135-013-9826-4
