Published in: Advances in Data Analysis and Classification 1/2016

01-03-2016 | Regular Article

A principal component method to impute missing values for mixed data

Authors: Vincent Audigier, François Husson, Julie Josse

Abstract

We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, factorial analysis for mixed data, which balances the influence of the continuous and categorical variables in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based on both the similarity between individuals and the relationships between variables. The properties of the method are illustrated via simulations, and the quality of the imputation is assessed using real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113–118, 2011) and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
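
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `impute_mixed`, its parameters, and the convention of coding categories as integers with `-1` for a missing cell are all assumptions made for illustration. The sketch codes each categorical variable as an indicator block, scales continuous columns by their standard deviation and indicator columns by the square root of their category proportion (the usual weighting in factorial analysis for mixed data), and then alternates between a low-rank reconstruction and refilling the missing cells. It omits any regularization and any rule for choosing the number of components, both of which a practical method would need.

```python
import numpy as np


def impute_mixed(X_num, X_cat, n_components=2, n_iter=200, tol=1e-8):
    """Hypothetical sketch of iterative principal-component imputation.

    X_num: (n, p) float array, np.nan marks a missing value.
    X_cat: (n, q) int array of category codes 0..K-1, -1 marks a missing value.
    Returns the completed continuous array and the completed category codes.
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat, dtype=int)
    n, p = X_num.shape

    # One indicator block per categorical variable; a missing cell
    # turns the whole row of its block into NaN.
    blocks = []
    for j in range(X_cat.shape[1]):
        codes = X_cat[:, j]
        k = codes.max() + 1
        Z = np.full((n, k), np.nan)
        observed = codes >= 0
        Z[observed] = np.eye(k)[codes[observed]]
        blocks.append(Z)

    W = np.hstack([X_num] + blocks)
    missing = np.isnan(W)

    # Initial fill: column means (category proportions for indicators).
    W = np.where(missing, np.nanmean(W, axis=0), W)

    for _ in range(n_iter):
        mean = W.mean(axis=0)
        scale = np.empty(W.shape[1])
        scale[:p] = W[:, :p].std(axis=0)                      # continuous columns
        scale[p:] = np.sqrt(np.clip(mean[p:], 1e-12, None))   # indicator columns
        scale[scale <= 0] = 1.0

        # Rank-limited reconstruction of the weighted, centred table.
        Y = (W - mean) / scale
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        Y_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]

        # Only missing cells are updated; observed cells never change.
        W_new = np.where(missing, Y_hat * scale + mean, W)
        change = np.sum((W_new - W) ** 2)
        W = W_new
        if change < tol:
            break

    # Continuous part is read off directly; for each missing categorical
    # cell, pick the category with the largest imputed indicator value.
    num_imputed = W[:, :p]
    cat_imputed = X_cat.copy()
    start = p
    for j, Z in enumerate(blocks):
        k = Z.shape[1]
        rows = X_cat[:, j] < 0
        cat_imputed[rows, j] = W[rows, start:start + k].argmax(axis=1)
        start += k
    return num_imputed, cat_imputed
```

Because each row is reconstructed from the component scores (similarity between individuals) and each column from the loadings (relationships between variables), a missing cell borrows information from both directions, which is the property the abstract emphasizes.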

Literature
Benzécri JP (1973) L’analyse des données. L’analyse des correspondances. Dunod, Tome II
Bro R, Kjeldahl K, Smilde AK, Kiers HAL (2008) Cross-validation of component models: a critical look at current methods. Anal Bioanal Chem 390:1241–1251
Cornillon PA, Guyader A, Husson F, Jégou N, Josse J, Kloareg M, Matzner-Løber E, Rouvière L (2012) R for Statistics. Chapman and Hall/CRC, Boca Raton
Escofier B (1979) Traitement simultané de variables quantitatives et qualitatives en analyse factorielle. Les cahiers de l’analyse des données 4(2):137–146
Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester
Greenacre M, Blasius J (2006) Multiple correspondence analysis and related methods. Chapman and Hall/CRC
Josse J, Husson F (2011) Selecting the number of components in PCA using cross-validation approximations. Comput Statist Data Anal 56(6):1869–1879
Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2):1–21
Josse J, Pagès J, Husson F (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la Société Française de Statistique 150:28–51
Josse J, Chavent M, Liquet B, Husson F (2012) Handling missing values with regularized iterative multiple correspondence analysis. J Classif 29:91–116
Kiers HAL (1991) Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika 56:197–212
Lebart L, Morineau A, Warwick KM (1984) Multivariate descriptive statistical analysis. Wiley, New York
Little RJA, Rubin DB (1987, 2002) Statistical analysis with missing data. Wiley series in probability and statistics, New York
Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11:2287–2322
R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, URL http://www.R-project.org/, ISBN 3-900051-07-0
Stekhoven D, Bühlmann P (2011) MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics 28:113–118
Tenenhaus M, Young FW (1985) An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50:91–119
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
van Buuren S (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Statist Method Med Res 16:219–242
van Buuren S, Boshuizen H, Knook D (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18:681–694
van der Heijden P, Escofier B (2003) Multiple correspondence analysis with missing data. In: Analyse des correspondances, Presse universitaire de Rennes, pp 153–170
Vermunt JK, van Ginkel JR, van der Ark LA, Sijtsma K (2008) Multiple imputation of incomplete categorical data using latent class analysis. Sociol Methodol 33:369–397

Metadata
Title: A principal component method to impute missing values for mixed data
Authors: Vincent Audigier, François Husson, Julie Josse
Publication date: 01-03-2016
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification, Issue 1/2016
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-014-0195-1
