20.02.2021

How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment

Authors: Svetlana Zhuchkova, Aleksei Rotmistrov

Published in: Quality & Quantity | Issue 1/2022

Abstract

This study compares three approaches to handling missing values in categorical variables: complete case analysis (CCA), multiple imputation (MI, here based on random forest), and the missing-indicator method (MIM). Focusing on OLS regression, we describe how the choice of approach depends on the missingness mechanism, the proportion of missing values, and the model specification. The results of a simulated statistical experiment show that each approach may lead to either almost unbiased or dramatically biased estimates. The choice should be based primarily on the missingness mechanism: CCA under MCAR (missing completely at random), MI under MAR (missing at random), and, again, CCA under MNAR (missing not at random). Although MIM also produces almost unbiased estimates under MCAR and MNAR, it leads to inefficient regression coefficients, that is, coefficients with inflated standard errors and, consequently, incorrect p-values.
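
To make the mechanics of two of these approaches concrete, here is a minimal sketch of CCA and MIM for a single categorical predictor under MCAR. It is not the authors' experiment: the data-generating process, the sample size, the 30% missingness rate, and the helper names `dummies` and `ols` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: a 3-level categorical predictor x and a continuous outcome y.
x = rng.integers(0, 3, size=n)
true_effects = np.array([0.0, 1.0, 2.0])  # level 0 is the reference category
y = 1.0 + true_effects[x] + rng.normal(0.0, 1.0, size=n)

# MCAR (assumed rate): every value of x has the same 30% chance of being missing.
missing = rng.random(n) < 0.3


def dummies(codes, levels):
    """Design matrix: an intercept plus one dummy column per listed level."""
    cols = [np.ones(len(codes))] + [(codes == lv).astype(float) for lv in levels]
    return np.column_stack(cols)


def ols(design, outcome):
    """Return OLS coefficients and the residual variance estimate."""
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    resid = outcome - design @ coef
    sigma2 = resid @ resid / (len(outcome) - design.shape[1])
    return coef, sigma2


# Complete case analysis: drop every row where x is missing.
cca_coef, cca_sigma2 = ols(dummies(x[~missing], [1, 2]), y[~missing])

# Missing-indicator method: keep all rows and treat "missing" as a fourth category (code 3).
x_mim = np.where(missing, 3, x)
mim_coef, mim_sigma2 = ols(dummies(x_mim, [1, 2, 3]), y)

print("CCA coefficients:", cca_coef.round(3), "residual variance:", round(cca_sigma2, 3))
print("MIM coefficients:", mim_coef.round(3), "residual variance:", round(mim_sigma2, 3))
```

With only one categorical predictor, both models are saturated, so the MIM coefficients for the observed levels coincide with the CCA ones. In this toy setup the "missing" category mixes rows from all three true levels, which raises the pooled residual variance and therefore the standard errors; with additional covariates the point estimates can differ as well.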

Footnotes
1
The only exceptions are Choi et al. (2019) and Donders et al. (2006), where the authors use simulated data but limit their analysis to continuous variables.
 
2
The exception is Henry et al. (2013), where the race variable contains missing values; however, the authors use real data and do not control any factors that may affect the results of the comparison.
 
3
All the technical files are available upon request.
 
4
Random forest-based multiple imputation was carried out with the ‘sklearn’ package (specifically its IterativeImputer class) in Python (Pedregosa et al. 2011), which is equivalent to the ‘mice’ package in R.
 
5
θ is the true value of a parameter.
 
6
ANOVA was carried out with the ‘statsmodels’ package (specifically its anova_lm function) in Python (Seabold and Perktold 2010).
 
7
CHAID was carried out with the ‘randan’ package (specifically its CHAIDRegressor class) in Python.
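
Footnote 4 names IterativeImputer as the tool for random forest-based imputation. The sketch below shows one way such a procedure could be set up; it is not the authors' code. The simulated data, the binary coding of the categorical variable, rounding the regressor's predictions back to categories, and varying the seed to obtain several completed datasets are all assumptions. Because RandomForestRegressor does not support posterior sampling in IterativeImputer, varying the seed is only a rough stand-in for proper multiple imputation draws.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300

# Hypothetical data: a binary categorical predictor x (coded 0/1), a covariate z, an outcome y.
z = rng.normal(size=n)
x = (z + rng.normal(scale=0.5, size=n) > 0).astype(float)
y = 1.0 + 2.0 * x + 0.5 * z + rng.normal(size=n)
data = np.column_stack([x, z, y])

# Assumed missingness: 20% of x values removed at random.
mask = rng.random(n) < 0.2
data_missing = data.copy()
data_missing[mask, 0] = np.nan

m = 5  # number of completed datasets (assumption)
completed, coefs = [], []
for seed in range(m):
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=seed),
        max_iter=5,
        random_state=seed,
    )
    filled = imputer.fit_transform(data_missing)
    # The regressor returns values in [0, 1]; round them back to the two categories.
    filled[:, 0] = np.round(filled[:, 0]).clip(0, 1)
    completed.append(filled)

    # Fit the analysis model y ~ 1 + x + z on each completed dataset.
    design = np.column_stack([np.ones(n), filled[:, 0], filled[:, 1]])
    coef, *_ = np.linalg.lstsq(design, filled[:, 2], rcond=None)
    coefs.append(coef)

# Pooled point estimate: the average of the per-dataset coefficients.
pooled = np.mean(coefs, axis=0)
print("pooled coefficients:", pooled.round(3))
```

Averaging the coefficients gives the pooled point estimate used in multiple imputation (Rubin 1987); pooling the standard errors additionally requires the within- and between-imputation variances, which this sketch omits.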
 
References
Allison, P.D.: Imputation of categorical variables with PROC MI. Proc. SAS Users Group Int. Conf. (SUGI) 30, 113–130 (2005)
Dougherty, C.: Introduction to Econometrics. Oxford University Press, Oxford (2016)
Gentle, J.E. (ed.): Handbook of Computational Statistics: Concepts and Methods. Springer, Berlin (2012)
Groenwold, R.H.H., White, I.R., Donders, A.R.T., Carpenter, J.R., Altman, D.G., Moons, K.G.M.: Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Can. Med. Assoc. J. 184, 1265–1269 (2012). https://doi.org/10.1503/cmaj.110977
Maimon, O., Rokach, L. (eds.): Data Mining and Knowledge Discovery Handbook. Springer, New York (2010)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Ratner, B.: Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data. CRC Press, Boca Raton (2017)
Rubin, D.B. (ed.): Multiple imputation for nonresponse in surveys. Wiley, Hoboken (1987)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton (1997)
Seabold, S., Perktold, J.: Statsmodels: econometric and statistical modeling with python. In: Proceedings of the 9th Python in Science Conference (2010)
Strebkov, D., Shevchuk, A., Lukina, A., Melianova, E., Tyulyupo, A.: Social factors of contractor selection on freelance online marketplace: study of contests using big data. J. Econ. Sociol. 20, 25–65 (2019)
Sundararajan, A., Sarwat, A.I.: Evaluation of missing data imputation methods for an enhanced distributed PV generation prediction. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) Proceedings of the Future Technologies Conference (FTC) 2019, pp. 590–609. Springer, Cham (2020)
Van der Heijden, G.J., Donders, A.R.T., Stijnen, T., Moons, K.G.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)
van Kuijk, S.M., Viechtbauer, W., Peeters, L.L., Smits, L.: Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study. Epidemiol., Biostat. Public Health (2016). https://doi.org/10.2427/11598
Vermunt, J.K., Van Ginkel, J.R., Van Der Ark, L.A., Sijtsma, K.: 9 Multiple imputation of incomplete categorical data using latent class analysis. Sociol. Methodol. 38(1), 369–397 (2008)
Zhuchkova, S., Rotmistrov, A.: Handling missing data with CHAID: results of a statistical experiment. Sociology: methodology, methods, mathematical modeling. 46, 85–122 (2018)
Metadata
Title
How to choose an approach to handling missing categorical data: (un)expected findings from a simulated statistical experiment
Authors
Svetlana Zhuchkova
Aleksei Rotmistrov
Publication date
20.02.2021
Publisher
Springer Netherlands
Published in
Quality & Quantity / Issue 1/2022
Print ISSN: 0033-5177
Electronic ISSN: 1573-7845
DOI
https://doi.org/10.1007/s11135-021-01114-w