
2023 | OriginalPaper | Book chapter

Dropping Incomplete Records is (not so) Straightforward

Written by: Rianne M. Schouten, Victoria Taşcău, Gabriel G. Ziegler, Davide Casano, Marco Ardizzone, Michael-Angelos Erotokritou

Published in: Advances in Intelligent Data Analysis XXI

Publisher: Springer Nature Switzerland


Abstract

A straightforward approach to handling missing values is dropping incomplete records from the dataset. However, for many forms of missingness, this method is known to affect the center and spread of the data distribution. In this paper, we perform an extensive empirical evaluation of the effect of the drop method on the data distribution. In particular, we analyze two scenarios that are likely to occur in practice but are not often considered in simulation studies: 1) when features are skewed rather than symmetrically distributed and 2) when multiple forms of missingness occur simultaneously in one feature. Furthermore, we investigate implications of the drop method for classification accuracy and demonstrate that dropping incomplete records is doubtful, even when test cases are dropped as well.
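To make the abstract's claim about center and spread concrete, the following is a minimal sketch, not the authors' experimental setup. It assumes a log-normally distributed (skewed) feature `x`, a second feature `y` that receives the missing values, and a hypothetical mechanism in which `y` is more likely to be missing when `x` lies above its median. The function name `simulate_drop`, the missingness probabilities 0.8 and 0.2, and the sample size are all illustrative assumptions.

```python
import random
import statistics


def simulate_drop(n=10_000, seed=0):
    """Return (all x values, x values of the records that survive dropping)."""
    rng = random.Random(seed)
    # Skewed (log-normal) feature x, plus a second feature y that will get missing values.
    x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
    y = [xi + rng.gauss(0.0, 1.0) for xi in x]
    cutoff = statistics.median(x)

    kept_x = []
    for xi, yi in zip(x, y):
        # Assumed mechanism: y is missing with probability 0.8 when x is above
        # its median and with probability 0.2 otherwise.
        p_missing = 0.8 if xi > cutoff else 0.2
        y_observed = None if rng.random() < p_missing else yi
        # The drop method keeps a record only if every feature is observed.
        if y_observed is not None:
            kept_x.append(xi)
    return x, kept_x


full_x, kept_x = simulate_drop()
summary = {
    "n_full": len(full_x),
    "n_kept": len(kept_x),
    "mean_full": statistics.fmean(full_x),
    "mean_kept": statistics.fmean(kept_x),
    "sd_full": statistics.stdev(full_x),
    "sd_kept": statistics.stdev(kept_x),
}
print(summary)
```

Under these assumptions, records with large `x` are dropped more often, so the mean of `x` among the remaining records is pulled below the mean of the full data, even though `x` itself never had a missing value. Changing the probabilities or the shape of the distribution changes the size of the shift.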


Footnotes
2
N.B.: in the general case, this may affect training and test distribution, but it is unclear how. Homogeneity might increase, but the data might also become more scattered and hence variance might increase. Since the distribution can be affected in a wide variety of possible ways, we will simply ignore this effect; note that technically this might affect the definition of accuracy.
 
Metadata
Title
Dropping Incomplete Records is (not so) Straightforward
Written by
Rianne M. Schouten
Victoria Taşcău
Gabriel G. Ziegler
Davide Casano
Marco Ardizzone
Michael-Angelos Erotokritou
Copyright year
2023
DOI
https://doi.org/10.1007/978-3-031-30047-9_30
