Published in: Advances in Data Analysis and Classification 3/2023

01.09.2022 | Regular Article

Clustering with missing data: which equivalent for Rubin’s rules?

Authors: Vincent Audigier, Ndèye Niang



Abstract

Multiple imputation (MI) is a popular method for dealing with missing values. However, the appropriate way to apply clustering after MI remains unclear: how should partitions be pooled, and how should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. Partition pooling is addressed using consensus clustering, while bootstrap theory is used to explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are justified theoretically and studied extensively by simulation. Pooling partitions improves accuracy, while measuring instability with missing data widens the possibilities for data analysis: it allows assessing how much the clustering depends on the imputation model, and it provides a convenient way to choose the number of clusters when data are incomplete, as illustrated on a real data set.
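To make the idea of pooling partitions concrete, here is a minimal sketch of one simple consensus clustering scheme. It assumes that each of the M imputed data sets has already been clustered and that only the resulting label vectors are available. The rule used here (a co-association matrix followed by a majority threshold and connected components) is an illustrative stand-in and not necessarily the consensus method studied in the paper; the function names `co_association` and `consensus_partition` and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Return the n x n matrix whose (i, j) entry is the fraction of
    partitions in which observations i and j fall in the same cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_partition(partitions, threshold=0.5):
    """Pool partitions (illustrative rule, an assumption of this sketch):
    link two observations when they are co-clustered in more than
    `threshold` of the partitions, then take connected components."""
    co = co_association(partitions)
    n = len(co)
    parent = list(range(n))  # union-find forest over observations

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(j)] = find(i)

    # Relabel components 0, 1, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy input: partitions obtained from M = 3 imputed versions of 6 observations.
# Cluster labels are arbitrary, so the second partition uses swapped labels.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
print(consensus_partition(partitions))  # prints [0, 0, 0, 1, 1, 1]
```

Because the co-association matrix only records whether two observations share a cluster, the pooling does not depend on how clusters are numbered in each imputed data set. That is the practical reason partitions cannot simply be averaged the way parameter estimates are under Rubin's rules.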

Metadata
Title
Clustering with missing data: which equivalent for Rubin’s rules?
Authors
Vincent Audigier
Ndèye Niang
Publication date
01.09.2022
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 3/2023
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-022-00519-1
