
Advances in Data Analysis and Classification


1 September 2022 | Regular Article

Clustering with missing data: which equivalent for Rubin’s rules?

By: Vincent Audigier, Ndèye Niang

Published in: Advances in Data Analysis and Classification | Issue 3/2023


Abstract

Multiple imputation (MI) is a popular method for dealing with missing values. However, how best to apply clustering after MI remains unclear: how should partitions be pooled, and how should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. Partition pooling is addressed using consensus clustering, while bootstrap theory is used to explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and assessing instability are argued theoretically and studied extensively by simulation. Pooling partitions improves accuracy, while measuring instability with missing data broadens the possibilities for data analysis: it allows the dependence of the clustering on the imputation model to be assessed, and it offers a convenient way of choosing the number of clusters when data are incomplete, as illustrated on a real data set.
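To make the idea of pooling partitions concrete, the following minimal Python sketch combines the partitions obtained from several imputed data sets into a single consensus partition. It is only an illustration under stated assumptions: the abstract does not specify the consensus algorithm, so the sketch swaps in a simple co-association approach (link two observations when they share a cluster in more than half of the partitions, then take connected components). The function names, the 0.5 threshold, the toy labels, and the Rand-index-based disagreement measure are all hypothetical and are not the pooling or instability rules derived in the paper.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def pool_partitions(partitions, threshold=0.5):
    """Consensus partition (illustrative stand-in, not the paper's rule):
    link items co-clustered in more than `threshold` of the partitions,
    then label the connected components."""
    co = co_association(partitions)
    n = len(co)
    consensus = [-1] * n
    next_label = 0
    for start in range(n):
        if consensus[start] != -1:
            continue
        consensus[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if consensus[j] == -1 and co[i][j] > threshold:
                    consensus[j] = next_label
                    stack.append(j)
        next_label += 1
    return consensus


def rand_index(a, b):
    """Share of item pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def between_imputation_disagreement(partitions):
    """One minus the mean pairwise Rand index across partitions
    (a crude, assumed proxy for instability due to missing data)."""
    scores = [rand_index(p, q) for p, q in combinations(partitions, 2)]
    return 1.0 - sum(scores) / len(scores)


# Hypothetical labels from clustering M = 3 imputed versions of 6 observations.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, labels switched
    [0, 0, 1, 1, 1, 1],  # observation 2 moves to the other cluster
]
consensus = pool_partitions(partitions)
disagreement = between_imputation_disagreement(partitions)
print(consensus, round(disagreement, 3))
```

Working on co-clustering of pairs rather than on raw labels avoids the label-switching problem: the second partition uses swapped labels but contributes exactly the same pairwise information as the first.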

Title: Clustering with missing data: which equivalent for Rubin’s rules?
Authors: Vincent Audigier, Ndèye Niang
Publication date: 1 September 2022
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification / Issue 3/2023
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-022-00519-1
