Advances in Data Analysis and Classification

12-02-2023 | Regular Article

Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components

Authors: Marie du Roy de Chaumaray, Matthieu Marbac

Published in: Advances in Data Analysis and Classification | Issue 4/2023

Abstract

We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering but not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood allowing missingness. This optimization is achieved by a Majorization-Minorization algorithm. We illustrate the relevance of our approach by numerical experiments conducted on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data that we illustrate on a real data set. The proposed method is implemented in the R package MNARclust available on CRAN.
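
As a reading aid, the conditional-independence structure described in the abstract can be written in the generic form below. This is a minimal sketch of a standard finite mixture with independent coordinates within each component; the symbols K, d, \pi_k and f_{kj} are notation assumed here for illustration and do not appear in the abstract. The missingness mechanism is deliberately left out, since the abstract does not specify its form.

```latex
% Notation assumed for illustration only (not taken from the abstract):
%   K      : number of mixture components (clusters)
%   d      : number of variables
%   \pi_k  : mixing proportion of component k
%   f_{kj} : univariate density of variable j in component k, left unspecified
f(x_1, \dots, x_d) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{d} f_{kj}(x_j),
\qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1 .
```

The product over j encodes independence within components, and leaving each f_{kj} unspecified is what makes the model semi-parametric.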

Title: Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components
Authors: Marie du Roy de Chaumaray, Matthieu Marbac
Publication date: 12-02-2023
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification / Issue 4/2023
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-023-00534-w
