Published in: Advances in Data Analysis and Classification 4/2023

12-02-2023 | Regular Article

Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components

Authors: Marie du Roy de Chaumaray, Matthieu Marbac

Abstract

We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering but not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood allowing missingness. This optimization is achieved by a Majorization-Minorization algorithm. We illustrate the relevance of our approach by numerical experiments conducted on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data that we illustrate on a real data set. The proposed method is implemented in the R package MNARclust available on CRAN.
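As a rough sketch of the model class described above, the generic form of a mixture assuming independence within components can be written as follows. The notation is assumed here for illustration and is not taken from the article: $K$ is the number of components, $\pi_k$ the mixing proportions, $d$ the number of variables, and $f_{kj}$ the unspecified univariate distribution of variable $j$ in component $k$. The second display states the monotonicity property in generic form, with $\ell$ standing for the objective being maximized and $\theta^{[t]}$ for the iterate at step $t$. How the missingness indicators enter the distribution of the observed data is specified in the article and is not reproduced here.

```latex
% Notation assumed for illustration (K, d, \pi_k, f_{kj} are not taken from the article).
% Generic mixture with independence within components:
\[
  f(x_1,\dots,x_d) \;=\; \sum_{k=1}^{K} \pi_k \prod_{j=1}^{d} f_{kj}(x_j),
  \qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1 .
\]
% Monotonicity of an iterative maximization scheme (generic statement):
\[
  \ell\bigl(\theta^{[t+1]}\bigr) \;\ge\; \ell\bigl(\theta^{[t]}\bigr),
  \qquad t = 0, 1, 2, \dots
\]
```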

Metadata

Title: Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components
Authors: Marie du Roy de Chaumaray, Matthieu Marbac
Publication date: 12-02-2023
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification, Issue 4/2023
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-023-00534-w
