Published in: Advances in Data Analysis and Classification 1/2022

22 January 2022 | Regular Article

Model-based clustering and outlier detection with missing data

Authors: Hung Tong, Cristina Tortora

Abstract

The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e. the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e. data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization step is replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model with those of mixtures of multivariate normal and Student’s t distributions for incomplete data. The simulation also includes a study of the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI Machine Learning Repository). The results show that, while the Student’s t distribution gives similar classification performance, the MCN is better at detecting outliers, with a lower false positive rate. The performance of all the techniques decreases linearly as the percentage of missing values increases.
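
To make the outlier-detection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the usual MCN parameterization, in which a component is a two-part normal mixture with a shared mean: weight alpha on covariance Sigma (typical points) and weight 1 - alpha on the inflated covariance eta * Sigma with eta > 1 (outliers). It also assumes that a missing entry is handled by evaluating the density on the observed coordinates only. The function names, the use of None to mark a missing value, and the 0.5 cut-off are illustrative assumptions; the full ECM estimation of the parameters is not shown.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def mvn_logpdf(x, mu, sigma):
    """Log-density of a multivariate normal, computed from the Cholesky factor."""
    n = len(x)
    low = cholesky(sigma)
    # Forward substitution solves low @ z = x - mu; sum(z^2) is the
    # squared Mahalanobis distance of x from mu.
    z = []
    for i in range(n):
        z.append((x[i] - mu[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

def prob_good(x, mu, sigma, alpha, eta):
    """Posterior probability that x is a typical point of one MCN component.

    Entries of x equal to None are treated as missing (an assumption of this
    sketch): only the observed coordinates enter the density, using the
    matching sub-vector of mu and sub-matrix of sigma.
    """
    obs = [i for i, v in enumerate(x) if v is not None]
    if not obs:
        return alpha  # nothing observed: the posterior equals the prior weight
    x_o = [x[i] for i in obs]
    mu_o = [mu[i] for i in obs]
    sig_o = [[sigma[i][j] for j in obs] for i in obs]
    inflated = [[eta * s for s in row] for row in sig_o]
    log_good = math.log(alpha) + mvn_logpdf(x_o, mu_o, sig_o)
    log_bad = math.log(1.0 - alpha) + mvn_logpdf(x_o, mu_o, inflated)
    diff = log_bad - log_good
    if diff > 700.0:  # avoid overflow in exp for extreme outliers
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# Hypothetical parameters for one two-dimensional component.
mu = [0.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]
alpha, eta = 0.9, 10.0

for point in ([0.0, 0.0], [5.0, 5.0], [5.0, None]):
    p = prob_good(point, mu, sigma, alpha, eta)
    print(point, "outlier" if p < 0.5 else "typical")
```

In this sketch a point is flagged as an outlier when its posterior probability of being typical falls below 0.5, so the flag comes from the fitted model rather than from a separate threshold on distances. A partially observed point can still be flagged from the coordinates that are available.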

Metadata
Title
Model-based clustering and outlier detection with missing data
Authors
Hung Tong
Cristina Tortora
Publication date
22 January 2022
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 1/2022
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-021-00476-1
