Advances in Data Analysis and Classification

22-01-2022 | Regular Article

Model-based clustering and outlier detection with missing data

Authors: Hung Tong, Cristina Tortora

Published in: Advances in Data Analysis and Classification | Issue 1/2022

Abstract

The multivariate contaminated normal (MCN) distribution is recommended in model-based clustering for data characterized by mild outliers: the model can simultaneously detect outliers automatically and produce robust parameter estimates in each cluster. However, one limitation of this approach is that it requires complete data, i.e. the MCN cannot be used directly on data with missing values. In this paper, we develop a framework for fitting a mixture of MCN distributions to incomplete data sets, i.e. data sets with some values missing at random. Parameters are estimated using the expectation-conditional maximization algorithm, a variant of the expectation-maximization algorithm in which the traditional maximization steps are replaced by simpler conditional maximization steps. We perform a simulation study to compare the results of our model to mixtures of multivariate normal and Student’s t distributions for incomplete data. The simulation also includes a study of the effect of the percentage of missing data on the performance of the three algorithms. The model is then applied to the Automobile data set (UCI machine learning repository). The results show that, while the Student’s t distribution gives similar classification performance, the MCN detects outliers better, with a lower false positive rate. The performance of all the techniques decreases linearly as the percentage of missing values increases.
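To make the outlier-detection idea in the abstract concrete, here is a minimal Python sketch for a single cluster. It is not the authors' implementation and does not fit a mixture. It assumes the usual contaminated normal form, in which a "good" normal component with weight `alpha` is mixed with a "bad" component that shares the mean but has its covariance inflated by `eta > 1`. It also assumes that, under missing at random, an incomplete observation is scored on its observed coordinates only. The function names, the parameter names `alpha` and `eta`, and the 0.5 cut-off are illustrative assumptions, and each observation is assumed to have at least one observed value.

```python
import numpy as np


def normal_logpdf(x, mean, cov):
    """Log-density of a multivariate normal, computed via a Cholesky factor."""
    d = x.shape[0]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, x - mean)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + z @ z)


def mcn_good_probability(x, mean, cov, alpha, eta):
    """Posterior probability that x is a 'good' (non-outlying) point.

    Missing entries of x are encoded as NaN; only the observed
    coordinates enter the density (marginal of the normal components).
    alpha and eta are hypothetical names for the proportion of good
    points and the covariance inflation factor.
    """
    obs = ~np.isnan(x)
    x_o, mean_o = x[obs], mean[obs]
    cov_o = cov[np.ix_(obs, obs)]
    log_good = np.log(alpha) + normal_logpdf(x_o, mean_o, cov_o)
    log_bad = np.log1p(-alpha) + normal_logpdf(x_o, mean_o, eta * cov_o)
    # Normalise on the log scale to avoid underflow for extreme points.
    return float(np.exp(log_good - np.logaddexp(log_good, log_bad)))


mean, cov = np.zeros(2), np.eye(2)
inlier = np.array([0.0, 0.0])
suspect = np.array([5.0, np.nan])  # second coordinate is missing

for point in (inlier, suspect):
    p = mcn_good_probability(point, mean, cov, alpha=0.9, eta=10.0)
    print(point, "flagged as outlier" if p < 0.5 else "kept as good point")
```

In a full mixture fit, a quantity of this kind would be computed per cluster inside the E-step, alongside the cluster membership probabilities; the sketch isolates only the good-versus-bad decision.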

https://cran.r-project.org/package=MixtureMissing.

https://archive.ics.uci.edu/ml/datasets/automobile.

Title: Model-based clustering and outlier detection with missing data
Authors: Hung Tong, Cristina Tortora
Publication date: 22-01-2022
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification / Issue 1/2022
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-021-00476-1
