
Neural Networks

Volume 79, July 2016, Pages 20-36

Robust mixture of experts modeling using the t distribution

https://doi.org/10.1016/j.neunet.2016.03.002

Abstract

Mixture of Experts (MoE) is a popular framework for modeling heterogeneity in data for regression, classification, and clustering. For regression and cluster analyses of continuous data, MoE usually uses normal experts following the Gaussian distribution. However, for a set of data containing a group or groups of observations with heavy tails or atypical observations, the use of normal experts is unsuitable and can unduly affect the fit of the MoE model. We introduce a robust MoE model based on the t distribution. The proposed t MoE (TMoE) handles these issues of heavy-tailed and noisy data. We develop a dedicated expectation–maximization (EM) algorithm that estimates the parameters of the proposed model by monotonically maximizing the observed-data log-likelihood. We describe how the presented model can be used for prediction and for model-based clustering of regression data. The proposed model is validated in numerical experiments on simulated data, which show its effectiveness and robustness both in modeling non-linear regression functions and in model-based clustering. It is then applied to two real-world problems: tone perception data for musical data analysis, and temperature anomalies for the analysis of climate change data. The obtained results show the usefulness of the TMoE model for practical applications.

Introduction

Mixture of experts (MoE), introduced by Jacobs, Jordan, Nowlan, and Hinton (1991), is widely studied in statistics and machine learning. It is a fully conditional mixture model in which both the mixing proportions, known as the gating functions, and the component densities, known as the experts, are conditioned on input covariates. MoE has been investigated in its simple form as well as in its hierarchical form (Jordan & Jacobs, 1994) (see, e.g., Section 5.12 of McLachlan & Peel, 2000) for regression and for model-based cluster and discriminant analyses, and in various application domains. A complete review of MoE models can be found in Yuksel, Wilson, and Gader (2012). For continuous data, which we consider here in the context of non-linear regression and model-based cluster analysis, MoE usually uses normal experts, that is, expert components following the Gaussian distribution. Throughout this paper, we refer to this model as the normal mixture of experts, abbreviated NMoE. It is well known that the normal distribution is sensitive to outliers, which makes NMoE unsuitable for noisy data. Moreover, for a set of data containing a group or groups of observations with heavy tails, the use of normal experts may be unsuitable and can unduly affect the fit of the MoE model. In this paper, we address these limitations by proposing a more suitable and robust MoE model that can deal with heavy-tailed and atypical data.

The sensitivity of NMoE to outliers was recently considered by Nguyen and McLachlan (2016), where the authors proposed a Laplace mixture of linear experts (LMoLE) for robust modeling of non-linear regression data. Their model parameters are estimated by maximizing the observed-data likelihood via a minorization–maximization (MM) algorithm. Here, we propose an alternative MoE model relying on the t distribution. We call the proposed model the t mixture of experts, abbreviated TMoE. The t distribution indeed provides a natural, robust extension of the normal distribution for modeling data with possible outliers and heavier tails than the normal distribution. It was used to develop the t mixture model proposed by McLachlan and Peel (1998) for robust cluster analysis of multivariate data. We also mention that Lin, Lee, and Hsieh (2007) proposed a mixture of skew t distributions to deal with heavy-tailed and asymmetric distributions. However, in the skew-t mixture model of Lin et al. (2007), the mixing proportions and the component means are constant, that is, they do not depend on predictors. In the proposed TMoE, by contrast, we consider t expert components in which both the mixing proportions and the mixture component means are predictor-dependent. More specifically, we use polynomial regressors for the components and multinomial logistic regressors for the mixing proportions. In the framework of regression analysis, Bai, Yao, and Boyer (2012) and Ingrassia, Minotti, and Vittadini (2012) recently proposed robust mixture modeling of regression on univariate data using a univariate t-mixture model. For the general multivariate case using t mixtures, one can refer, for example, to the two key papers of McLachlan and Peel (1998) and Peel and McLachlan (2000). Inference in the previously described approaches is performed by maximum likelihood estimation via the expectation–maximization (EM) algorithm or its extensions (Dempster et al., 1977; McLachlan & Krishnan, 2008), in particular the expectation conditional maximization (ECM) algorithm (Meng & Rubin, 1993). In the Bayesian framework, Frühwirth-Schnatter and Pyne (2010) considered Bayesian inference for both univariate and multivariate skew-normal and skew-t mixtures. In the regression context, robust modeling of regression data has been studied notably by Ingrassia et al. (2012) and Wei (2012), who considered a t-mixture model for regression analysis of univariate data, as well as by Bai et al. (2012), who relied on M-estimates in mixtures of linear regressions. In the same context, Song, Yao, and Xing (2014) proposed the mixture of Laplace regressions, which was then extended by Nguyen and McLachlan (2016) to the mixture of experts setting through the Laplace mixture of linear experts (LMoLE). However, unlike our proposed TMoE model, the regression mixture models of Bai et al. (2012), Ingrassia et al. (2012), Song et al. (2014) and Wei (2012) do not consider conditional mixing proportions, that is, mixing proportions depending on input variables, as in the mixture of experts framework we investigate here.

Here we consider the MoE framework for non-linear regression problems and model-based clustering of regression data, and we attempt to overcome the limitations of the NMoE model in dealing with heavy-tailed data that may contain outliers. We investigate the use of the t distribution for the experts, rather than the commonly used normal distribution. The t mixture of experts model (TMoE) notably handles the sensitivity of the NMoE to outliers. This model extends the unconditional mixture of t distributions (McLachlan & Peel, 1998; Wei, 2012) to the mixture of experts (MoE) framework, where the mixture means are regression functions and the mixing proportions are covariate-varying. For model inference, we develop a dedicated expectation–maximization (EM) algorithm that estimates the parameters of the proposed model by monotonically maximizing the observed-data log-likelihood. The EM algorithm is a popular and successful estimation algorithm for mixture models in general and for mixture of experts in particular: Ng and McLachlan (2004) showed that the EM algorithm (with IRLS in this case) monotonically maximizes the MoE likelihood, with stable convergence, when a learning rate smaller than one is adopted for the IRLS procedure within the M-step. They further proposed an expectation conditional maximization (ECM) algorithm to train MoE, which also has desirable numerical properties. Beyond the frequentist framework we consider here, MoE has also been studied in the Bayesian framework; one can cite, for example, the Bayesian MoE of Waterhouse (1997) and Waterhouse, Mackay, and Robinson (1996), and the Bayesian hierarchical MoE of Bishop and Svensén (2003). MoE models have also been investigated within the Bayesian non-parametric framework; we cite, for example, the Bayesian non-parametric MoE model of Rasmussen and Ghahramani (2001) and the Bayesian non-parametric hierarchical MoE approach of Shi et al. (2005) using Gaussian process experts for regression. For further models on mixture of experts for regression, the reader may refer, for example, to the book of Shi and Choi (2011). In this paper, we investigate semi-parametric models under the maximum likelihood estimation framework.

The remainder of this paper is organized as follows. In Section 2 we briefly recall the MoE framework, particularly the NMoE model and its maximum-likelihood estimation via EM. Then, in Section 3 we present the TMoE model, and in Section 4 we derive its parameter estimation technique using the EM algorithm. Next, in Section 5 we investigate the use of the proposed model for fitting non-linear regression functions as well as for prediction. We also show in Section 6 how the model can be used in a model-based clustering perspective. In Section 7, we discuss the model selection problem. In Section 8, we perform experiments to assess the proposed model. Finally, Section 9 is dedicated to conclusions and future work.

Section snippets

Mixture of experts for continuous data

Mixture of experts (Jacobs et al., 1991; Jordan & Jacobs, 1994) is used in a variety of contexts including regression, classification and clustering. Here we consider the MoE framework for fitting (non-linear) regression functions and clustering of univariate continuous data. The aim of regression is to explore the relationship of an observed random variable $Y$ given a covariate vector $\boldsymbol{X} \in \mathbb{R}^p$ via conditional density functions for $Y|\boldsymbol{X}=\boldsymbol{x}$ of the form $f(y|\boldsymbol{x})$, rather than only exploring the
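For concreteness, the NMoE conditional density underlying this framework can be written out; the display below is a reconstruction using the same notation as the TMoE likelihood of Section 4 (with normal densities in place of t densities), not a verbatim equation from the paper:

```latex
% NMoE conditional density with multinomial logistic gating (reconstruction):
\[
  f(y_i \mid \boldsymbol{x}_i, \boldsymbol{r}_i; \Psi)
  = \sum_{k=1}^{K} \pi_k(\boldsymbol{r}_i; \boldsymbol{\alpha})\,
    \mathcal{N}\!\big(y_i;\, \mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k),\, \sigma_k^2\big),
  \qquad
  \pi_k(\boldsymbol{r}_i; \boldsymbol{\alpha})
  = \frac{\exp(\boldsymbol{\alpha}_k^{\top}\boldsymbol{r}_i)}
         {\sum_{l=1}^{K}\exp(\boldsymbol{\alpha}_l^{\top}\boldsymbol{r}_i)}.
\]
```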

The t MoE (TMoE) model

The proposed t MoE (TMoE) model is based on the t distribution, which is known as a robust generalization of the normal distribution. The t distribution is recalled in the following section. We also describe its stochastic and hierarchical representations, which will be used to derive those of the proposed TMoE model.
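For reference, the t density and its standard hierarchical (normal–gamma) representation, on which this section builds, take the following textbook form (as used, e.g., in Liu & Rubin, 1995; McLachlan & Peel, 1998):

```latex
\[
  t(y;\mu,\sigma^2,\nu)
  = \frac{\Gamma\!\big(\tfrac{\nu+1}{2}\big)}
         {\Gamma\!\big(\tfrac{\nu}{2}\big)\sqrt{\pi\nu\sigma^2}}
    \left(1+\frac{(y-\mu)^2}{\nu\sigma^2}\right)^{-\frac{\nu+1}{2}},
  \qquad
  y \mid w \sim \mathcal{N}\!\big(\mu,\, \sigma^2/w\big),\quad
  w \sim \mathrm{Gamma}\!\big(\tfrac{\nu}{2},\tfrac{\nu}{2}\big).
\]
```

The latent precision factor $w$ downweights observations with large residuals, which is the source of the robustness of t-based models.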

Maximum likelihood estimation of the TMoE model

Given an i.i.d. sample of $n$ observations, the unknown parameter vector $\Psi$ can be estimated by maximizing the observed-data log-likelihood, which, under the TMoE model, is given by:
$$\log L(\Psi) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k(\boldsymbol{r}_i; \boldsymbol{\alpha})\, t\big(y_i;\, \mu(\boldsymbol{x}_i; \boldsymbol{\beta}_k),\, \sigma_k^2,\, \nu_k\big).$$
To perform this maximization, we first use the EM algorithm and then describe an extension based on the ECM algorithm (Meng & Rubin, 1993), as in Liu and Rubin (1995) for a single t distribution, and as in McLachlan and Peel (1998) and Peel and McLachlan (2000) for
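As an illustration, this log-likelihood can be evaluated in a numerically stable log-sum-exp form. The sketch below is not the paper's code: the function name, the array shapes, and the use of shared design matrices for the experts and the gating network are our own assumptions.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def tmoe_loglik(y, X, R, alpha, beta, sigma2, nu):
    """Observed-data log-likelihood of a TMoE model (sketch).

    y: (n,) responses; X: (n, p+1) polynomial design matrix for the experts;
    R: (n, q+1) design matrix for the logistic gating network;
    alpha: (q+1, K) gating coefficients; beta: (p+1, K) expert coefficients;
    sigma2, nu: (K,) scale and degrees-of-freedom parameters.
    """
    # Gating network: log pi_k(r_i; alpha) via multinomial logistic weights.
    scores = R @ alpha
    log_pi = scores - logsumexp(scores, axis=1, keepdims=True)   # (n, K)
    # Expert means mu(x_i; beta_k) and squared standardized residuals.
    mu = X @ beta                                                # (n, K)
    d2 = (y[:, None] - mu) ** 2 / sigma2                         # (n, K)
    # Log t density with nu_k degrees of freedom and scale sigma_k^2.
    log_t = (gammaln((nu + 1) / 2) - gammaln(nu / 2)
             - 0.5 * np.log(np.pi * nu * sigma2)
             - 0.5 * (nu + 1) * np.log1p(d2 / nu))
    # log L = sum_i log sum_k pi_ik * t_ik, via log-sum-exp over components.
    return logsumexp(log_pi + log_t, axis=1).sum()
```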

Prediction using the TMoE

The goal in regression is to predict the response variable(s) given new values of the predictor variable(s), on the basis of a model trained on a set of training data. In regression analysis using MoE, the aim is therefore to predict the response $y$ given new values of the predictors $(\boldsymbol{x}, \boldsymbol{r})$, on the basis of a MoE model characterized by a parameter vector $\hat{\Psi}$ inferred from a set of training data, here by maximum likelihood via EM. These predictions can be expressed in
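Although the snippet is truncated, a standard point predictor under a fitted MoE is the conditional mean; assuming each estimated $\hat{\nu}_k > 1$ (so that each t expert has a finite mean), it takes the form:

```latex
\[
  \hat{y} = \mathbb{E}\big[Y \mid \boldsymbol{x}, \boldsymbol{r}; \hat{\Psi}\big]
  = \sum_{k=1}^{K} \pi_k(\boldsymbol{r}; \hat{\boldsymbol{\alpha}})\,
    \mu(\boldsymbol{x}; \hat{\boldsymbol{\beta}}_k).
\]
```

Since a t distribution with $\nu_k > 1$ has mean $\mu$, the mixture mean reduces to the gating-weighted sum of the expert regression functions, exactly as in the NMoE.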

Model-based clustering using the TMoE

It is natural to utilize the MoE models from a model-based clustering perspective to provide a partition of the regression data into $K$ clusters. Model-based clustering using the TMoE, as for MoE in general, consists in assuming that the observed data $\{\boldsymbol{x}_i, \boldsymbol{r}_i, y_i\}_{i=1}^{n}$ are generated from a $K$-component mixture of t experts with parameter vector $\Psi$. The mixture components can be interpreted as clusters and hence each cluster can be associated with a mixture component. The problem of clustering therefore
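Concretely, the partition is obtained from the posterior component membership probabilities via the maximum a posteriori (MAP) rule; written in the paper's notation, its standard form is:

```latex
\[
  \tau_{ik} =
  \frac{\pi_k(\boldsymbol{r}_i;\hat{\boldsymbol{\alpha}})\,
        t\big(y_i;\,\mu(\boldsymbol{x}_i;\hat{\boldsymbol{\beta}}_k),\,\hat{\sigma}_k^2,\,\hat{\nu}_k\big)}
       {\sum_{l=1}^{K}\pi_l(\boldsymbol{r}_i;\hat{\boldsymbol{\alpha}})\,
        t\big(y_i;\,\mu(\boldsymbol{x}_i;\hat{\boldsymbol{\beta}}_l),\,\hat{\sigma}_l^2,\,\hat{\nu}_l\big)},
  \qquad
  \hat{z}_i = \arg\max_{k\in\{1,\dots,K\}} \tau_{ik}.
\]
```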

Model selection for the TMoE

One of the issues in mixture model-based clustering is model selection. The problem of model selection for the TMoE model presented here, in its general form, is equivalent to choosing the optimal number of experts $K$, the degree $p$ of the polynomial regression, and the degree $q$ of the logistic regression. The optimal value of $(K,p,q)$ can be computed by using model selection criteria such as the Akaike Information Criterion (AIC) (Akaike, 1974), the Bayesian Information Criterion
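For concreteness, with this parameterization the BIC takes the usual penalized log-likelihood form. The parameter count below is our own bookkeeping, not taken from the paper: each expert contributes $p+1$ regression coefficients plus $\sigma_k^2$ and $\nu_k$, and the gating network contributes $(K-1)(q+1)$ coefficients.

```latex
\[
  \mathrm{BIC}(K,p,q) = \log L(\hat{\Psi}) - \frac{\eta_{\Psi}}{2}\,\log n,
  \qquad
  \eta_{\Psi} = K(p+3) + (K-1)(q+1).
\]
```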

Experimental study

This section is dedicated to the evaluation of the proposed approach on simulated and real-world data. We evaluated the performance of the proposed EM algorithm by comparing the TMoE with the standard normal MoE (NMoE) model (Jacobs et al., 1991; Jordan & Jacobs, 1994) and the Laplace MoE of Nguyen and McLachlan (2016) on both simulated and real-world data sets.

Conclusion and future work

In this paper, we proposed a new robust non-normal MoE model, named TMoE, which is based on the t distribution and generalizes the standard normal MoE. The TMoE model is suggested for data with possible outliers and heavy tails. We developed an EM algorithm and an ECM extension to infer the proposed model, and described its use in non-linear regression and prediction as well as in model-based clustering. The developed model is successfully applied and validated on simulated and real data sets.


References (49)

  • R.P. Brent
  • E.A. Cohen, Some effects of inharmonic partials on interval perception, Music Perception (1984)
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (1977)
  • S. Faria et al., Fitting mixtures of linear regressions, Journal of Statistical Computation and Simulation (2010)
  • S. Frühwirth-Schnatter
  • S. Frühwirth-Schnatter et al., Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Biostatistics (2010)
  • S. Gaffney et al., Trajectory clustering with mixtures of regression models
  • P. Green, Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives, Journal of the Royal Statistical Society: Series B (1984)
  • J. Hansen et al., GISS analysis of surface temperature change, Journal of Geophysical Research (1999)
  • J. Hansen et al., A closer look at United States and global surface temperature change, Journal of Geophysical Research (2001)
  • D. Hunter et al., Semiparametric mixtures of regressions, Journal of Nonparametric Statistics (2012)
  • S. Ingrassia et al., Local statistical modeling via a cluster-weighted approach with elliptical distributions, Journal of Classification (2012)
  • R.A. Jacobs et al., Adaptive mixtures of local experts, Neural Computation (1991)
  • W. Jiang et al., On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models, IEEE Transactions on Information Theory (1999)

Faicel Chamroukhi received his Master degree in Engineering Sciences, in the area of signals, images and robotics, from Pierre & Marie Curie (Paris 6) University in 2007. He then received his Ph.D. degree in applied mathematics and computer science, in the area of statistical learning and data analysis, from Compiègne University of Technology in 2010. In 2011, he was qualified for the position of Associate Professor in applied mathematics (CNU 26), computer science (CNU 27), and signal processing (CNU 61). Since September 2011, he has been an Associate Professor at the University of Toulon and the Information Sciences and Systems Lab (LSIS) UMR CNRS 7296. In 2015, he received his Accreditation to Supervise Research (HDR) in applied mathematics and computer science, in the area of statistical learning and data analysis, from Toulon University. Since 2016 he has been qualified for the position of Professor in the three areas of applied mathematics, computer science, and signal processing (CNU 26, 27, 61). In 2015, he was awarded a CNRS research leave, and since September he has been at the Laboratory of Mathematics Paul Painlevé (LPP), probability and statistics team, in Lille, where he is also an invited member of the INRIA Modal team. His multidisciplinary research is in the area of Data Science and includes statistics, machine learning and statistical signal processing, with a particular focus on the statistical methodology and inference of latent data models for complex heterogeneous high-dimensional and massive data, temporal data, and functional data, and their application to real-world problems including dynamical systems, acoustic/speech processing, life sciences (medicine, biology), information retrieval, and social networks.
