Consider a set of learning observations \(\{\mathbf {x}_s, \bar{{\varvec{\ell }}}_{s}\}\), where \(\mathbf {x}_s\) is an observation of a vector of random variables and \(\bar{{\varvec{\ell }}}_{s}\) is the associated class label, such that \({\bar{\ell }}_{sc} = 1\) if observation \(s\) belongs to class \(c\) and 0 otherwise, for \(c = 1,\ldots ,C\). The aim of supervised classification is to build a classifier from the complete learning data \(\lbrace \mathbf {x}_s, \bar{{\varvec{\ell }}}_s \rbrace \) and use it to assign a new observation to one of the known classes.
Model-based discriminant analysis (MDA, Bouveyron et al. 2019; McLachlan 2012, 2004; Fraley and Raftery 2002) is a probabilistic approach for supervised classification of continuous data, in which the data generating process is represented as follows:
$$\begin{aligned} \bar{{\varvec{\ell }}}_{s}&\sim \prod _{c=1}^C \tau _c^{\,{\bar{\ell }}_{sc}},\\ (\mathbf {x}_s \,|\,{\bar{\ell }}_{sc} = 1)&\sim {\mathcal {N}}( {\varvec{\mu }}_c, {\varvec{\Sigma }}_c ), \end{aligned}$$
(1)
where \(\tau _c\) denotes the probability of observing class \(c\), with \(\sum _c \tau _c = 1\). Consequently, the marginal density of each data point corresponds to the density of a Gaussian mixture distribution:
$$\begin{aligned} f(\mathbf {x}_s \,; {\varvec{\Theta }} ) = \sum _{c=1}^C \tau _c \, \phi ( \mathbf {x}_s \,;\, {\varvec{\mu }}_c, {\varvec{\Sigma }}_c), \end{aligned}$$
where \(\phi ( \cdot \,;\, {\varvec{\mu }}_c, {\varvec{\Sigma }}_c)\) is the multivariate Gaussian density with mean \({\varvec{\mu }}_c\) and covariance matrix \({\varvec{\Sigma }}_c\), and \({\varvec{\Theta }}\) is the collection of all mixture parameters.
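To make the generative process in (1) concrete, the following minimal sketch simulates a labelled sample from the model; the parameter values, the choice of three classes in two dimensions, and the use of NumPy are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative parameters for C = 3 classes in p = 2 dimensions (assumed values).
tau = np.array([0.5, 0.3, 0.2])                        # class probabilities, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])   # class means
Sigma = np.stack([np.eye(2),
                  [[1.0, 0.5], [0.5, 1.0]],
                  [[2.0, 0.0], [0.0, 0.5]]])           # class covariance matrices

S = 500  # number of learning observations

# First line of (1): draw the class label of each observation.
labels = rng.choice(len(tau), size=S, p=tau)

# Second line of (1): draw each observation from the Gaussian of its class.
x = np.stack([rng.multivariate_normal(mu[c], Sigma[c]) for c in labels])
```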
Then, using the maximum a posteriori (MAP) rule, a new observation \(\mathbf {y}_i\) is assigned to the class \(c\) with the highest posterior probability:
$$\begin{aligned} \Pr (\ell _{ic} = 1 \,|\,\mathbf {y}_i) = \dfrac{ \tau _c \, \phi ( \mathbf {y}_i \,;\, {\varvec{\mu }}_c, {\varvec{\Sigma }}_c ) }{ \sum _{h=1}^C \tau _h \, \phi ( \mathbf {y}_i \,;\, {\varvec{\mu }}_h, {\varvec{\Sigma }}_h) }. \end{aligned}$$
(2)
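A minimal sketch of the corresponding classifier follows: the class proportions, means, and covariances are estimated from a labelled sample by their standard plug-in estimates, and the posterior of (2) is then evaluated with SciPy. The function names are invented for illustration and do not come from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_mda(x, labels, C):
    """Plug-in (maximum likelihood) estimates of tau_c, mu_c, Sigma_c."""
    tau = np.array([np.mean(labels == c) for c in range(C)])
    mu = np.array([x[labels == c].mean(axis=0) for c in range(C)])
    Sigma = np.array([np.cov(x[labels == c], rowvar=False) for c in range(C)])
    return tau, mu, Sigma

def map_classify(y, tau, mu, Sigma):
    """Evaluate the posterior probabilities of Eq. (2) for the rows of y
    and return them together with the MAP class assignments."""
    dens = np.column_stack([
        multivariate_normal.pdf(y, mean=mu[c], cov=Sigma[c])
        for c in range(len(tau))
    ])
    post = tau * dens                         # numerator of Eq. (2)
    post /= post.sum(axis=1, keepdims=True)   # normalizing constant of Eq. (2)
    return post, post.argmax(axis=1)
```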
The framework is closely related to other discriminant analysis methods. If the covariance matrices are constrained to be the same across the classes, then standard linear discriminant analysis (LDA) is recovered. On the other hand, if the covariance matrices are unconstrained, the method corresponds to standard quadratic discriminant analysis (QDA, McLachlan 2004; Fraley and Raftery 2002).
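The difference between the two lies entirely in the covariance estimates, as the following NumPy sketch illustrates (an assumed illustration, not an implementation from the text): QDA estimates one covariance matrix per class, while LDA pools the within-class scatter into a single shared matrix.

```python
import numpy as np

def class_covariances(x, labels, C):
    """QDA-style estimates: one covariance matrix per class."""
    return [np.cov(x[labels == c], rowvar=False) for c in range(C)]

def pooled_covariance(x, labels, C):
    """LDA-style estimate: a single covariance matrix shared by all
    classes, obtained by pooling the within-class scatter matrices."""
    n, p = x.shape
    scatter = np.zeros((p, p))
    for c in range(C):
        xc = x[labels == c] - x[labels == c].mean(axis=0)
        scatter += xc.T @ xc
    return scatter / (n - C)
```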
Several extensions of this framework have been proposed in the literature in order to increase its flexibility and scope. For example, Hastie and Tibshirani (1996) consider the case where each class density is itself a mixture of Gaussian distributions with a common covariance matrix and a known number of components.
Fraley and Raftery (2002) further generalize this approach, allowing the covariance matrices to be different across the sub-groups and applying model-based clustering to the observations of each class. Another approach, eigenvalue decomposition discriminant analysis (EDDA, Bensmail and Celeux 1996), is based on the family of parsimonious Gaussian models of Celeux and Govaert (1995), which imposes cross-constraints on the eigen-decomposition of the class covariance matrices (recalled below). This latter approach allows more flexibility than LDA, and is more structured than QDA and the methods of Fraley and Raftery (2002), which can be over-parameterized.
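For reference, this decomposition writes each class covariance matrix as (a standard parameterization, stated here with our own notation):
$$\begin{aligned} {\varvec{\Sigma }}_c = \lambda _c \, \mathbf {D}_c \, \mathbf {A}_c \, \mathbf {D}_c^{\top }, \end{aligned}$$
where \(\lambda _c = |{\varvec{\Sigma }}_c|^{1/p}\) controls the volume of class \(c\) (with \(p\) the dimension of the data), the orthogonal matrix \(\mathbf {D}_c\) of eigenvectors controls its orientation, and the diagonal matrix \(\mathbf {A}_c\) of scaled eigenvalues (with \(|\mathbf {A}_c| = 1\)) controls its shape. Constraining each factor to be common or class-specific yields the parsimonious models of Celeux and Govaert (1995).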
In high-dimensional settings, different approaches have been proposed based on regularization and variable selection. Friedman (1989) and Xu et al. (2009) propose regularized versions of discriminant analysis where a shrinkage parameter is introduced to control the degree of regularization between LDA and QDA.
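A common way to phrase such a shrinkage, shown here purely for illustration (the exact estimators differ across the cited papers), interpolates between the class-specific and pooled covariance estimates:
$$\begin{aligned} \hat{{\varvec{\Sigma }}}_c(\lambda ) = (1 - \lambda ) \, \hat{{\varvec{\Sigma }}}_c + \lambda \, \hat{{\varvec{\Sigma }}}, \qquad \lambda \in [0, 1], \end{aligned}$$
so that \(\lambda = 1\) recovers the pooled estimate of LDA and \(\lambda = 0\) the class-specific estimates of QDA.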
Le et al. (2020) and Sun and Zhao (2015) define frameworks where a penalty term is introduced and the classes are characterized by sparse inverse covariance matrices. It is also worth mentioning that, for high-dimensional data, the framework of discriminant analysis has often been phrased in terms of sparse discriminant vectors; see, for example, Clemmensen et al. (2011), Mai et al. (2012), Safo and Ahn (2016), Jiang et al. (2018), and Qin (2018).