Pattern Recognition Letters

Volume 31, Issue 9, 1 July 2010, Pages 1048-1055

Joint discriminative–generative modelling based on statistical tests for classification

https://doi.org/10.1016/j.patrec.2010.01.015

Abstract

In statistical pattern classification, generative approaches, such as linear discriminant analysis (LDA), assume a data-generating process (DGP), whereas discriminative approaches, such as linear logistic regression (LLR), do not model the DGP. In general, a generative classifier performs better than its discriminative counterpart if the DGP is well-specified and worse than the latter if the DGP is clearly mis-specified. In view of this, this paper presents a joint discriminative–generative modelling (JoDiG) approach, by partitioning predictor variables X into two sub-vectors, namely XG, to which a generative approach is applied, and XD, to be treated by a discriminative approach. This partitioning of X is based on statistical tests of the assumed DGP: the variables that clearly fail the tests are grouped as XD and the rest as XG. Then the generative and discriminative approaches are combined in a probabilistic rather than a heuristic way. The principle of the JoDiG approach is quite generic, but for illustrative purposes numerical studies of the paper focus on a widely-used case, in which the DGP assumes a multivariate normal distribution for each class. In this case, the JoDiG approach uses LDA for XG and LLR for XD. Numerical experiments on real and simulated data demonstrate that the performance of this new approach to classification is similar to or better than that of its discriminative and generative counterparts, in particular when the size of the training-set is comparable to the dimension of the data.

Introduction

The objective of statistical pattern classification is to assign a new individual to one of several pre-specified classes (Venables and Ripley, 2002). Let $X=(x_1,\ldots,x_q)$ represent an individual with $q$ features and let $y$, a categorical variable, label the class of $X$. As the value of $y$ is unknown for a new individual $X$, we often assign $X$ to class $\hat{y}$ using the maximum a posteriori criterion: $\hat{y}=\arg\max_{y} p(y \mid X,\alpha)$.

The parameter vector $\alpha$ in the conditional distribution $p(y \mid X,\alpha)$ is in general estimated from a training-set of $n$ labelled, independent individuals $\{X_i=(x_{i1},\ldots,x_{iq})\}_{i=1}^{n}$ together with their labels $\{y_i\}_{i=1}^{n}$. The estimation can be done by a generative or a discriminative approach.

Discriminative approaches estimate $\alpha$ by maximisation of the conditional log-likelihood $\ell(\alpha)=\sum_{i=1}^{n}\log p(y_i \mid X_i,\alpha)$. A typical discriminative approach is linear logistic regression (LLR) for two-class discrimination, in which the two classes have labels $y=1$ and $y=0$, say, and
$$p(y \mid X,\alpha)=\frac{\exp\{g(X,\alpha)\,y\}}{1+\exp\{g(X,\alpha)\}},\qquad
g(X,\alpha)=\log\frac{p(y=1 \mid X,\alpha)}{p(y=0 \mid X,\alpha)}=\beta_0+\beta^{T}X.$$
The decision boundary is given by the hyperplane $g(X,\alpha)=0$. In this case, $\alpha$ comprises the coefficients $\beta_0$ and $\beta$. In short, discriminative approaches aim to estimate directly the decision boundary. However, since no data-generating process (DGP) $p(X \mid y)$ is assumed, they do not fully exploit information provided by the joint distribution $p(X,y)$.
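
As a concrete illustration (ours, not the paper's), the following minimal Python sketch fits LLR by essentially unpenalised maximum likelihood and recovers the roles of $\beta_0$ and $\beta$; the data are purely hypothetical.

```python
# A minimal sketch (hypothetical data): linear logistic regression (LLR)
# fitted as a discriminative classifier, i.e. by maximising the conditional
# log-likelihood; a very large C makes the fit effectively unpenalised.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                                # hypothetical features
y_train = (X_train[:, 0] + rng.normal(size=200) > 0).astype(int)   # hypothetical labels

llr = LogisticRegression(C=1e6).fit(X_train, y_train)
beta0, beta = llr.intercept_[0], llr.coef_[0]

# Posterior p(y=1 | X) via the logistic transform of g(X) = beta0 + beta^T X;
# the decision boundary is the hyperplane g(X) = 0.
g = beta0 + X_train @ beta
p_y1 = 1.0 / (1.0 + np.exp(-g))
```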

Generative approaches estimate $\alpha$ in a two-stage scheme. First, a DGP $p(X \mid y,\theta)$ and class prior probabilities $p(y \mid \pi)$ are assumed, and the parameter vectors $\theta$ and $\pi$ are estimated by maximisation of the joint log-likelihood $\ell(\theta,\pi)=\sum_{i=1}^{n}\log\{p(X_i \mid y_i,\theta)\,p(y_i \mid \pi)\}$. Secondly, $\alpha$ and $p(y \mid X,\alpha)$ can be obtained through Bayes’ theorem:
$$p(y \mid X,\alpha)=\frac{p(X \mid y,\theta)\,p(y \mid \pi)}{\sum_{y'} p(X \mid y',\theta)\,p(y' \mid \pi)}.$$
A typical generative approach is normal-based linear or quadratic discriminant analysis (LDA or QDA), in which the DGP $p(X \mid y,\theta)$ is assumed to correspond to a multivariate normal distribution for each class $y$. In this case, $\theta$ comprises class means and covariance matrices. In short, the generative approaches aim to exploit fully the information contained in $p(X,y)$. However, their performance may be degraded if the DGP is wrongly specified.
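
For contrast, a minimal sketch (again ours, with hypothetical data shapes) of the two-stage generative scheme for normal-based LDA: estimate the class means, a pooled covariance matrix and the class priors, then convert the class-conditional densities into posterior probabilities by Bayes' theorem.

```python
# A minimal sketch of normal-based LDA as a generative classifier.
# Stage 1: estimate theta = (class means, pooled covariance) and priors pi.
# Stage 2: obtain p(y | X) from the assumed DGP via Bayes' theorem.
import numpy as np
from scipy.stats import multivariate_normal

def fit_lda(X, y):
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: float(np.mean(y == c)) for c in classes}
    centred = np.vstack([X[y == c] - means[c] for c in classes])
    pooled_cov = centred.T @ centred / (len(X) - len(classes))   # common Sigma
    return means, pooled_cov, priors

def lda_posterior(x, means, pooled_cov, priors):
    # p(y | x) is proportional to p(x | y, theta) * p(y | pi).
    joint = {c: multivariate_normal.pdf(x, mean=means[c], cov=pooled_cov) * priors[c]
             for c in priors}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}
```

Assigning a new $x$ to the class with the largest posterior then implements the maximum a posteriori rule stated above.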

In fact, both discriminative and generative approaches can be derived from factorisation of the joint distribution $p(X,y)$: a factorisation $p(X,y)=p(y \mid X)\,p(X)$ leads to discriminative approaches, which assume the form of the posterior probabilities $p(y \mid X)$ for classification; another factorisation $p(X,y)=p(X \mid y)\,p(y)$ leads to generative approaches, which assume the DGP $p(X \mid y)$.

The terms ‘generative (or informative) approach’ and ‘discriminative approach’ are the new terminology for what Dawid (1976) called respectively the ‘sampling paradigm’ and the ‘diagnostic paradigm’. A considerable number of comparisons of these two paradigms have been reported, including, among others, Efron (1975), O’Neill (1980), Titterington et al. (1981), Rubinstein and Hastie (1997), Ng and Jordan (2001) and Xue and Titterington (2008).

Given the advantages and disadvantages of using either a generative or a discriminative approach, in recent years many hybrid methods have been proposed in order to exploit the best of both worlds (Rubinstein, 1998; Raina et al., 2003; Bouchard and Triggs, 2004; McCallum et al., 2006; Bishop and Lasserre, 2007). Some comments about these methods can be found in Xue and Titterington (2009, 2010).

Since a generative classifier assumes a DGP whereas its discriminative counterpart does not, a general observation is as follows: a generative classifier performs better than its discriminative counterpart if the DGP is well-specified (but not necessarily perfectly-specified, given the bias-variance trade-off); the former performs worse than the latter if the DGP is clearly mis-specified.

Ng and Jordan (2001) presented some theoretical and empirical comparisons of LLR and the normal-based naïve Bayes classifier, which is a generative approach equivalent to LDA or QDA for conditionally independent feature variables. Their results suggested that, for the two approaches, there were two distinct regimes of relative classification performance with respect to the training-set size: the discriminative classifier performs better with larger training-sets whereas the generative classifier does better with smaller training-sets. We conjecture that such an important pattern may be explained to some extent by the following: for empirical data, any mis-specification of the DGP becomes more apparent as the amount of training data increases, and thus the discriminative classifier is favoured for larger training sets.

In this context, we present a new hybrid method, which we call a joint discriminative–generative modelling (JoDiG) approach to classification. The basic ideas of the approach are as follows. First, it uses a discriminative approach for a sub-vector XD of X, where XD contains the variables that clearly violate the assumptions underlying the proposed DGP; secondly, it uses a generative approach for the remaining variables XG of X; and, thirdly, these two approaches are combined in a probabilistic way, by factorisation of the joint distribution p(X,y), rather than in an ad hoc way.

The assumptions underlying the JoDiG approach are twofold. First, it must be possible to test the DGP, partially if not fully. Secondly, XG and XD are assumed (block-wise) conditionally independent given the class $y$, such that $p(X_G \mid X_D,y)=p(X_G \mid y)$. Within XG or XD, the individual variables are not necessarily assumed conditionally independent.
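
Under this conditional-independence assumption the joint distribution factorises as $p(X,y)=p(X_G \mid y)\,p(y \mid X_D)\,p(X_D)$, so the posterior needed for classification satisfies $p(y \mid X)\propto p(X_G \mid y)\,p(y \mid X_D)$. The sketch below illustrates the resulting pipeline; the per-class, per-variable Shapiro–Wilk test is only a simple stand-in for the statistical tests of the assumed DGP, and the data, significance level and helper names are hypothetical rather than the paper's exact procedure.

```python
# A schematic sketch of the JoDiG pipeline under the assumptions above:
# (i) partition the variables by testing the assumed (normal) DGP,
# (ii) fit a generative model (LDA-style) on X_G and LLR on X_D,
# (iii) combine them probabilistically: p(y | X) ~ p(X_G | y) * p(y | X_D).
# The normality test used here is only an illustrative stand-in.
import numpy as np
from scipy.stats import shapiro, multivariate_normal
from sklearn.linear_model import LogisticRegression

def partition_variables(X, y, alpha=0.05):
    """Return indices of variables kept for the generative part (G) and of
    variables that clearly fail the per-class normality tests (D)."""
    G, D = [], []
    for j in range(X.shape[1]):
        p_vals = [shapiro(X[y == c, j])[1] for c in np.unique(y)]
        (G if min(p_vals) >= alpha else D).append(j)
    return G, D

def fit_jodig(X, y, G, D):
    # Generative part on X_G: class means and a pooled covariance matrix.
    classes = np.unique(y)
    means = {c: X[y == c][:, G].mean(axis=0) for c in classes}
    centred = np.vstack([X[y == c][:, G] - means[c] for c in classes])
    cov = centred.T @ centred / (len(X) - len(classes))
    # Discriminative part on X_D: linear logistic regression for p(y | X_D).
    llr = LogisticRegression(C=1e6).fit(X[:, D], y)
    return means, cov, llr

def jodig_posterior(x, G, D, means, cov, llr):
    # p(y | X) is proportional to p(X_G | y) * p(y | X_D); the class prior is
    # already contained in p(y | X_D), so it is not multiplied in again.
    p_y_given_xd = llr.predict_proba(x[D].reshape(1, -1))[0]
    joint = {c: multivariate_normal.pdf(x[G], mean=means[c], cov=cov) * p_y_given_xd[i]
             for i, c in enumerate(llr.classes_)}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}
```

If the tests leave XD (or XG) empty, the scheme reduces to the purely generative (or purely discriminative) classifier.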

In Section 2, we describe the motivation, algorithmic details, interpretation, computational complexity and extensions of the JoDiG approach. Then, in Section 3, we illustrate the JoDiG approach in a widely-used scenario and apply it to real and simulated data. Finally, in Section 4, we discuss closely-related work and other models that can also be derived by factorisation of p(X,y).

Section snippets

Motivation

The JoDiG approach is motivated by the general observation that a generative classifier performs better than its discriminative counterpart when the DGP is well-specified and worse than the latter when the DGP is clearly mis-specified. It also directly pursues a suggestion made, but not developed, by Rubinstein and Hastie (1997): “it is best to use an informative (generative) approach if confidence in the model correctness is high. This suggests a promising way of combining the two approaches:

Methodology

The principle of the JoDiG approach is quite generic, but for illustration numerical studies in this paper focus on a widely-used case: the DGP assumes a multivariate normal distribution for each class, i.e., $X_G \mid y \sim N(\mu_y,\Sigma_y)$. Suppose there are only two classes, $C_1$ (with $y=1$) and $C_0$ (with $y=0$); for multi-class extensions, see Section 2.4. For linear classifiers, it is assumed that $\Sigma_1=\Sigma_0=\Sigma$.
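
A short rearrangement (ours, not quoted from the paper) makes the structure of the resulting classifier explicit. Under the factorisation $p(X,y)=p(X_G \mid y)\,p(y \mid X_D)\,p(X_D)$ implied by the conditional-independence assumption, with $X_G \mid y \sim N(\mu_y,\Sigma)$ for the generative part and LLR for $p(y \mid X_D)$, the log posterior-odds split into an LDA term in $X_G$ and an LLR term in $X_D$:
$$
\log\frac{p(y=1 \mid X)}{p(y=0 \mid X)}
=\log\frac{p(X_G \mid y=1)}{p(X_G \mid y=0)}
+\log\frac{p(y=1 \mid X_D)}{p(y=0 \mid X_D)}
=(\mu_1-\mu_0)^{T}\Sigma^{-1}X_G
-\tfrac{1}{2}(\mu_1+\mu_0)^{T}\Sigma^{-1}(\mu_1-\mu_0)
+\beta_0+\beta^{T}X_D,
$$
so the combined decision boundary is again a hyperplane in $X$.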

The (multivariate) normal distribution is the most widely-assumed DGP for statistical classification, as the

Closely-related work

Kang and Tian (2006) proposed a hybrid generative/discriminative Bayesian (HBayes) classifier, which is closely related to our JoDiG approach.

The HBayes method constructs an iterative, heuristic partition of X. It starts with an empty $X_D^{(0)}$ (i.e., $X_G^{(0)}=X$). Then, in the $t$-th iteration, it moves a single variable $x_j$ from $X_G^{(t-1)}$ into $X_D^{(t-1)}$ such that moving this variable produces the greatest improvement. The procedure is continued until no such variable can be found. Therefore, in order to select
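
For comparison with JoDiG's one-shot, test-based partition, here is a schematic Python sketch of such a greedy search; `evaluate(G, D)` is a hypothetical placeholder for whatever performance criterion (e.g., cross-validated accuracy) scores a candidate partition, and is not taken from Kang and Tian (2006).

```python
# Schematic sketch of an HBayes-style greedy partition of the variables.
# `evaluate(G, D)` is a hypothetical scoring function for a hybrid classifier
# built with generative variables G and discriminative variables D.
def greedy_partition(num_vars, evaluate):
    G, D = list(range(num_vars)), []            # start with an empty X_D
    best = evaluate(G, D)
    improved = True
    while improved and G:
        improved = False
        # Score every single-variable move from X_G into X_D ...
        scores = {j: evaluate([g for g in G if g != j], D + [j]) for j in G}
        j_star = max(scores, key=scores.get)
        if scores[j_star] > best:               # ... and keep only the best one.
            G.remove(j_star)
            D.append(j_star)
            best = scores[j_star]
            improved = True
    return G, D
```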

Conclusions

Based on a general observation that a generative classifier performs better than its discriminative counterpart if the DGP is well-specified and worse than the latter if the DGP is clearly mis-specified, this paper has presented a JoDiG approach. The approach partitions variables into two sub-vectors based on statistical tests of the assumed DGP, then uses a discriminative approach for the variables which clearly failed the tests and a generative approach for the other variables, and finally

Acknowledgments

The authors thank the reviewers, the associate editor and Professor David J. Hand for their suggestions that have enhanced the comprehensiveness, structure and composition of the manuscript. This work was partly supported by the award of a Hutchison Whampoa-EPSRC Dorothy Hodgkin Postgraduate Award to J.H.X. It also benefited from our participation in the Research Programme on ‘Statistical Theory and Methods for Complex, High-Dimensional Data’ at the Isaac Newton Institute for Mathematical

References (25)

  • Xue, J.-H., Titterington, D.M., 2009. Interpretation of hybrid generative/discriminative algorithms. Neurocomputing.
  • Xue, J.-H., Titterington, D.M., 2010. On the generative–discriminative tradeoff approach: interpretation, asymptotic efficiency and classification performance. Comput. Statist. Data Anal.
  • Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. University of California, School of Information and...
  • Bishop, C.M., Lasserre, J., 2007. Generative or discriminative? Getting the best of both worlds (with discussion).
  • Bouchard, G., Triggs, B., 2004. The tradeoff between generative and discriminative classifiers.
  • Dawid, A.P., 1976. Properties of diagnostic data distributions. Biometrics.
  • Efron, B., 1975. The efficiency of logistic regression compared to normal discriminant analysis. J. Amer. Statist. Assoc.
  • Friedman, J.H., 1996. Another Approach to Polychotomous Classification. Tech. Rep., Stanford...
  • Hand, D.J. et al., 2001. Idiot’s Bayes – Not so stupid after all? Internat. Statist. Rev.
  • Hastie, T. et al., 1996. Discriminant analysis by Gaussian mixtures. J. Roy. Statist. Soc. Ser. B.
  • Hastie, T. et al., 1998. Classification by pairwise coupling. Ann. Statist.
  • Kang, C., Tian, J., 2006. A hybrid generative/discriminative Bayesian classifier.