Assessment of two approximation methods for computing posterior model probabilities

https://doi.org/10.1016/j.csda.2004.01.005

Abstract

Model selection is an important problem in statistical applications. Bayesian model averaging provides an alternative to classical model selection procedures and allows researchers to consider several models from which to draw inferences. In the multiple linear regression case, it is difficult to compute the exact posterior model probabilities required for Bayesian model averaging. To reduce the computational burden, the Laplace approximation and an approximation based on the Bayesian information criterion (BIC) have been proposed. The BIC approximation is the easiest to calculate and is widely used in applications. In this paper we conduct a simulation study to determine which approximation performs better. We give an example where the methods differ, study the performance of these methods on randomly generated models and explore some of the features of the approximations. Our simulation study suggests that the Laplace approximation performs better on average than the BIC approximation.

Introduction

Model building in multiple linear regression (MLR) often requires assessing which subset of variables creates the “best” model. Methods such as stepwise, forward and backward selection, PRESS, Mallows' Cp, Akaike's information criterion and Schwarz's Bayesian information criterion have been developed to address this issue. Once a “best” model is selected, all inferences are based on this model as if it were the true model; no uncertainty is associated with the modelling process itself. In the recent literature efforts have been made to incorporate this model uncertainty into the analysis using posterior model probabilities (see Kass and Raftery, 1995; Madigan and York, 1995; Raftery, 1996; Raftery et al., 1997; Hoeting et al., 1999).

Posterior model probabilities have many uses in modelling. They can be used to select the highest probability model, to aid variable assessment and for prediction. The main justification for using any of the aforementioned methods is to formally incorporate model uncertainty into the analysis. By incorporating this uncertainty we arrive at inferences that more accurately reflect all the uncertainties associated with the analysis. This type of analysis has been growing in popularity in applications. For example, Murphy and Wang (2001) considered an infant survival application, and Viallefont et al. (2001) employed these methods in case–control studies. In addition, Clyde (2000), Lamon and Clyde (2000), Noble (2000) and Lipkovich (2002) explored ecological data sets with these methods.

In this paper we examine the performance of two methods for approximating posterior model probabilities: the Laplace approximation and the Bayesian information criterion (BIC) approximation. We focus on the multiple linear regression setting to evaluate their performance. Via simulation and example we show that the Laplace approximation is more accurate than BIC for moderate to large sample sizes. We then apply the Laplace approximation, the BIC method and the exact method to an ecological data set collected by the State of Ohio Environmental Protection Agency. We use this case study to illustrate the importance of accurate posterior model probability approximations; it suggests that the BIC method can perform relatively poorly in practice.

Computing the exact posterior model probabilities is computationally intensive. The calculation involves the inverse and determinant of an n×n matrix, where n is the sample size, and it must be repeated for each model. In the situation where we have k candidate variables, there are 2^k models to consider, so the inverse and determinant would need to be calculated 2^k times. As n grows large the computation time becomes infeasible; hence approximations are needed to reduce the computation time.
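To make the size of this enumeration concrete, the short sketch below (in Python, rather than the authors' Visual Basic implementation; names are illustrative) lists all 2^k candidate first-order models as subsets of the predictor indices. In the exact calculation, each of these subsets would require its own n×n determinant and inverse.

```python
from itertools import combinations

def all_subset_models(k):
    """Enumerate all 2**k first-order models as tuples of predictor indices;
    the empty tuple corresponds to the intercept-only model."""
    models = []
    for size in range(k + 1):
        models.extend(combinations(range(k), size))
    return models

print(len(all_subset_models(5)))  # 2**5 = 32 candidate models
```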

From experience, we have observed disagreement between the BIC and the Laplace method for calculating posterior model probabilities. Weakliem (1999) noted that the BIC approximation to Bayes factors is too conservative. Raftery (1999) stated that the BIC approximation method is a “crude” approximation, however, BIC values are readily available on standard computer output. Noble (2000) observed poor performance of the BIC method for small sample sizes and suggested corrections to improve performance. Noble's adjustment reduced the penalty associated with BIC. Noble (2000) did not directly compare the BIC method or the Laplace method to the exact probabilities. Both the BIC method and the Laplace method are asymptotically accurate methods for determining Bayes factors, which can be used to determine the posterior model probabilities (Tierney and Kadane, 1986; Raftery, 1996; Kass and Wasserman, 1995). Chickering and Heckerman (1997) consider the performance of the BIC and Laplace methods for finite mixture models and find the BIC method does not perform well in that case. Our focus is on the multiple linear regression setting. Since these are asymptotic approximations, we need to understand how these methods perform in small to moderate sample sizes in order to know when to apply each method.

In this paper, we present an extensive simulation study of the accuracy of using the BIC and Laplace approximation in the context of Bayesian model averaging (BMA). In the remainder of this section we discuss BMA, the BIC and Laplace approximations in this context. The next section contains an example where the BIC and Laplace methods lead to different conclusions when used for single model selection via the highest probability model and when used for variable assessment. In Section 3, we consider the performance for randomly generated models. We conduct a study of the accuracies in the random case by varying the number of “significant” variables in the model in Section 4.

All algorithms were coded in Microsoft Visual Basic for Applications. We use the RanDev.dll library for all pseudo-random number generation. This requires the programs to run in a PC environment. All programs are available from the authors.

Suppose we have k candidate predictor variables X1, X2, …, Xk and a single response Y. In this situation we have 2^k first-order models which can be formed from these predictors. Let M be the set of all possible models and let Mi denote the ith model in the set M. The cardinality or size of M is denoted by |M|.

Once we collect data D, we can determine for each model Mi ∈ M the posterior probability of Mi,

P(M_i \mid D) = \frac{P(M_i)\, P(D \mid M_i)}{\sum_{M_j \in M} P(M_j)\, P(D \mid M_j)}, \qquad (1)

where P(Mi) is the prior probability of model Mi and

P(D \mid M_i) = \int L(D \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i, \qquad (2)

with L(D | θi, Mi) being the likelihood of the data given the parameter vector θi for model Mi and P(θi | Mi) being the prior probability density for θi given model Mi. The calculation of the exact quantities in Eq. (1) is computationally intensive.
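As a minimal sketch of Eq. (1), assuming the per-model marginal likelihoods P(D|Mi) have already been computed (on the log scale, for numerical stability), the following Python function normalises prior times marginal likelihood over the model set. The function and argument names are ours, not from the paper.

```python
import numpy as np

def posterior_model_probs(log_marglik, prior=None):
    """Eq. (1): P(M_i|D) proportional to P(M_i) * P(D|M_i), normalised over models.

    log_marglik : array of log P(D|M_i), one entry per model
    prior       : array of prior model probabilities P(M_i); uniform if None
    """
    log_marglik = np.asarray(log_marglik, dtype=float)
    if prior is None:
        prior = np.full(log_marglik.shape, 1.0 / log_marglik.size)
    logw = np.log(prior) + log_marglik
    logw -= logw.max()          # guard against underflow before exponentiating
    w = np.exp(logw)
    return w / w.sum()
```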

For large model spaces, P(Mi|D) can be estimated directly by the Markov chain Monte Carlo methods used by Madigan and York (1995), Raftery et al. (1997), Noble (2000) and Lipkovich (2002). To accomplish this, a Markov chain is created in which each state is a model M. Transitions from model Mi to model Mj are governed by the acceptance probability

\alpha = \min\left\{1,\; \frac{P(M_j \mid D)}{P(M_i \mid D)}\right\}.

This method depends on the accuracy of P(Mi|D). Noble (2000) and Lipkovich (2002) use the BIC approximation to determine P(Mi|D).
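The sketch below illustrates such a sampler in Python under the common proposal that flips a single variable in or out of the current model, so that the acceptance probability reduces to min{1, P(Mj|D)/P(Mi|D)}. The function `log_post`, returning log P(M|D) up to a constant, is assumed to be supplied, for example via one of the approximations discussed below.

```python
import numpy as np

def mc3_sample(log_post, k, n_iter=10000, rng=None):
    """Metropolis sampler over model indicators gamma in {0,1}^k.

    log_post(gamma) must return log P(M_gamma | D) up to an additive constant.
    The visit frequencies estimate the posterior model probabilities."""
    rng = np.random.default_rng(rng)
    gamma = rng.integers(0, 2, size=k)               # random starting model
    current = log_post(gamma)
    visits = {}
    for _ in range(n_iter):
        proposal = gamma.copy()
        j = rng.integers(k)
        proposal[j] = 1 - proposal[j]                # add or drop one predictor
        cand = log_post(proposal)
        if np.log(rng.uniform()) < cand - current:   # accept with prob min{1, ratio}
            gamma, current = proposal, cand
        key = tuple(int(g) for g in gamma)
        visits[key] = visits.get(key, 0) + 1
    total = sum(visits.values())
    return {m: c / total for m, c in visits.items()}
```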

The accuracy of P(Mi|D), for a given data set D, is crucial to ensure that the posterior inferences are correct. In the model selection context, the model Mi with the highest value of P(Mi|D) is selected as the best model. In the model averaging context, for a quantity of interest Δ, the posterior model probabilities enter through the law of total probability:

P(\Delta \mid D) = \sum_{i=1}^{|M|} P(\Delta \mid M_i, D)\, P(M_i \mid D).
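A small illustration of how these probabilities are used, assuming per-model posterior summaries of Δ (for example, predicted means) are already available; the helper for posterior inclusion probabilities reflects the variable-assessment use mentioned earlier. All names here are ours.

```python
import numpy as np

def model_average(per_model_estimates, model_probs):
    """Posterior mean of Delta via the law of total probability:
    E[Delta | D] = sum_i E[Delta | M_i, D] * P(M_i | D)."""
    return float(np.dot(per_model_estimates, model_probs))

def inclusion_probability(models, model_probs, j):
    """P(X_j is in the model | D): sum of P(M_i | D) over models containing predictor j.
    `models` is a list of tuples of predictor indices, as in the enumeration sketch above."""
    return sum(p for m, p in zip(models, model_probs) if j in m)
```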

In the regression setting our parameter vector is θi = (βi, σ²), where βi is the regression coefficient vector with βij = 0 if Xj ∉ Mi. The prior probability density for θi needs to be proper (i.e. ∫ p(θi | Mi) dθi = 1) in order for Eq. (1) to exist.

For linear models Y = Xβ, Eq. (2) can only be determined analytically for a few special prior distributions. If we cannot obtain (2) analytically, approximations are needed or numerical integration can be employed. In the special case of a normal error regression model, the probability expressed in (2) can be determined analytically when a normal prior is used for the β's and a gamma prior for 1/σ² (see Raiffa and Schlaifer, 1961). Suppose for each model Mi the normal-gamma prior is used, i.e. βi is normally distributed with some mean μi and variance Vi, and σi² ∼ λν/χ²_ν, where λ is the mean and ν represents the degrees of freedom. In this situation we can determine (2) by the following:

P(D \mid \mu_i, V_i, \nu, X_i, M_i) = \frac{\Gamma((\nu+n)/2)\,(\nu\lambda)^{\nu/2}}{\pi^{n/2}\,\Gamma(\nu/2)\,\left|I + X_i V_i X_i'\right|^{1/2}} \left[\lambda\nu + (Y - X_i\mu_i)'(I + X_i V_i X_i')^{-1}(Y - X_i\mu_i)\right]^{-(\nu+n)/2},

where Xi is the corresponding design matrix for model Mi, and Γ(·) is the gamma function. For moderate to large model spaces or large sample sizes this method is computationally intensive, since the determinant and the inverse of the n×n matrix I + Xi Vi Xi′ must be computed for each model. This limits its usability in large sample size problems.
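The sketch below evaluates this closed form on the log scale, assuming the prior is exactly as stated (βi normal with mean μi and variance Vi, σ² ∼ λν/χ²_ν). It builds the n×n matrix I + Xi Vi Xi′ explicitly, which is precisely the step that makes the exact method expensive for large n; function and argument names are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_exact_marglik(y, X, mu, V, lam, nu):
    """Log of P(D | mu, V, nu, X, M) under the normal-gamma prior
    (Raiffa and Schlaifer, 1961), following the closed form above."""
    n = y.shape[0]
    S = np.eye(n) + X @ V @ X.T                 # the n x n matrix I + X V X'
    _, logdet = np.linalg.slogdet(S)
    resid = y - X @ mu
    quad = resid @ np.linalg.solve(S, resid)    # (Y - X mu)' S^{-1} (Y - X mu)
    return (gammaln((nu + n) / 2.0) - gammaln(nu / 2.0)
            + 0.5 * nu * np.log(nu * lam)
            - 0.5 * n * np.log(np.pi)
            - 0.5 * logdet
            - 0.5 * (nu + n) * np.log(nu * lam + quad))
```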

Tierney and Kadane (1986) proposed using the Laplace approximation to evaluate (2) and showed that this approximation is of order O(1/n). The Laplace approximation for a unimodal function p(θ) is given by

\int p(\theta)\, d\theta \approx p(\hat\theta)\,(2\pi)^{d/2}\,\left|I(\hat\theta)^{-1}\right|^{1/2},

where θ̂ is the mode of the function p(θ), d is the dimensionality of θ and I(θ̂) is the observed information matrix evaluated at θ̂. Standard statistical software cannot readily calculate the posterior mode, nor the information matrix of a posterior distribution. However, the mode θ̂ can easily be calculated using methods such as Newton–Raphson, Fisher scoring, steepest ascent or combinations of these.
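A generic sketch of this approximation in Python: the mode is located numerically (here with BFGS rather than the Newton–Raphson, scoring or steepest-ascent options mentioned above) and the observed information is taken as a finite-difference Hessian of −log p at the mode. The unnormalised log density `log_p` and the starting value are assumed to be supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize

def log_laplace_integral(log_p, theta0, eps=1e-5):
    """Laplace approximation to log(integral of p(theta) d theta):
    log p(theta_hat) + (d/2) log(2 pi) - (1/2) log |I(theta_hat)|."""
    def neg(t):
        return -log_p(t)
    theta_hat = minimize(neg, np.asarray(theta0, dtype=float), method="BFGS").x
    d = theta_hat.size

    # central finite-difference Hessian of -log p (the observed information)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            tpp = theta_hat.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = theta_hat.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = theta_hat.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = theta_hat.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (neg(tpp) - neg(tpm) - neg(tmp) + neg(tmm)) / (4.0 * eps ** 2)

    _, logdet_info = np.linalg.slogdet(H)
    return log_p(theta_hat) + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet_info
```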

Raftery (1996), Kass and Raftery (1995) and Kass and Wasserman (1995) suggest approximating (2) using a function of the BIC developed by Schwarz (1978):

P(D \mid M_i) \approx e^{\mathrm{BIC}_i/2},

where BIC_i = 2\ln p(D \mid \hat\theta_i, M_i) - d\ln(n) is the Bayesian information criterion for model Mi and d is the dimension of the model (Schwarz, 1978). Raftery (1996) and Kass and Wasserman (1995) showed that this approximation is asymptotically accurate for computing a Bayes factor for nested hypotheses and is of order O(1) under the unit-information prior distribution on β. A unit information prior is one in which the prior distribution for β contains the same amount of information about β as is available in a single observation. In the normal case, for testing a nested null hypothesis H0: φ = φ0, the unit information prior is φ ∼ N(φ0, Σφ) with Σφ^{-1} = Iφφ(β, φ0), where Iφφ(β, φ0) denotes the sub-matrix of the Fisher information matrix corresponding to φ in the restricted likelihood. The popularity of this approximation is due to the fact that it can be calculated from the BIC values conveniently available in standard statistical software output. Extensions of this approximation to multivariate analysis have been studied by Noble (2000) for principal components analysis; Lipkovich (2002) used this method for canonical correspondence analysis and cluster analysis.
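For the normal-error regression models considered here, the BIC approximation needs only ordinary least squares quantities. The sketch below computes log P(D|Mi) ≈ BIC_i/2 from the maximised Gaussian log-likelihood; counting the error variance in the dimension d is one common convention and, like the variable names, is our choice rather than the paper's.

```python
import numpy as np

def log_bic_marglik(y, X):
    """log P(D | M_i) approximated by BIC_i / 2, where
    BIC_i = 2 * (max log-likelihood) - d * log(n) for the Gaussian linear model."""
    n = y.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    loglik = -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)  # MLE sigma^2 = RSS/n
    d = X.shape[1] + 1                      # regression coefficients plus sigma^2
    return 0.5 * (2.0 * loglik - d * np.log(n))
```

These per-model values can be passed to the normalisation sketch after Eq. (1) to give approximate posterior model probabilities.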

As mentioned earlier, the BIC and Laplace methods are asymptotically valid for determining Bayes factors. However, although both methods are asymptotically correct, we usually do not know the sample size required for an accurate approximation. We are interested in which method yields better quality results under the same conditions. To measure the accuracy of the approximations three measures are used: weighted L1 and L2 distances and the Hellinger distance. The L2 distance of the probability function P1 relative to the probability function P2 is given by

L_2(P_1, P_2) = \left\{ \sum_{M_i \in M} \left[P_1(M_i \mid D) - P_2(M_i \mid D)\right]^2 P_2(M_i \mid D) \right\}^{1/2}.

The L1 distance is given by

L_1(P_1, P_2) = \sum_{M_i \in M} \left|P_1(M_i \mid D) - P_2(M_i \mid D)\right| P_2(M_i \mid D).

Finally, the discrete Hellinger distance is given by

d_H(P_1, P_2) = \left\{ 2 - 2 \sum_{M_i \in M} \left[P_1(M_i \mid D)\, P_2(M_i \mid D)\right]^{1/2} \right\}^{1/2}.

The weighted L1 and L2 measures give more importance to, and have greater accuracy for, models of high probability with respect to P2. We also chose the Hellinger distance in order to include a non-weighted distance measure. In many cases P(Mi|D) ≈ 0, so the discrete analogue of the Kullback–Leibler distance is not appropriate. For all of these distance measures, values closer to zero correspond to a smaller distance between distributions. The L1 and L2 distances are bounded between 0 and 1 and the Hellinger distance is bounded between 0 and 2. To evaluate the methods, we deem distances less than 0.05 to be acceptable for L1 and L2 and less than 0.2 for the Hellinger distance. These values correspond to 95% of the total probability being assigned correctly.
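The three accuracy measures translate directly into code; in this sketch `p1` and `p2` are arrays of posterior model probabilities over the same ordering of models, with `p2` playing the role of the reference (exact) distribution.

```python
import numpy as np

def weighted_l2(p1, p2):
    """Weighted L2 distance of p1 relative to the reference p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sqrt(np.sum((p1 - p2) ** 2 * p2)))

def weighted_l1(p1, p2):
    """Weighted L1 distance of p1 relative to the reference p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sum(np.abs(p1 - p2) * p2))

def hellinger(p1, p2):
    """Discrete Hellinger distance between p1 and p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sqrt(2.0 - 2.0 * np.sum(np.sqrt(p1 * p2))))
```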

Section snippets

Example

To illustrate the importance of the accuracy of both methods, we employ each method on an environmental data set collected in Ohio. In this section we examine the highest probability models and the variable assessments from each of the methods on a biological data set provided by the Ohio Environmental Protection Agency (EPA). The Ohio EPA is interested in the health of the fish living in the streams and rivers of Ohio and how this is affected by environmental stress. It is especially important how

Random models

To understand how the methods perform in general we conducted the following simulation study. We allowed both β and σ² to be random. We used the 5-regressor case with the Xij i.i.d. N(0,1) for i = 1,…,5 and j = 1,…,n. We used the following distributions to sample the parameter values: βi i.i.d. N(0,4) and σ² ∼ χ²(1). To sample the random error εj we first sampled σ² and then sampled εj i.i.d. N(0, σ²).
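A sketch of one replicate of this data-generating step (five regressors, βi ∼ N(0,4), σ² ∼ χ²(1), errors N(0, σ²)); the sample size n and the seed are our choices for illustration.

```python
import numpy as np

def simulate_dataset(n, k=5, rng=None):
    """One simulated data set under the random-model design:
    X_ij ~ N(0,1), beta_i ~ N(0,4), sigma^2 ~ chi^2(1), eps_j ~ N(0, sigma^2)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, k))
    beta = rng.normal(0.0, 2.0, size=k)      # standard deviation 2, i.e. variance 4
    sigma2 = rng.chisquare(1)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = X @ beta + eps
    return X, y, beta, sigma2

X, y, beta, sigma2 = simulate_dataset(n=50, rng=0)   # one replicate with n = 50
```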

We again used a normal-gamma prior distribution for the regression parameters. For β we used βi i.i.d. N(0,10),

Exploration

We also wish to understand how each method performs when the number of “significant” variables varies. For this we use the five-variable case as before. To determine whether each variable was “significant” we used the standard t-statistic for testing regression coefficients. For each of the variables we chose the cut-off value of 2.5, since this roughly corresponds to the cut-off value for five simultaneous tests using the Bonferroni correction at the α = 0.05 level. Hence, if the t-statistic

Discussion

While the BIC approximation of Bayes factors may be asymptotically accurate, using this approximation in the context of BMA produces unsatisfactory results. We have illustrated by example how the differences between the methods manifest themselves in practice. Inferences drawn from applications of BMA using the BIC approximation may lead to erroneous conclusions. This paper shows that researchers should be careful when using approximations for BMA. The goal of BMA is to account for the model

References (18)

  • D. Chickering et al., Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables, Mach. Learning (1997)
  • M. Clyde, Model uncertainty and health effect studies for particulate matter, Environmetrics (2000)
  • J.A. Hoeting et al., Bayesian model averaging: a tutorial, Statist. Sci. (1999)
  • R.E. Kass et al., Bayes factors, J. Amer. Statist. Assoc. (1995)
  • R. Kass et al., A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion, J. Amer. Statist. Assoc. (1995)
  • E.C. Lamon et al., Accounting for model uncertainty in prediction of chlorophyll A in lake Okeechobee, J. Agri. Biol. Environ. Statist. (2000)
  • Lipkovich, L., 2002. Bayesian model averaging and variable selection in multivariate ecological models. Unpublished...
  • D. Madigan et al., Bayesian graphical models for discrete data, Internat. Statist. Rev. (1995)
  • M. Murphy et al., Do previous birth interval and maternal education influence infant survival? A Bayesian model averaging analysis of Chinese data, Popul. Stud. (2001)