Assessment of two approximation methods for computing posterior model probabilities

https://doi.org/10.1016/j.csda.2004.01.005

Abstract

Model selection is an important problem in statistical applications. Bayesian model averaging provides an alternative to classical model selection procedures and allows researchers to consider several models from which to draw inferences. In the multiple linear regression case, it is difficult to compute the exact posterior model probabilities required for Bayesian model averaging. To reduce the computational burden, the Laplace approximation and an approximation based on the Bayesian information criterion (BIC) have been proposed. The BIC approximation is the easiest to calculate and is widely used in applications. In this paper we conduct a simulation study to determine which approximation performs better. We give an example where the methods differ, study the performance of these methods on randomly generated models and explore some of the features of the approximations. Our simulation study suggests that the Laplace approximation performs better on average than the BIC approximation.

Introduction

Model building in multiple linear regression (MLR) often requires assessing which subset of variables creates the “best” model. Methods such as stepwise, forward and backward selection, PRESS, Mallows' Cp, Akaike's information criterion and Schwarz's Bayesian information criterion have been developed to address this issue. Once a “best” model is selected, all inferences are based on this model as if it were the true model; no uncertainty is associated with the modelling process itself. In the recent literature efforts have been made to incorporate this model uncertainty into the analysis using posterior model probabilities (see Kass and Raftery, 1995; Madigan and York, 1995; Raftery, 1996; Raftery et al., 1997; Hoeting et al., 1999).

Posterior model probabilities have many uses in modelling. They can be used to select the highest probability model, to aid variable assessment and for prediction. The main justification for using any of the aforementioned methods is to formally incorporate model uncertainty into the analysis. By incorporating this uncertainty we arrive at inferences that more accurately reflect all the uncertainties associated with the analysis. This type of analysis has been growing in popularity in applications. For example, Murphy and Wang (2001) considered an infant survival application, and Viallefont et al. (2001) employed these methods in case–control studies. In addition, Clyde (2000), Lamon and Clyde (2000), Noble (2000) and Lipkovich (2002) explored ecological data sets with these methods.

In this paper we examine the performance of two methods for approximating posterior model probabilities: the Laplace approximation and the Bayesian information criterion (BIC) approximation. We focus on the multiple linear regression setting to evaluate their performance. Via simulation and example we show that the Laplace approximation is more accurate than BIC for moderate to large sample sizes. We then apply the Laplace approximation, the BIC method and the exact method to an ecological data set collected by the State of Ohio Environmental Protection Agency. We use this case study to illustrate the importance of accurate posterior model probability approximations; it suggests that the BIC method can perform relatively poorly in practice.

Computing the exact posterior model probabilities is computationally intensive. The calculation involves the inverse and determinant of an n×n matrix, where n is the sample size, and it must be repeated for each model. In the situation where we have k candidate variables, there are 2^k models to consider, so the inverse and determinant would need to be calculated 2^k times. As n grows large the computation time becomes infeasible; hence approximations are needed to reduce the computation time.
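To make the size of this enumeration concrete, the short sketch below (in Python, rather than the authors' Visual Basic implementation; names are illustrative) lists all 2^k candidate first-order models as subsets of the predictor indices. In the exact calculation, each of these subsets would require its own n×n determinant and inverse.

```python
from itertools import combinations

def all_subset_models(k):
    """Enumerate all 2**k first-order models as tuples of predictor indices;
    the empty tuple corresponds to the intercept-only model."""
    models = []
    for size in range(k + 1):
        models.extend(combinations(range(k), size))
    return models

print(len(all_subset_models(5)))  # 2**5 = 32 candidate models
```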

From experience, we have observed disagreement between the BIC and the Laplace method for calculating posterior model probabilities. Weakliem (1999) noted that the BIC approximation to Bayes factors is too conservative. Raftery (1999) stated that the BIC approximation method is a “crude” approximation, however, BIC values are readily available on standard computer output. Noble (2000) observed poor performance of the BIC method for small sample sizes and suggested corrections to improve performance. Noble's adjustment reduced the penalty associated with BIC. Noble (2000) did not directly compare the BIC method or the Laplace method to the exact probabilities. Both the BIC method and the Laplace method are asymptotically accurate methods for determining Bayes factors, which can be used to determine the posterior model probabilities (Tierney and Kadane, 1986; Raftery, 1996; Kass and Wasserman, 1995). Chickering and Heckerman (1997) consider the performance of the BIC and Laplace methods for finite mixture models and find the BIC method does not perform well in that case. Our focus is on the multiple linear regression setting. Since these are asymptotic approximations, we need to understand how these methods perform in small to moderate sample sizes in order to know when to apply each method.

In this paper, we present an extensive simulation study of the accuracy of using the BIC and Laplace approximation in the context of Bayesian model averaging (BMA). In the remainder of this section we discuss BMA, the BIC and Laplace approximations in this context. The next section contains an example where the BIC and Laplace methods lead to different conclusions when used for single model selection via the highest probability model and when used for variable assessment. In Section 3, we consider the performance for randomly generated models. We conduct a study of the accuracies in the random case by varying the number of “significant” variables in the model in Section 4.

All algorithms were coded in Microsoft Visual Basic for Applications. We use the RanDev.dll library for all pseudo-random number generation. This requires the programs to run in a PC environment. All programs are available from the authors.

Suppose we have k candidate predictor variables X1, X2, …, Xk and a single response Y. In this situation we have 2^k first-order models which can be formed from these predictors. Let M be the set of all possible models and let Mi denote the ith model in the set M. The cardinality or size of M is denoted by |M|.

Once we collect data D, we can determine for each model Mi ∈ M the posterior probability of Mi,

P(M_i \mid D) = \frac{P(M_i)\, P(D \mid M_i)}{\sum_{M_j \in M} P(M_j)\, P(D \mid M_j)}, \qquad (1)

where P(Mi) is the prior probability of model Mi and

P(D \mid M_i) = \int L(D \mid \theta_i, M_i)\, P(\theta_i \mid M_i)\, d\theta_i, \qquad (2)

with L(D | θi, Mi) being the likelihood of the data given the parameter vector θi for model Mi and P(θi | Mi) being the prior probability density for θi given model Mi. The calculation of the exact quantities in Eq. (1) is computationally intensive.
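As a minimal sketch of Eq. (1), assuming the per-model marginal likelihoods P(D|Mi) have already been computed (on the log scale, for numerical stability), the following Python function normalises prior times marginal likelihood over the model set. The function and argument names are ours, not from the paper.

```python
import numpy as np

def posterior_model_probs(log_marglik, prior=None):
    """Eq. (1): P(M_i|D) proportional to P(M_i) * P(D|M_i), normalised over models.

    log_marglik : array of log P(D|M_i), one entry per model
    prior       : array of prior model probabilities P(M_i); uniform if None
    """
    log_marglik = np.asarray(log_marglik, dtype=float)
    if prior is None:
        prior = np.full(log_marglik.shape, 1.0 / log_marglik.size)
    logw = np.log(prior) + log_marglik
    logw -= logw.max()          # guard against underflow before exponentiating
    w = np.exp(logw)
    return w / w.sum()
```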

For large model spaces, P(Mi|D) can be estimated directly by the Markov chain Monte Carlo methods used by Madigan and York (1995), Raftery et al. (1997), Noble (2000) and Lipkovich (2002). To accomplish this, a Markov chain is created in which each state is a model M. Transitions from model Mi to model Mj are governed by the acceptance probability

\alpha = \min\left\{1,\; \frac{P(M_j \mid D)}{P(M_i \mid D)}\right\}.

This method depends on the accuracy of P(Mi|D). Noble (2000) and Lipkovich (2002) use the BIC approximation to determine P(Mi|D).
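The sketch below illustrates such a sampler in Python under the common proposal that flips a single variable in or out of the current model, so that the acceptance probability reduces to min{1, P(Mj|D)/P(Mi|D)}. The function `log_post`, returning log P(M|D) up to a constant, is assumed to be supplied, for example via one of the approximations discussed below.

```python
import numpy as np

def mc3_sample(log_post, k, n_iter=10000, rng=None):
    """Metropolis sampler over model indicators gamma in {0,1}^k.

    log_post(gamma) must return log P(M_gamma | D) up to an additive constant.
    The visit frequencies estimate the posterior model probabilities."""
    rng = np.random.default_rng(rng)
    gamma = rng.integers(0, 2, size=k)               # random starting model
    current = log_post(gamma)
    visits = {}
    for _ in range(n_iter):
        proposal = gamma.copy()
        j = rng.integers(k)
        proposal[j] = 1 - proposal[j]                # add or drop one predictor
        cand = log_post(proposal)
        if np.log(rng.uniform()) < cand - current:   # accept with prob min{1, ratio}
            gamma, current = proposal, cand
        key = tuple(int(g) for g in gamma)
        visits[key] = visits.get(key, 0) + 1
    total = sum(visits.values())
    return {m: c / total for m, c in visits.items()}
```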

The accuracy of P(Mi|D), for a given data set D, is crucial to ensure that the posterior inferences are correct. In the model selection context, the model Mi with the highest value of P(Mi|D) is selected as the best model. In the model averaging context, for a quantity of interest Δ, the posterior model probabilities enter through the law of total probability:

P(\Delta \mid D) = \sum_{i=1}^{|M|} P(\Delta \mid M_i, D)\, P(M_i \mid D).
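A small illustration of how these probabilities are used, assuming per-model posterior summaries of Δ (for example, predicted means) are already available; the helper for posterior inclusion probabilities reflects the variable-assessment use mentioned earlier. All names here are ours.

```python
import numpy as np

def model_average(per_model_estimates, model_probs):
    """Posterior mean of Delta via the law of total probability:
    E[Delta | D] = sum_i E[Delta | M_i, D] * P(M_i | D)."""
    return float(np.dot(per_model_estimates, model_probs))

def inclusion_probability(models, model_probs, j):
    """P(X_j is in the model | D): sum of P(M_i | D) over models containing predictor j.
    `models` is a list of tuples of predictor indices, as in the enumeration sketch above."""
    return sum(p for m, p in zip(models, model_probs) if j in m)
```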

In the regression setting our parameter vector is θi = (βi, σ²), where βi is the regression coefficient vector with βij = 0 if Xj ∉ Mi. The prior probability density for θi needs to be proper (i.e. ∫ p(θi | Mi) dθi = 1) in order for Eq. (1) to exist.

For linear models Y = Xβ, Eq. (2) can only be determined analytically for a few special prior distributions. If we cannot obtain (2) analytically, approximations are needed or numerical integration can be employed. In the special case of a normal error regression model, the probability expressed in (2) can be determined analytically when a normal prior is used for the β's and a gamma prior for 1/σ² (see Raiffa and Schlaifer, 1961). Suppose for each model Mi the normal-gamma prior is used, i.e. βi is normally distributed with some mean μi and variance Vi, and σi² ∼ λν/χ²_ν, where λ is the mean and ν represents the degrees of freedom. In this situation we can determine (2) by the following:

P(D \mid \mu_i, V_i, \nu, X_i, M_i) = \frac{\Gamma((\nu+n)/2)\,(\nu\lambda)^{\nu/2}}{\pi^{n/2}\,\Gamma(\nu/2)\,\left|I + X_i V_i X_i'\right|^{1/2}} \left[\lambda\nu + (Y - X_i\mu_i)'(I + X_i V_i X_i')^{-1}(Y - X_i\mu_i)\right]^{-(\nu+n)/2},

where Xi is the corresponding design matrix for model Mi, and Γ(·) is the gamma function. For moderate to large model spaces or large sample sizes this method is computationally intensive, since the determinant and the inverse of the n×n matrix I + Xi Vi Xi′ must be computed for each model. This limits its usability in large sample size problems.
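The sketch below evaluates this closed form on the log scale, assuming the prior is exactly as stated (βi normal with mean μi and variance Vi, σ² ∼ λν/χ²_ν). It builds the n×n matrix I + Xi Vi Xi′ explicitly, which is precisely the step that makes the exact method expensive for large n; function and argument names are ours.

```python
import numpy as np
from scipy.special import gammaln

def log_exact_marglik(y, X, mu, V, lam, nu):
    """Log of P(D | mu, V, nu, X, M) under the normal-gamma prior
    (Raiffa and Schlaifer, 1961), following the closed form above."""
    n = y.shape[0]
    S = np.eye(n) + X @ V @ X.T                 # the n x n matrix I + X V X'
    _, logdet = np.linalg.slogdet(S)
    resid = y - X @ mu
    quad = resid @ np.linalg.solve(S, resid)    # (Y - X mu)' S^{-1} (Y - X mu)
    return (gammaln((nu + n) / 2.0) - gammaln(nu / 2.0)
            + 0.5 * nu * np.log(nu * lam)
            - 0.5 * n * np.log(np.pi)
            - 0.5 * logdet
            - 0.5 * (nu + n) * np.log(nu * lam + quad))
```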

Tierney and Kadane (1986) proposed using the Laplace approximation to evaluate (2) and showed that this approximation is of order O(1/n). The Laplace approximation for a unimodal function p(θ) is given by

\int p(\theta)\, d\theta \approx p(\hat\theta)\,(2\pi)^{d/2}\,\left|I(\hat\theta)^{-1}\right|^{1/2},

where θ̂ is the mode of the function p(θ), d is the dimensionality of θ and I(θ̂) is the observed information matrix evaluated at θ̂. Standard statistical software cannot readily calculate the posterior mode, nor the information matrix of a posterior distribution. However, the mode θ̂ can easily be calculated using methods such as Newton–Raphson, Fisher scoring, steepest ascent or combinations of these.
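A generic sketch of this approximation in Python: the mode is located numerically (here with BFGS rather than the Newton–Raphson, scoring or steepest-ascent options mentioned above) and the observed information is taken as a finite-difference Hessian of −log p at the mode. The unnormalised log density `log_p` and the starting value are assumed to be supplied by the user.

```python
import numpy as np
from scipy.optimize import minimize

def log_laplace_integral(log_p, theta0, eps=1e-5):
    """Laplace approximation to log(integral of p(theta) d theta):
    log p(theta_hat) + (d/2) log(2 pi) - (1/2) log |I(theta_hat)|."""
    def neg(t):
        return -log_p(t)
    theta_hat = minimize(neg, np.asarray(theta0, dtype=float), method="BFGS").x
    d = theta_hat.size

    # central finite-difference Hessian of -log p (the observed information)
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            tpp = theta_hat.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = theta_hat.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = theta_hat.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = theta_hat.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (neg(tpp) - neg(tpm) - neg(tmp) + neg(tmm)) / (4.0 * eps ** 2)

    _, logdet_info = np.linalg.slogdet(H)
    return log_p(theta_hat) + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet_info
```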

Raftery (1996), Kass and Raftery (1995) and Kass and Wasserman (1995) suggest approximating (2) using a function of the BIC developed by Schwarz (1978):

P(D \mid M_i) \approx e^{\mathrm{BIC}_i/2},

where BIC_i = 2\ln p(D \mid \hat\theta_i, M_i) - d\ln(n) is the Bayesian information criterion for model Mi and d is the dimension of the model (Schwarz, 1978). Raftery (1996) and Kass and Wasserman (1995) showed that this approximation is asymptotically accurate for computing a Bayes factor for nested hypotheses and is of order O(1) under the unit-information prior distribution on β. A unit information prior is one in which the prior distribution for β contains the same amount of information about β as is available in a single observation. In the normal case, for testing a nested null hypothesis H0: φ = φ0, the unit information prior is φ ∼ N(φ0, Σφ) with Σφ^{-1} = Iφφ(β, φ0), where Iφφ(β, φ0) denotes the sub-matrix of the Fisher information matrix corresponding to φ in the restricted likelihood. The popularity of this approximation is due to the fact that it can be calculated from the BIC values conveniently available in standard statistical software output. Extensions of this approximation to multivariate analysis have been studied by Noble (2000) for principal components analysis; Lipkovich (2002) used this method for canonical correspondence analysis and cluster analysis.
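For the normal-error regression models considered here, the BIC approximation needs only ordinary least squares quantities. The sketch below computes log P(D|Mi) ≈ BIC_i/2 from the maximised Gaussian log-likelihood; counting the error variance in the dimension d is one common convention and, like the variable names, is our choice rather than the paper's.

```python
import numpy as np

def log_bic_marglik(y, X):
    """log P(D | M_i) approximated by BIC_i / 2, where
    BIC_i = 2 * (max log-likelihood) - d * log(n) for the Gaussian linear model."""
    n = y.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    loglik = -0.5 * n * (np.log(2.0 * np.pi) + np.log(rss / n) + 1.0)  # MLE sigma^2 = RSS/n
    d = X.shape[1] + 1                      # regression coefficients plus sigma^2
    return 0.5 * (2.0 * loglik - d * np.log(n))
```

These per-model values can be passed to the normalisation sketch after Eq. (1) to give approximate posterior model probabilities.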

As mentioned earlier, the BIC and Laplace methods are asymptotically valid for determining Bayes factors. However, although both methods are asymptotically correct, we usually do not know the sample size required for an accurate approximation. We are interested in which method yields better quality results under the same conditions. To measure the accuracy of the approximations three measures are used: weighted L1 and L2 distances and the Hellinger distance. The L2 distance of the probability function P1 relative to the probability function P2 is given by

L_2(P_1, P_2) = \left\{ \sum_{M_i \in M} \left[P_1(M_i \mid D) - P_2(M_i \mid D)\right]^2 P_2(M_i \mid D) \right\}^{1/2}.

The L1 distance is given by

L_1(P_1, P_2) = \sum_{M_i \in M} \left|P_1(M_i \mid D) - P_2(M_i \mid D)\right| P_2(M_i \mid D).

Finally, the discrete Hellinger distance is given by

d_H(P_1, P_2) = \left\{ 2 - 2 \sum_{M_i \in M} \left[P_1(M_i \mid D)\, P_2(M_i \mid D)\right]^{1/2} \right\}^{1/2}.

The weighted L1 and L2 measures give more importance to, and have greater accuracy for, models of high probability with respect to P2. We also chose the Hellinger distance in order to include a non-weighted distance measure. In many cases P(Mi|D) ≈ 0, so the discrete analogue of the Kullback–Leibler distance is not appropriate. For all of these distance measures, values closer to zero correspond to a smaller distance between distributions. The L1 and L2 distances are bounded between 0 and 1 and the Hellinger distance is bounded between 0 and 2. To evaluate the methods, we deem distances less than 0.05 to be acceptable for L1 and L2 and less than 0.2 for the Hellinger distance. These values correspond to 95% of the total probability being assigned correctly.
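The three accuracy measures translate directly into code; in this sketch `p1` and `p2` are arrays of posterior model probabilities over the same ordering of models, with `p2` playing the role of the reference (exact) distribution.

```python
import numpy as np

def weighted_l2(p1, p2):
    """Weighted L2 distance of p1 relative to the reference p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sqrt(np.sum((p1 - p2) ** 2 * p2)))

def weighted_l1(p1, p2):
    """Weighted L1 distance of p1 relative to the reference p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sum(np.abs(p1 - p2) * p2))

def hellinger(p1, p2):
    """Discrete Hellinger distance between p1 and p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sqrt(2.0 - 2.0 * np.sum(np.sqrt(p1 * p2))))
```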

Section snippets

Example

To illustrate the importance of the accuracy of both methods, we employ each method on an environmental data set collected in Ohio. In this section we examine the highest probability models and the variable assessments from each of the methods on a biological data set provided by the Ohio Environmental Protection Agency (EPA). The Ohio EPA is interested in the health of the fish living in the streams and rivers of Ohio and how this is affected by environmental stress. It is especially important how

Random models

To understand how the methods perform in general we conducted the following simulation study. We allowed both β and σ² to be random. We used the 5-regressor case with the Xij i.i.d. N(0,1) for i = 1,…,5 and j = 1,…,n. We used the following distributions to sample the parameter values: βi i.i.d. N(0,4) and σ² ∼ χ²(1). To sample the random error εj we first sampled σ² and then sampled εj i.i.d. N(0, σ²).
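A sketch of one replicate of this data-generating step (five regressors, βi ∼ N(0,4), σ² ∼ χ²(1), errors N(0, σ²)); the sample size n and the seed are our choices for illustration.

```python
import numpy as np

def simulate_dataset(n, k=5, rng=None):
    """One simulated data set under the random-model design:
    X_ij ~ N(0,1), beta_i ~ N(0,4), sigma^2 ~ chi^2(1), eps_j ~ N(0, sigma^2)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, k))
    beta = rng.normal(0.0, 2.0, size=k)      # standard deviation 2, i.e. variance 4
    sigma2 = rng.chisquare(1)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = X @ beta + eps
    return X, y, beta, sigma2

X, y, beta, sigma2 = simulate_dataset(n=50, rng=0)   # one replicate with n = 50
```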

We again used a normal-gamma prior distribution for the regression parameters. For β we used βi i.i.d. N(0,10),

Exploration

We also wish to understand how each method performs when the number of “significant” variables varies. For this we use the five-variable case as before. To determine whether each variable was “significant” we used the standard t-statistic for testing regression coefficients. For each of the variables we chose the cut-off value of 2.5, since this roughly corresponds to the cut-off value for five simultaneous tests using the Bonferroni correction at the α = 0.05 level. Hence, if the t-statistic

Discussion

While the BIC approximation of Bayes factors may be asymptotically accurate, using this approximation in the context of BMA produces unsatisfactory results. We have illustrated by example how the differences between the methods manifest themselves in practice. Inferences drawn from applications of BMA using the BIC approximation may lead to erroneous conclusions. This paper shows that researchers should be careful when using approximations for BMA. The goal of BMA is to account for the model

References (18)

  • D. Chickering et al., Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables, Mach. Learning (1997)
  • M. Clyde, Model uncertainty and health effect studies for particulate matter, Environmetrics (2000)
  • J.A. Hoeting et al., Bayesian model averaging: a tutorial, Statist. Sci. (1999)
  • R.E. Kass et al., Bayes factors, J. Amer. Statist. Assoc. (1995)
  • R. Kass et al., A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion, J. Amer. Statist. Assoc. (1995)
  • E.C. Lamon et al., Accounting for model uncertainty in prediction of chlorophyll A in lake Okeechobee, J. Agri. Biol. Environ. Statist. (2000)
  • Lipkovich, L., 2002. Bayesian model averaging and variable selection in multivariate ecological models. Unpublished...
  • D. Madigan et al., Bayesian graphical models for discrete data, Internat. Statist. Rev. (1995)
  • M. Murphy et al., Do previous birth interval and maternal education influence infant survival? A Bayesian model averaging analysis of Chinese data, Popul. Stud. (2001)