Bootstrap hypothesis testing for some common statistical problems: A critical evaluation of size and power properties

https://doi.org/10.1016/j.csda.2007.01.020

Abstract

The construction of bootstrap hypothesis tests can differ from that of bootstrap confidence intervals because of the need to generate the bootstrap distribution of test statistics under a specific null hypothesis. Similarly, bootstrap power calculations rely on resampling being carried out under specific alternatives. We describe and develop null and alternative resampling schemes for common scenarios, constructing bootstrap tests for the correlation coefficient, variance, and regression/ANOVA models. Bootstrap power calculations for these scenarios are described. In some cases, null-resampling bootstrap tests are equivalent to tests based on appropriately constructed bootstrap confidence intervals. In other cases, particularly those for which simple percentile-method bootstrap intervals are in routine use such as the correlation coefficient, null-resampling tests differ from interval-based tests. We critically assess the performance of bootstrap tests, examining size and power properties of the tests numerically using both real and simulated data. Where they differ from tests based on bootstrap confidence intervals, null-resampling tests have reasonable size properties, outperforming tests based on bootstrapping without regard to the null hypothesis. The bootstrap tests also have reasonable power properties.

Introduction

Since its introduction by Efron (1979a), the bootstrap has been widely used to provide computationally intensive alternatives to many standard statistical procedures, without the need for restrictive parametric assumptions. Two important statistical activities for which bootstrap approaches have been extensively developed are confidence interval construction and hypothesis testing, with coverage accuracy and size, respectively, being the primary foci of investigations. Moreover, close links between confidence intervals and hypothesis tests have meant that advances in bootstrap techniques for one application have often led to benefits for the other.

There is a vast literature on bootstrap confidence intervals, summarized well in papers by Hall (1988) and DiCiccio and Efron (1996) among many others, as well as in sections of the books by Hall (1992), Efron and Tibshirani (1993), Shao and Tu (1995), and Davison and Hinkley (1997). Bootstrap hypothesis testing has also received considerable, although somewhat less, attention, with early work by Hinkley (1988, 1989), Young (1988), Fisher and Hall (1990) and Hall and Wilson (1991) identifying and exploring clear issues and directions for bootstrap testing that distinguish that setting from the confidence interval context. Some of the origins of bootstrap hypothesis testing owe much to an older idea, that of randomization or permutation tests. Noreen (1989) gives a good overview of these methods and relates them to bootstrap testing. The papers by Romano (1988, 1989) are early references to theoretical development of bootstrap tests, particularly their relationship with randomization procedures. In their seminal monograph, Westfall and Young (1993) summarized and extended bootstrap testing theory to include multiple testing scenarios, which include single bootstrap tests as a special case, moreover providing powerful SAS-based software, PROC MULTTEST, for implementing bootstrap tests in a wide range of situations. More recent work in the multiple testing context has also been carried out by Romano and Wolf (2003a, 2003b) and Pollard and van der Laan (2004), in which resampling under specific hypotheses is proposed. Research into bootstrap hypothesis testing has been an area of very active interest in econometrics, with work by Horowitz (2001, 2003) prominent, along with a suite of papers by Davidson (2000) and Davidson and MacKinnon (1999a, 1999b, 2000, 2004, 2006) which explore many aspects of bootstrap hypothesis tests used in econometric applications, as well as more generally. Bootstrap testing is also covered in some detail in the monographs by Efron and Tibshirani (1993), Davison and Hinkley (1997), Chernick (1999) and Good (2000).

Work on bootstrap confidence intervals is obviously linked to that on bootstrap hypothesis tests because of the well-known duality between confidence intervals and hypothesis tests—de facto bootstrap hypothesis tests can be generated by first forming a bootstrap confidence interval, and using its complement as a rejection region for an associated hypothesis test. As a result, work on developing accurate bootstrap confidence intervals can be viewed, indirectly at least, as an avenue to accurate bootstrap tests, and so research into bootstrap confidence intervals has had the laudable dual benefit of informing and improving bootstrap testing procedures. Of course, the duality between intervals and tests is strongly related to the use of (approximate) pivots in their construction, and, for example, percentile method bootstrap intervals do not generally have this feature. Moreover, while the processes of constructing confidence intervals and conducting hypothesis tests are closely associated, key differences between the two make it important to consider bootstrap hypothesis testing separately from bootstrap confidence intervals. The very language of hypothesis testing makes reference to two types of errors: those occurring when the null hypothesis is true; and those occurring when an alternative hypothesis holds. Therefore, when constructing tests of a certain size, it is fundamental to the procedure that the distribution of the test statistic assuming that the null hypothesis holds be considered. Yet, if naïve bootstrap tests are constructed using tests based on, say, a percentile-method confidence interval, the bootstrap distribution generated is not restricted by any pre-conceived hypothesis—it is constrained only by the data itself. As a result, in cases where the data do not reflect the null hypothesis—which itself may still reflect the truth—the bootstrap distribution generated for constructing a confidence interval may be inadequate for constructing a bootstrap test of specified size. This point has been well-recognized in the literature on bootstrap tests, with Fisher and Hall (1990) and, later, Hall and Wilson (1991) explicitly recommending that resampling for bootstrap testing should be conducted to adequately reflect the null hypothesis. Hall and Wilson (1991) further recommend that bootstrap tests should be based on statistics that are close to pivotal. The importance of using (asymptotically) pivotal quantities in bootstrap inference is familiar from much of the research on bootstrap confidence intervals—see Hall, 1988, Hall, 1992—and so it is unsurprising that the same advice should apply to testing. Indeed, given that the use of pivots is fundamental to the duality argument between intervals and testing, this recommendation seems both natural and sensible. The first recommendation—resampling under the null—is certainly appropriate, but its application usually needs careful thought as to how such resampling might actually be conducted.

Following Hall and Wilson (1991), many authors such as Westfall and Young (1993) have promoted null resampling as critical to the proper construction of bootstrap tests. Davison and Hinkley (1997) discuss a general approach to bootstrap testing, first outlining how resampling under the null might be conducted in a variety of circumstances, then developing fully nonparametric null models from which resampling can be carried out when no obvious simple null model exists. Davison and Hinkley's ideas are evocative, and provide a theoretical framework from which bootstrap tests can be developed in a wide range of circumstances. But in many common, practical settings it is possible to develop null-resampling schemes that are much simpler to implement, and which make bootstrap tests readily available. Along with Hall and Wilson's (1991) two recommendations, a third principle guiding the development of the null-resampling schemes described in this paper is to construct null data from which to resample in such a way that as many properties of the underlying data as possible, such as nuisance parameter estimates, appropriately reflect the model assumptions underlying the testing situation being considered.

We also consider here the power of bootstrap tests, estimated using resampling under the alternative hypothesis. Power of bootstrap tests has not been a focus of previous studies of bootstrap testing. We outline the construction of bootstrap power estimates, and compare them with power calculations based on large sample theory as well as with simulated power values under specific parent distributions. These power estimates provide a practical way for practitioners to assess the power of the tests they are using in the context of the data they have. The bootstrap power calculations are similar to bootstrap calculation of p-values except that resamples are generated in a manner consistent with a particular alternative hypothesis. Power is then estimated as the proportion among such resamples for which the associated test statistic is more extreme than the critical value estimated using null resampling.
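As a concrete illustration of this calculation, the sketch below (in R, one of the languages used for the paper's computations) estimates power in exactly this way. The helper names boot_power, resample_null and resample_alt are hypothetical placeholders rather than the paper's code: resample_null and resample_alt stand for whatever scheme generates resamples consistent with the null and with a chosen alternative in a given problem.

# Sketch of a bootstrap power estimate (hypothetical helper names, not the authors' code).
# 'statistic' maps a data set to the test statistic; 'resample_null' and 'resample_alt'
# generate resamples consistent with H0 and with a chosen alternative, respectively.
boot_power <- function(x, statistic, resample_null, resample_alt,
                       B = 999, alpha = 0.05) {
  # Null resampling: estimate the critical value of the test statistic
  t_null <- replicate(B, statistic(resample_null(x)))
  crit   <- quantile(t_null, probs = 1 - alpha)
  # Alternative resampling: proportion of resamples exceeding the critical value
  t_alt  <- replicate(B, statistic(resample_alt(x)))
  mean(t_alt > crit)
}

The sketch is written for an upper-tailed test; a two-sided version applies the same idea to the absolute value of the statistic.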

This paper has two main goals. The first is the description and development of simple methods for resampling under a relevant hypothesis in a range of common applications. Some simple applications are already well developed: the bootstrap test for a mean is described by both Efron and Tibshirani (1993) and Davison and Hinkley (1997). In this case, the null-resampling tests are equivalent to tests based on the construction of a percentile-t bootstrap confidence interval. However, in other important cases such as the correlation coefficient or a ratio of means, situations for which percentile-method bootstrap confidence intervals are routinely used, tests based on null resampling differ from interval-based tests, and can have superior size and power properties. Some cases, such as testing for a difference in means, are developed in a similar way to permutation procedures. Other common cases, such as bootstrap testing in ANOVA and regression settings, are also described and explored.
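For the mean, the null-resampling recipe referenced above (Efron and Tibshirani, 1993; Davison and Hinkley, 1997) shifts the data so that the null hypothesis holds exactly in the resampling population and bootstraps the studentized statistic. A minimal R sketch, offered as an illustration rather than the paper's own code, is:

# Null-resampling bootstrap test of H0: mu = mu0 for a mean,
# based on the studentized (percentile-t) statistic.
boot_mean_test <- function(x, mu0, B = 999) {
  n     <- length(x)
  t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # observed statistic
  x0    <- x - mean(x) + mu0                     # shifted data: null holds exactly
  t_star <- replicate(B, {
    xs <- sample(x0, n, replace = TRUE)
    (mean(xs) - mu0) / (sd(xs) / sqrt(n))
  })
  mean(abs(t_star) >= abs(t_obs))                # two-sided bootstrap p-value
}

# Illustrative call with simulated data
set.seed(1)
boot_mean_test(rnorm(30, mean = 0.4), mu0 = 0)

Because the statistic is studentized, rejecting when this p-value falls below the nominal level agrees with rejecting when mu0 lies outside the corresponding percentile-t confidence interval, which is the equivalence noted above.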

The second goal of the paper is to provide a critical assessment of size and, particularly, power properties of bootstrap tests. Such an assessment is critical if bootstrap tests are to attract routine use in data analyses. These properties are examined using both real data and a simulation study in Section 3. The results are encouraging regarding the use of the bootstrap for hypothesis testing. First, the real data examples show that resampling under the null can yield materially different results than naïve resampling, especially in cases where the data are “far” from the null hypothesis. Put simply, if the null hypothesis is ignored in resampling for bootstrap tests it can make a material difference to the observed level of the test. Where null-resampling-based tests are equivalent to bootstrap confidence interval based tests, the size properties of the tests are reflected by the coverage properties of the associated intervals, an issue well examined in the literature. The simulation results raise some other key points. First, the sizes of the bootstrap tests proposed are, in general, very good. The proposed bootstrap tests appear to perform better when the data have been cleaned of outliers, a feature common to many statistical procedures. Nevertheless, for reasonably well-behaved data, bootstrap tests based on null-resampling appear to have slightly better size properties than interval-based tests, in cases where the two methods differ. The study of power properties of the tests is also encouraging. Comparisons of bootstrap power estimates based on resampling under the alternative with simulated power of both null-resampling and interval-based tests revealed that the bootstrap estimates of power for null-resampling tests fell close to their simulated power values, and that null-resampling tests generally had better power properties than interval-based tests in cases where the two methods differed. In many cases, the null-resampling tests had power close to theoretical ideal power based on large-sample theory.

Section snippets

General framework

Consider testing for a parameter θ, H0: θ = θ0, using data (possibly multivariate) X = {X1, …, Xn}. Denote the test statistic by gθ0(X), where the notation reflects the usual dependence of the test statistic on both the data and the hypothesized value, and denote the null distribution of gθ0(X) by Fθ0. Similarly, the distribution of gθ0(X) under a specific alternative θA is denoted FθA. Generally, null and alternative distributions of gθ0(X) are unavailable unless the underlying distribution of X is…

Numerical results and discussion

The performance of bootstrap hypothesis tests in terms of size and power was examined in several settings, both by simulation and using real data. All calculations and simulations described were carried out using the S-Plus and R statistical computing languages.

Example 3.1 Efron's law school data

One of the first data sets used to illustrate bootstrap methodology was Efron's law school data, relating average LSAT (results of an aptitude test) and GPA (undergraduate grade point average) scores for entry into 15 U.S. law schools.
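The law school data are available in R through add-on packages, for instance as the 15-row data frame law (columns LSAT and GPA) in the bootstrap package; that location is an assumption of the sketch below. One simple way to resample under H0: ρ = 0, offered as an illustration rather than necessarily the scheme developed in the paper, is to resample the two variables independently so that the resampled data satisfy the null:

# Bootstrap test of H0: rho = 0 for the correlation coefficient,
# illustrated on Efron's law school data (assumes the 'bootstrap' package).
library(bootstrap)                      # assumed to provide the 'law' data frame
r_obs <- cor(law$LSAT, law$GPA)         # observed correlation
n     <- nrow(law)
B     <- 4999
# Independent resampling of LSAT and GPA breaks their association,
# so each resample is consistent with the null of zero correlation.
r_star <- replicate(B, cor(sample(law$LSAT, n, replace = TRUE),
                           sample(law$GPA,  n, replace = TRUE)))
mean(abs(r_star) >= abs(r_obs))         # two-sided bootstrap p-value

By contrast, a naive test based on a percentile interval for the correlation resamples (LSAT, GPA) pairs without any null constraint, which is the distinction emphasized throughout the paper.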

References

  • Chernick, M., 1999. Bootstrap Methods: A Practitioner's Guide.
  • Davidson, R., 2000. Bootstrap confidence intervals based on inverting hypothesis tests. Document de travail, GREQAM...
  • Davidson, R., MacKinnon, J.G., 1996. The power of bootstrap tests. Discussion paper, Queen's University, Kingston,...
  • Davidson, R., MacKinnon, J.G., 1999. The size distortion of bootstrap tests. Econometric Theory.
  • Davidson, R., MacKinnon, J.G., 1999. Bootstrap testing in non-linear models. Internat. Econom. Rev.
  • Davidson, R., MacKinnon, J.G., 2000. Bootstrap tests: how many bootstraps? Econometric Rev.
  • Davidson, R., MacKinnon, J.G., 2004. Econometric Theory and Methods.
  • Davidson, R., MacKinnon, J.G., 2006. The power of bootstrap and asymptotic tests. J. Econometrics.
  • Davison, A.C., Hinkley, D.V., 1997. Bootstrap Methods and their Application.
  • DiCiccio, T.J., Efron, B., 1996. Bootstrap confidence intervals (with discussion). Statist. Sci.
  • Efron, B., 1979a. Bootstrap methods: another look at the jackknife. Ann. Statist.
  • Efron, B., 1979b. Computers and the theory of statistics: thinking the unthinkable. SIAM Rev.
  • Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap.
  • Fisher, N.I., Hall, P., 1990. On bootstrap hypothesis testing. Austral. J. Statist.
  • Good, P., 2000. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses.
  • Hall, P., 1986. On the number of bootstrap simulations required to construct a confidence interval. Ann. Statist.
  • Horowitz, J.L., 2001. The bootstrap and hypothesis tests in econometrics. J. Econometrics.