Bootstrap hypothesis testing for some common statistical problems: A critical evaluation of size and power properties

https://doi.org/10.1016/j.csda.2007.01.020

Abstract

The construction of bootstrap hypothesis tests can differ from that of bootstrap confidence intervals because of the need to generate the bootstrap distribution of test statistics under a specific null hypothesis. Similarly, bootstrap power calculations rely on resampling being carried out under specific alternatives. We describe and develop null and alternative resampling schemes for common scenarios, constructing bootstrap tests for the correlation coefficient, variance, and regression/ANOVA models. Bootstrap power calculations for these scenarios are described. In some cases, null-resampling bootstrap tests are equivalent to tests based on appropriately constructed bootstrap confidence intervals. In other cases, particularly those for which simple percentile-method bootstrap intervals are in routine use such as the correlation coefficient, null-resampling tests differ from interval-based tests. We critically assess the performance of bootstrap tests, examining size and power properties of the tests numerically using both real and simulated data. Where they differ from tests based on bootstrap confidence intervals, null-resampling tests have reasonable size properties, outperforming tests based on bootstrapping without regard to the null hypothesis. The bootstrap tests also have reasonable power properties.

Introduction

Since its introduction by Efron (1979a), the bootstrap has been widely used to provide computationally intensive alternatives to many standard statistical procedures, without the need for restrictive parametric assumptions. Two important statistical activities for which bootstrap approaches have been extensively developed are confidence interval construction and hypothesis testing, with coverage accuracy and size, respectively, being the primary foci of investigations. Moreover, close links between confidence intervals and hypothesis tests have meant that advances in bootstrap techniques for one application have often led to benefits for the other.

There is a vast literature on bootstrap confidence intervals, summarized well in papers by Hall (1988) and DiCiccio and Efron (1996) among many others, as well as in sections of the books by Hall (1992), Efron and Tibshirani (1993), Shao and Tu (1995), and Davison and Hinkley (1997). Bootstrap hypothesis testing has also received considerable, although somewhat less, attention, with early work by Hinkley (1988, 1989), Young (1988), Fisher and Hall (1990) and Hall and Wilson (1991) identifying and exploring clear issues and directions for bootstrap testing that distinguish that setting from the confidence interval context. Some of the origins of bootstrap hypothesis testing owe much to an older idea, that of randomization or permutation tests. Noreen (1989) gives a good overview of these methods and relates them to bootstrap testing. The papers by Romano (1988, 1989) are early references to theoretical development of bootstrap tests, particularly their relationship with randomization procedures. In their seminal monograph, Westfall and Young (1993) summarized and extended bootstrap testing theory to include multiple testing scenarios, which include single bootstrap tests as a special case, moreover providing powerful SAS-based software, PROC MULTTEST, for implementing bootstrap tests in a wide range of situations. More recent work in the multiple testing context has also been carried out by Romano and Wolf (2003a, 2003b) and Pollard and van der Laan (2004), in which resampling under specific hypotheses is proposed. Research into bootstrap hypothesis testing has been an area of very active interest in econometrics, with work by Horowitz (2001, 2003) prominent, along with a suite of papers by Davidson (2000) and Davidson and MacKinnon (1999a, 1999b, 2000, 2004, 2006) which explore many aspects of bootstrap hypothesis tests used in econometric applications, as well as more generally. Bootstrap testing is also covered in some detail in the monographs by Efron and Tibshirani (1993), Davison and Hinkley (1997), Chernick (1999) and Good (2000).

Work on bootstrap confidence intervals is obviously linked to that on bootstrap hypothesis tests because of the well-known duality between confidence intervals and hypothesis tests—de facto bootstrap hypothesis tests can be generated by first forming a bootstrap confidence interval, and using its complement as a rejection region for an associated hypothesis test. As a result, work on developing accurate bootstrap confidence intervals can be viewed, indirectly at least, as an avenue to accurate bootstrap tests, and so research into bootstrap confidence intervals has had the laudable dual benefit of informing and improving bootstrap testing procedures. Of course, the duality between intervals and tests is strongly related to the use of (approximate) pivots in their construction, and, for example, percentile method bootstrap intervals do not generally have this feature. Moreover, while the processes of constructing confidence intervals and conducting hypothesis tests are closely associated, key differences between the two make it important to consider bootstrap hypothesis testing separately from bootstrap confidence intervals. The very language of hypothesis testing makes reference to two types of errors: those occurring when the null hypothesis is true; and those occurring when an alternative hypothesis holds. Therefore, when constructing tests of a certain size, it is fundamental to the procedure that the distribution of the test statistic assuming that the null hypothesis holds be considered. Yet, if naïve bootstrap tests are constructed using tests based on, say, a percentile-method confidence interval, the bootstrap distribution generated is not restricted by any pre-conceived hypothesis—it is constrained only by the data itself. As a result, in cases where the data do not reflect the null hypothesis—which itself may still reflect the truth—the bootstrap distribution generated for constructing a confidence interval may be inadequate for constructing a bootstrap test of specified size. This point has been well-recognized in the literature on bootstrap tests, with Fisher and Hall (1990) and, later, Hall and Wilson (1991) explicitly recommending that resampling for bootstrap testing should be conducted to adequately reflect the null hypothesis. Hall and Wilson (1991) further recommend that bootstrap tests should be based on statistics that are close to pivotal. The importance of using (asymptotically) pivotal quantities in bootstrap inference is familiar from much of the research on bootstrap confidence intervals—see Hall, 1988, Hall, 1992—and so it is unsurprising that the same advice should apply to testing. Indeed, given that the use of pivots is fundamental to the duality argument between intervals and testing, this recommendation seems both natural and sensible. The first recommendation—resampling under the null—is certainly appropriate, but its application usually needs careful thought as to how such resampling might actually be conducted.

Following Hall and Wilson (1991), many authors such as Westfall and Young (1993) have promoted null resampling as critical to the proper construction of bootstrap tests. Davison and Hinkley (1997) discuss a general approach to bootstrap testing, first outlining how resampling under the null might be conducted in a variety of circumstances, then developing fully nonparametric null models from which resampling can be carried out when no obvious simple null model exists. Davison and Hinkley's ideas are evocative, and provide a theoretical framework from which bootstrap tests can be developed in a wide range of circumstances. But in many common, practical settings it is possible to develop null-resampling schemes that are much simpler to implement, and which make bootstrap tests readily available. Along with Hall and Wilson's (1991) two recommendations, a third principle guiding the development of the null-resampling schemes described in this paper is to construct null data from which to resample in such a way that as many properties of the underlying data as possible, such as nuisance parameter estimates, appropriately reflect the model assumptions underlying the testing situation being considered.

We also consider here the power of bootstrap tests, estimated using resampling under the alternative hypothesis. Power of bootstrap tests has not been a focus of previous studies of bootstrap testing. We outline the construction of bootstrap power estimates, and compare them with power calculations based on large sample theory as well as with simulated power values under specific parent distributions. These power estimates provide a practical way for practitioners to assess the power of the tests they are using in the context of the data they have. The bootstrap power calculations are similar to bootstrap calculation of p-values except that resamples are generated in a manner consistent with a particular alternative hypothesis. Power is then estimated as the proportion among such resamples for which the associated test statistic is more extreme than the critical value estimated using null resampling.
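As a concrete illustration of this calculation, the sketch below (in R, one of the languages used for the paper's computations) estimates power in exactly this way. The helper names boot_power, resample_null and resample_alt are hypothetical placeholders rather than the paper's code: resample_null and resample_alt stand for whatever scheme generates resamples consistent with the null and with a chosen alternative in a given problem.

# Sketch of a bootstrap power estimate (hypothetical helper names, not the authors' code).
# 'statistic' maps a data set to the test statistic; 'resample_null' and 'resample_alt'
# generate resamples consistent with H0 and with a chosen alternative, respectively.
boot_power <- function(x, statistic, resample_null, resample_alt,
                       B = 999, alpha = 0.05) {
  # Null resampling: estimate the critical value of the test statistic
  t_null <- replicate(B, statistic(resample_null(x)))
  crit   <- quantile(t_null, probs = 1 - alpha)
  # Alternative resampling: proportion of resamples exceeding the critical value
  t_alt  <- replicate(B, statistic(resample_alt(x)))
  mean(t_alt > crit)
}

The sketch is written for an upper-tailed test; a two-sided version applies the same idea to the absolute value of the statistic.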

This paper has two main goals. The first is the description and development of simple methods for resampling under a relevant hypothesis in a range of common applications. Some simple applications are already well developed: the bootstrap test for a mean is described by both Efron and Tibshirani (1993) and Davison and Hinkley (1997). In this case, the null-resampling tests are equivalent to tests based on the construction of a percentile-t bootstrap confidence interval. However, in other important cases such as the correlation coefficient or a ratio of means, situations for which percentile-method bootstrap confidence intervals are routinely used, tests based on null resampling differ from interval-based tests, and can have superior size and power properties. Some cases, such as testing for a difference in means, are developed in a similar way to permutation procedures. Other common cases, such as bootstrap testing in ANOVA and regression settings, are also described and explored.
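For the mean, the null-resampling recipe referenced above (Efron and Tibshirani, 1993; Davison and Hinkley, 1997) shifts the data so that the null hypothesis holds exactly in the resampling population and bootstraps the studentized statistic. A minimal R sketch, offered as an illustration rather than the paper's own code, is:

# Null-resampling bootstrap test of H0: mu = mu0 for a mean,
# based on the studentized (percentile-t) statistic.
boot_mean_test <- function(x, mu0, B = 999) {
  n     <- length(x)
  t_obs <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # observed statistic
  x0    <- x - mean(x) + mu0                     # shifted data: null holds exactly
  t_star <- replicate(B, {
    xs <- sample(x0, n, replace = TRUE)
    (mean(xs) - mu0) / (sd(xs) / sqrt(n))
  })
  mean(abs(t_star) >= abs(t_obs))                # two-sided bootstrap p-value
}

# Illustrative call with simulated data
set.seed(1)
boot_mean_test(rnorm(30, mean = 0.4), mu0 = 0)

Because the statistic is studentized, rejecting when this p-value falls below the nominal level agrees with rejecting when mu0 lies outside the corresponding percentile-t confidence interval, which is the equivalence noted above.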

The second goal of the paper is to provide a critical assessment of size and, particularly, power properties of bootstrap tests. Such an assessment is critical if bootstrap tests are to attract routine use in data analyses. These properties are examined using both real data and a simulation study in Section 3. The results are encouraging regarding the use of the bootstrap for hypothesis testing. First, the real data examples show that resampling under the null can yield materially different results than naïve resampling, especially in cases where the data are “far” from the null hypothesis. Put simply, if the null hypothesis is ignored in resampling for bootstrap tests it can make a material difference to the observed level of the test. Where null-resampling-based tests are equivalent to bootstrap confidence interval based tests, the size properties of the tests are reflected by the coverage properties of the associated intervals, an issue well examined in the literature. The simulation results raise some other key points. First, the sizes of the bootstrap tests proposed are, in general, very good. The proposed bootstrap tests appear to perform better when the data have been cleaned of outliers, a feature common to many statistical procedures. Nevertheless, for reasonably well-behaved data, bootstrap tests based on null-resampling appear to have slightly better size properties than interval-based tests, in cases where the two methods differ. The study of power properties of the tests is also encouraging. Comparisons of bootstrap power estimates based on resampling under the alternative with simulated power of both null-resampling and interval-based tests revealed that the bootstrap estimates of power for null-resampling tests fell close to their simulated power values, and that null-resampling tests generally had better power properties than interval-based tests in cases where the two methods differed. In many cases, the null-resampling tests had power close to theoretical ideal power based on large-sample theory.

Section snippets

General framework

Consider testing for a parameter θ, H0: θ = θ0, using data (possibly multivariate) X = {X1, …, Xn}. Denote the test statistic by gθ0(X), where the notation reflects the usual dependence of the test statistic on both the data and the hypothesized value, and denote the null distribution of gθ0(X) by Fθ0. Similarly, the distribution of gθ0(X) under a specific alternative θA is denoted FθA. Generally, null and alternative distributions of gθ0(X) are unavailable unless the underlying distribution of X is…

Numerical results and discussion

The performance of bootstrap hypothesis tests in terms of size and power was examined in several settings, both by simulation and using real data. All calculations and simulations described were carried out using the S-Plus and R statistical computing languages.

Example 3.1 Efron's law school data

One of the first data sets used to illustrate bootstrap methodology was Efron's law school data, relating average LSAT (results of an aptitude test) and GPA (undergraduate grade point average) scores for entry into 15 U.S. law schools.
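The law school data are available in R through add-on packages, for instance as the 15-row data frame law (columns LSAT and GPA) in the bootstrap package; that location is an assumption of the sketch below. One simple way to resample under H0: ρ = 0, offered as an illustration rather than necessarily the scheme developed in the paper, is to resample the two variables independently so that the resampled data satisfy the null:

# Bootstrap test of H0: rho = 0 for the correlation coefficient,
# illustrated on Efron's law school data (assumes the 'bootstrap' package).
library(bootstrap)                      # assumed to provide the 'law' data frame
r_obs <- cor(law$LSAT, law$GPA)         # observed correlation
n     <- nrow(law)
B     <- 4999
# Independent resampling of LSAT and GPA breaks their association,
# so each resample is consistent with the null of zero correlation.
r_star <- replicate(B, cor(sample(law$LSAT, n, replace = TRUE),
                           sample(law$GPA,  n, replace = TRUE)))
mean(abs(r_star) >= abs(r_obs))         # two-sided bootstrap p-value

By contrast, a naive test based on a percentile interval for the correlation resamples (LSAT, GPA) pairs without any null constraint, which is the distinction emphasized throughout the paper.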

References

  • Chernick, M., 1999. Bootstrap Methods: A Practitioner's Guide.
  • Davidson, R., 2000. Bootstrap confidence intervals based on inverting hypothesis tests. Document de travail, GREQAM...
  • Davidson, R., MacKinnon, J.G., 1996. The power of bootstrap tests. Discussion paper, Queen's University, Kingston,...
  • Davidson, R., MacKinnon, J.G., 1999. The size distortion of bootstrap tests. Econometric Theory.
  • Davidson, R., MacKinnon, J.G., 1999. Bootstrap testing in non-linear models. Internat. Econom. Rev.
  • Davidson, R., MacKinnon, J.G., 2000. Bootstrap tests: how many bootstraps? Econometric Rev.
  • Davidson, R., MacKinnon, J.G., 2004. Econometric Theory and Methods.
  • Davidson, R., MacKinnon, J.G., 2006. The power of bootstrap and asymptotic tests. J. Econometrics.
  • Davison, A.C., Hinkley, D.V., 1997. Bootstrap Methods and their Application.
  • DiCiccio, T.J., Efron, B., 1996. Bootstrap confidence intervals (with discussion). Statist. Sci.
  • Efron, B., 1979a. Bootstrap methods: another look at the jackknife. Ann. Statist.
  • Efron, B., 1979b. Computers and the theory of statistics: thinking the unthinkable. SIAM Rev.
  • Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap.
  • Fisher, N.I., Hall, P., 1990. On bootstrap hypothesis testing. Austral. J. Statist.
  • Good, P., 2000. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses.
  • Hall, P., 1986. On the number of bootstrap simulations required to construct a confidence interval. Ann. Statist.
  • Horowitz, J.L., 2001. The bootstrap and hypothesis tests in econometrics. J. Econometrics.