
Open Access 17-03-2023

Unbiased estimation of the OLS covariance matrix when the errors are clustered

Authors: Tom Boot, Gianmaria Niccodemi, Tom Wansbeek

Published in: Empirical Economics | Issue 6/2023


Abstract

When data are clustered, common practice has become to do OLS and use an estimator of the covariance matrix of the OLS estimator that comes close to unbiasedness. In this paper, we derive an estimator that is unbiased when the random-effects model holds. We do the same for two more general structures. We study the usefulness of these estimators against others by simulation, the size of the t-test being the criterion. Our findings suggest that the choice of estimator hardly matters when the regressor has the same distribution over the clusters. But when the regressor is a cluster-specific treatment variable, the choice does matter and the unbiased estimator we propose for the random-effects model shows excellent performance, even when the clusters are highly unbalanced.
Notes

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s00181-023-02379-w.
The authors are grateful for the incisive and most useful comments from two referees. Tom Boot acknowledges financial support from the Dutch Research Council (NWO) under research grant No. 201E.011.


1 Introduction

Within-cluster dependence presents a considerable challenge for reliable inference. Even with large data sets, a small number of clusters induces substantial finite-sample bias in the estimated variance of the regression coefficients. Several options are available to mitigate this bias. Stata uses a scalar correction to the Liang and Zeger (1986) cluster-robust variance estimator, while Bell and McCaffrey (2002) develop cluster extensions of the MacKinnon and White (1985) heteroskedasticity-robust variance estimators. See Cameron and Miller (2015) and MacKinnon et al. (2023) for recent surveys on the topic. However, with the exception of some special cases, none of these variance adjustments completely eliminates the bias.
In this paper, we develop variance estimators that are unbiased under progressively more complicated dependence structures. Our aim is to investigate whether removing the bias in the variance estimators leads to improved inference, in particular by delivering hypothesis tests with more accurate size control. The key idea underlying the unbiased variance estimator is a cluster extension of the variance estimator by Hartley et al. (1969), which is unbiased under heteroskedasticity. In its original form, this variance estimator has the drawback that it requires inverting a matrix whose dimension grows quadratically with the sample size. We show how the underlying structure of this matrix can be exploited to make the computation feasible even with large microeconometric data sets.
With a large number of clusters, test statistics based on cluster-robust variance estimators have a standard normal distribution, see for instance Hansen and Lee (2019). With a small number of clusters, the use of the normal distribution to obtain confidence intervals and critical values can lead to substantial size distortions as discussed in Cameron and Miller (2015), Sect. VI.D, unless the within-cluster dependence is restricted as in Ibragimov and Müller (2016). The use of a t-distribution reduces the size distortion, but this requires selecting the appropriate degrees of freedom (d.f.). For our proposed variance estimators, we derive a data-driven estimator for the d.f. following the approach based on an independence assumption on the errors as in Bell and McCaffrey (2002), as well as the generalization to a random-effects structure studied in Imbens and Kolesár (2016).
We focus on three dependence structures of increasing generality. First, we assume that in each cluster, the errors follow the same random-effects structure. In this case, the covariance structure depends on two (unknown) parameters. Second, we extend this setting by allowing the RE parameters to be cluster dependent, increasing the number of parameters to two times the number of clusters. Finally, we consider a fully unrestricted setting where each cluster has an arbitrary covariance matrix. This captures for example a setting with conditional heteroskedasticity where the covariance matrix depends via an unknown functional form on a set of continuous regressors. In practice, leaving the correlation structure completely undetermined is generally preferred. However, tighter parametrizations can be useful to reduce estimation uncertainty and improve the behavior of tests in settings where the number of clusters is small and the parametrization is only mildly misspecified.
As said, the first two structures contain random effects, and one might consider treating the effects as fixed, that is, adding cluster-level fixed effects to the model. This will greatly reduce the intra-cluster correlation and might be considered a simple alternative. However, the drawbacks outweigh the benefits. Just as in the closely connected case of panel data analysis, fixed effects spawn the within-transformation, which often eliminates most variation in the data while wiping out regressors that are the same for all observations in a cluster, like a cluster-specific treatment dummy, a case of great empirical relevance. At the same time, the main advantage of fixed over random effects in a panel data context, controlling for endogeneity, is not a particular issue in the current context.
For each of the three dependence structures, we numerically evaluate the size properties of hypothesis tests based on the unbiased variance estimators. We compare their performance with the default Stata option, as well as the HC2 variance estimator by Bell and McCaffrey (2002) with d.f. as in Imbens and Kolesár (2016). The model we consider includes a treatment dummy and a continuous variable. For each covariance structure, we vary the number of treated clusters and consider both a balanced design, where each cluster has the same number of observations, as well as an unbalanced design.
Under the specification where the random-effects structure is the same across clusters, we find that the corresponding unbiased variance estimator performs remarkably well. Even with only a single treated cluster, hypothesis tests provide accurate size control on both the treatment dummy and the continuous variable. When the number of observations differs between clusters, we find that the d.f. calculated under the more general RE assumption improve substantially over those calculated under independence assumptions. In a more general setting where the RE structure is cluster dependent, we find that using the corresponding variance estimator improves over the benchmarks particularly when the design is unbalanced. Finally, we consider a setting where there is conditional heteroskedasticity that depends on the continuous variable. The most general unbiased variance estimator continues to control size in this setup.
After these simulations with fully artificial data, we compare methods using real-life data with an artificial element added. That is, we estimate a wage equation on the basis of US data, clustered by state. To the real-life data, we add an artificial state-wide policy dummy variable. We study the size of a hypothesis test on the effect of this dummy variable by sampling subsets of states either at random or based on their number of observations.
Finally, we remark that we develop our variance estimators in what Abadie et al. (2020) refer to as a sampling-based framework, where we condition on the available regressors and the cluster structure is determined by the covariance structure of the regression errors. When the regressors are random as in a design-based framework, the relevant cluster structure is instead determined by the clustering in both the regressors and the regression errors. For instance, when the regressor is a treatment dummy that is randomized at the unit level, there is no need to account for clustering at all. From this perspective, we expect that in a design-based framework, the tighter cluster parametrizations we propose can be useful when these correspond to the cluster structure in the assignment mechanism for the treatment dummy.
The paper is organized as follows. In Sect. 2, we start by deriving the general form of unbiased estimators for error covariance matrices with a linear structure. We then specify this for clusters in Sect. 3. We first consider in Sect. 3.1 a simple structure with just two parameters, one for the overall error and one for the within-cluster error. In Sect. 3.2, we generalize this and make these parameters specific per cluster. In Sect. 3.3, we generalize this one more step and allow all covariances within clusters to vary freely. We proceed to compare the performance of the various unbiased variance estimators, first by simulation and then through an application to real-life data. Our performance measure is the size of the t-test. The d.f. of the t-tests play an important role, and in Sect. 4, we discuss how we set them. Section 5 describes the setup of the simulations, while the results are presented in Sect. 6. The results for the real-life data are given in Sect. 7. Section 8 concludes.
Most derivations are given in “Online Appendix A, B and C,” contained in the Supplementary Information available online. The MATLAB code for the computations reported in this paper is available from https://sites.google.com/view/tomboot/.

2 Unbiased variance estimation

We consider the linear regression model \({\textbf{y}}={\textbf{X}}{\varvec{\beta }}+{\varvec{\varepsilon }}\), with \({\textbf{X}}\) exogenous of order \(n\times k\). We follow the usual notation \({\textbf{M}}\equiv {\textbf{I}}_n-{\textbf{X}}({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\) and \({\textbf{P}}\equiv {\textbf{X}}({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\). The errors are distributed according to \({\varvec{\varepsilon }}\sim ({\varvec{0}},{\varvec{\Sigma }})\) and we consider the case where \({\varvec{\Sigma }}\) is linear in parameters,
$$\begin{aligned} \text{ vec }{\varvec{\Sigma }}={\textbf{D}}{\varvec{\pi }}, \end{aligned}$$
with \({\varvec{\pi }}\) of order \(r\times 1\) and the design matrix \({\textbf{D}}\) of order \(n^2\times r\). We are interested in unbiased estimation of the covariance matrix \({\textbf{V}}\) of the OLS estimator \({\hat{{\varvec{\beta }}}}\) of \({\varvec{\beta }}\),
$$\begin{aligned} {\textbf{V}}=({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'{\varvec{\Sigma }}{\textbf{X}}({\textbf{X}}'{\textbf{X}})^{-1}. \end{aligned}$$
As will become clear below, our analyses involving \({\textbf{V}}\) require us to consider it in stacked form, \({\textbf{v}}\equiv \text{ vec }{\textbf{V}}\). With
$$\begin{aligned} {\textbf{R}}'\equiv \left( ({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\otimes ({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\right) {\textbf{D}}, \end{aligned}$$
we have in stacked form
$$\begin{aligned} {\textbf{v}}= & {} \left( ({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\otimes ({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\right) \text{ vec }{\varvec{\Sigma }}\\= & {} {\textbf{R}}'{\varvec{\pi }}. \end{aligned}$$
We base our estimator on a function of the residuals \(\hat{{\varvec{\varepsilon }}}\equiv {\textbf{M}}{\varvec{\varepsilon }}\) that is aligned with the structure of \({\varvec{\Sigma }}\). We hence project the squared residuals on the space spanned by \({\textbf{D}}\), so we use \({\textbf{D}}({\textbf{D}}'{\textbf{D}})^{-1}{\textbf{D}}'(\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}})\), leading to the estimator
$$\begin{aligned} {\tilde{{\textbf{v}}}}= & {} \left( ({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\otimes ({\textbf{X}}'{\textbf{X}})^{-1} {\textbf{X}}'\right) {\textbf{D}}({\textbf{D}}'{\textbf{D}})^{-1}{\textbf{D}}'(\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}})\nonumber \\= & {} {\textbf{R}}'({\textbf{D}}'{\textbf{D}})^{-1}{\textbf{D}}'(\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}}). \end{aligned}$$
(1)
However, this estimator is biased; with \(\hbox {E}\left( {\textbf{D}}'(\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}})\right) ={\textbf{D}}'({\textbf{M}}\otimes {\textbf{M}}){\textbf{D}}{\varvec{\pi }}\) there holds
$$\begin{aligned} \hbox {E}({\tilde{{\textbf{v}}}})={\textbf{R}}'({\textbf{D}}'{\textbf{D}})^{-1}[{\textbf{D}}'({\textbf{M}}\otimes {\textbf{M}}){\textbf{D}}]{\varvec{\pi }}\ne {\textbf{R}}'{\varvec{\pi }}={\textbf{v}}. \end{aligned}$$
The bias is easily removed by replacing the term \(({\textbf{D}}'{\textbf{D}})^{-1}\) by \([{\textbf{D}}'({\textbf{M}}\otimes {\textbf{M}}){\textbf{D}}]^{-1}\). For the special case of heteroskedasticity, this idea is due to Hartley et al. (1969). The adapted, unbiased estimator of \({\textbf{v}}\) then is
$$\begin{aligned} {\hat{{\textbf{v}}}}\equiv {\textbf{R}}'[{\textbf{D}}'({\textbf{M}}\otimes {\textbf{M}}){\textbf{D}}]^{-1}{\textbf{D}}'(\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}}). \end{aligned}$$
(2)
For computational purposes, (2) is unattractive as the matrix \({\textbf{M}}\otimes {\textbf{M}}\) is huge with large data sets. However, we show below how the simple structure of \({\textbf{M}}\), being the sum of the unit matrix and a matrix of low rank, can be exploited to avoid computational difficulties. A relatively common issue with unbiased estimation of variance components, see for instance Kline et al. (2020), is that the estimator is not guaranteed to be positive definite. However, corrections that make the estimator positively biased are readily available and avoid overrejection.
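To fix ideas, the following is a minimal numerical sketch of (2) in Python/NumPy (illustrative only, not the MATLAB code accompanying the paper) for the heteroskedastic special case of Hartley et al. (1969), where column j of \({\textbf{D}}\) is \(\text{vec}({\textbf{e}}_j{\textbf{e}}_j')\); toy dimensions only, since it forms \({\textbf{M}}\otimes {\textbf{M}}\) explicitly:
```python
import numpy as np

# Unbiased estimator (2) for the heteroskedastic case Sigma = diag(pi):
# column j of D is vec(e_j e_j'). Toy sizes only, as M (x) M is n^2 x n^2.
rng = np.random.default_rng(0)
n, k = 12, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtXi_Xt = np.linalg.solve(X.T @ X, X.T)            # (X'X)^{-1} X'
M = np.eye(n) - X @ XtXi_Xt

D = np.zeros((n * n, n))
for j in range(n):
    D[j * n + j, j] = 1.0                          # vec(e_j e_j'), column-major

eps_hat = M @ rng.standard_normal(n)               # residuals for artificial errors
R_t = np.kron(XtXi_Xt, XtXi_Xt) @ D                # R' as defined above
v_hat = R_t @ np.linalg.solve(D.T @ np.kron(M, M) @ D,
                              D.T @ np.kron(eps_hat, eps_hat))
V_hat = v_hat.reshape(k, k, order="F")             # unbiased estimate of V
```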
Below we will consider three cases, with different design matrices \({\textbf{D}}\). In the third case, the number of columns of \({\textbf{D}}\) can be very large. Then, we can use an adapted version of (2). Let
$$\begin{aligned} {\textbf{A}}\equiv & {} {\textbf{D}}'{\textbf{D}}-{\textbf{D}}'({\textbf{I}}_n\otimes {\textbf{P}}){\textbf{D}}-{\textbf{D}}'({\textbf{P}}\otimes {\textbf{I}}_n){\textbf{D}}\\ {\textbf{W}}\equiv & {} {\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}}\\ {\textbf{F}}\equiv & {} {\textbf{D}}'({\textbf{X}}\otimes {\textbf{X}}). \end{aligned}$$
Then,
$$\begin{aligned} {\textbf{R}}'= & {} {\textbf{W}}^{-1}{\textbf{F}}'\\ {\textbf{D}}'\big ({\textbf{P}}\otimes {\textbf{P}}\big ){\textbf{D}}= & {} {\textbf{F}}{\textbf{W}}^{-1}{\textbf{F}}'\\ {\textbf{D}}'\big ({\textbf{M}}\otimes {\textbf{M}}\big ){\textbf{D}}= & {} {\textbf{D}}'{\textbf{D}}-{\textbf{D}}'\big ({\textbf{I}}_n\otimes {\textbf{P}}\big ){\textbf{D}}-{\textbf{D}}'\big ({\textbf{P}}\otimes {\textbf{I}}_n\big ){\textbf{D}}+{\textbf{D}}'\big ({\textbf{P}}\otimes {\textbf{P}}\big ){\textbf{D}}\\= & {} {\textbf{A}}+{\textbf{F}}{\textbf{W}}^{-1}{\textbf{F}}' \end{aligned}$$
Since
$$\begin{aligned} \big ({\textbf{W}}+{\textbf{F}}'{\textbf{A}}^{-1}{\textbf{F}}\big ){\textbf{W}}^{-1}{\textbf{F}}'={\textbf{F}}'{\textbf{A}}^{-1}\big ({\textbf{A}}+{\textbf{F}}{\textbf{W}}^{-1}{\textbf{F}}'\big ) \end{aligned}$$
there holds
$$\begin{aligned} {\textbf{W}}^{-1}{\textbf{F}}'\big ({\textbf{A}}+{\textbf{F}}{\textbf{W}}^{-1}{\textbf{F}}'\big )^{-1}=\big ({\textbf{W}}+{\textbf{F}}'{\textbf{A}}^{-1}{\textbf{F}}\big )^{-1}{\textbf{F}}'{\textbf{A}}^{-1}. \end{aligned}$$
Substitution in (2) yields
$$\begin{aligned} {\hat{{\textbf{v}}}}= & {} {\textbf{W}}^{-1}{\textbf{F}}'\big ({\textbf{A}}+{\textbf{F}}{\textbf{W}}^{-1}{\textbf{F}}'\big )^{-1}{\textbf{D}}'\big (\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}}\big )\nonumber \\= & {} \big ({\textbf{W}}+{\textbf{F}}'{\textbf{A}}^{-1}{\textbf{F}}\big )^{-1}{\textbf{F}}'{\textbf{A}}^{-1}{\textbf{D}}'\big (\hat{{\varvec{\varepsilon }}}\otimes \hat{{\varvec{\varepsilon }}}\big ). \end{aligned}$$
(3)
This expression still contains the inverse of the matrix \({\textbf{A}}\), which has the same number of columns as \({\textbf{D}}\). As will become clear, however, \({\textbf{A}}^{-1}\) occurs only in the form \({\textbf{F}}'{\textbf{A}}^{-1}\), which has a simple expression in this third case.
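As a small numerical check (same toy heteroskedastic design; illustrative code with names of our choosing), the two sides of (3) indeed coincide, so that only \({\textbf{A}}\) and the \(k^2\times k^2\) matrix \({\textbf{W}}+{\textbf{F}}'{\textbf{A}}^{-1}{\textbf{F}}\) need to be inverted:
```python
import numpy as np

# Verify the push-through identity behind (3) on toy data: both evaluations
# of v-hat coincide, without inverting D'(M (x) M)D directly.
rng = np.random.default_rng(1)
n, k = 10, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P
D = np.zeros((n * n, n))
for j in range(n):
    D[j * n + j, j] = 1.0                          # heteroskedastic design matrix

A = D.T @ D - D.T @ np.kron(np.eye(n), P) @ D - D.T @ np.kron(P, np.eye(n)) @ D
W = np.kron(X.T @ X, X.T @ X)
F = D.T @ np.kron(X, X)

eps_hat = M @ rng.standard_normal(n)
u = D.T @ np.kron(eps_hat, eps_hat)
lhs = np.linalg.solve(W, F.T) @ np.linalg.solve(A + F @ np.linalg.solve(W, F.T), u)
rhs = np.linalg.solve(W + F.T @ np.linalg.solve(A, F), F.T @ np.linalg.solve(A, u))
assert np.allclose(lhs, rhs)                       # the two forms in (3) agree
```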
We now turn to the cluster structure. We denote the number of clusters by C and index them by \(c=1,\ldots ,C\). Cluster c has \(n_c\) observations, so \(\sum _cn_c=n\). We let
$$\begin{aligned} \ddot{n}\equiv & {} \sum _cn_c^2\\ {\varvec{\Delta }}_n\equiv & {} \text{ diag }\;n_c. \end{aligned}$$
Let \({\textbf{i}}_c\) be an \(n_c\)-vector of ones. With a slight abuse of notation, we will write \({\textbf{I}}_c\) for \({\textbf{I}}_{n_c}\) and let
$$\begin{aligned} {\textbf{G}}_c\equiv \left( \begin{array}{c}{\textbf{O}}\\ \vdots \\ {\textbf{I}}_c\\ \vdots \\ {\textbf{O}}\end{array}\right) \qquad {\textbf{b}}_c\equiv \left( \begin{array}{c}{\varvec{0}}\\ \vdots \\ {\textbf{i}}_c\\ \vdots \\ {\varvec{0}}\end{array}\right) \qquad {\textbf{B}}\equiv ({\textbf{b}}_1,\ldots ,{\textbf{b}}_c,\ldots ,{\textbf{b}}_C). \end{aligned}$$
(4)
The regressors for cluster c are collected in \({\textbf{X}}_c\equiv {\textbf{G}}_c'{\textbf{X}}\) and their sum over the cluster in the row vector \(\tilde{{\textbf{x}}}_c'\equiv {\textbf{b}}_c'{\textbf{X}}\). The \(\tilde{{\textbf{x}}}_c'\)s are collected in the \(C\times k\) matrix \(\tilde{{\textbf{X}}}\equiv {\textbf{B}}'{\textbf{X}}\). Likewise, \(\hat{{\varvec{\varepsilon }}}_c\equiv {\textbf{G}}_c'\hat{{\varvec{\varepsilon }}}\) and \(\tilde{\hat{{\varvec{\varepsilon }}}}_c\equiv {\textbf{b}}_c'\hat{{\varvec{\varepsilon }}}\) so \(\tilde{\hat{{\varvec{\varepsilon }}}}={\textbf{B}}'\hat{{\varvec{\varepsilon }}}\).
Below we will frequently perform matrix operations using
$$\begin{aligned} \text{ vec }({\textbf{A}}{\textbf{B}}{\textbf{C}})= & {} ({\textbf{C}}'\otimes {\textbf{A}})\text{ vec }{\textbf{B}}\\ \text{ tr }({\textbf{A}}{\textbf{B}}{\textbf{C}}{\textbf{D}})= & {} \text{ vec }({\textbf{A}}')'({\textbf{D}}'\otimes {\textbf{B}})\text{ vec }{\textbf{C}}, \end{aligned}$$
for conformable generic \({\textbf{A}},{\textbf{B}},{\textbf{C}}\) and \({\textbf{D}}\). A piece of notation that is useful in the third case that we will study is the Kronecker product with a dot on top. With \({\textbf{e}}_c\) the cth unit vector, we write
$$\begin{aligned} \sum _c{\textbf{e}}_c'\;{\dot{\otimes }}\;{\textbf{A}}_c=\left( {\textbf{A}}_1,\ldots ,{\textbf{A}}_C\right) \end{aligned}$$
for matrices \({\textbf{A}}_1,\ldots ,{\textbf{A}}_C\) with the same number of rows but possibly different number of columns. The use of \({\dot{\otimes }}\) is as straightforward as the use of \(\otimes \).

3 Application to three forms of clustering

In this section, we consider three increasingly general structures for \({\varvec{\Sigma }}\) and present the variance estimator (2) for each case. The results are in stacked form, \({\hat{{\textbf{v}}}}\). We also present the simpler expressions that would be obtained if \({\varvec{\varepsilon }}\) were observable and \({\hat{{\varvec{\varepsilon }}}}\) were substituted for \({\varvec{\varepsilon }}\) afterward, that is, the results that we obtain when we neglect the presence of the regressors. These simpler expressions can be put in the usual, “unstacked” form, that is, as \({\hat{{\textbf{V}}}}\) rather than \({\hat{{\textbf{v}}}}\). Derivations are relegated to “Online Appendix A”.

3.1 Equicorrelated errors

We first consider the case where the errors are equicorrelated within clusters, so
$$\begin{aligned} {\varvec{\Sigma }}=\sigma ^2{\textbf{I}}_n+\tau ^2{\textbf{B}}{\textbf{B}}', \end{aligned}$$
with \({\textbf{B}}\) as given in (4). Let
$$\begin{aligned} {\varvec{\Psi }}=\left( \begin{array}{cc}n-k&{}n-s\\ n-s&{}\ddot{n}-2\breve{s}+\dot{s}\end{array}\right) , \end{aligned}$$
with
$$\begin{aligned} s\equiv & {} \text{ tr }({\textbf{X}}'{\textbf{X}})^{-1}\tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}\\ \dot{s}\equiv & {} \text{ tr }({\textbf{X}}'{\textbf{X}})^{-1}\tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}({\textbf{X}}'{\textbf{X}})^{-1}\tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}\\ \breve{s}\equiv & {} \text{ tr }({\textbf{X}}'{\textbf{X}})^{-1}{\tilde{{\textbf{X}}}}'{\varvec{\Delta }}_n{\tilde{{\textbf{X}}}} \end{aligned}$$
Then,
$$\begin{aligned} {\hat{{\textbf{v}}}}=({\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}})^{-1}\left( \text{ vec }\;{\textbf{X}}'{\textbf{X}},\text{ vec }\;\tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}\right) {\varvec{\Psi }}^{-1}(\hat{{\varvec{\varepsilon }}}'\hat{{\varvec{\varepsilon }}},\tilde{\hat{{\varvec{\varepsilon }}}}'\tilde{\hat{{\varvec{\varepsilon }}}})' \end{aligned}$$
(5)
is an unbiased estimator of \({\textbf{v}}\).
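In code, (5) takes only a few lines; the following sketch (our illustration, with toy cluster sizes) computes the unstacked \({\hat{{\textbf{V}}}}\):
```python
import numpy as np

# Unbiased estimator (5) for equicorrelated errors, in unstacked form.
rng = np.random.default_rng(2)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum(); k = 2
cl = np.repeat(np.arange(C), n_c)                    # cluster index per observation
B = (cl[:, None] == np.arange(C)[None, :]).astype(float)

X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtX = X.T @ X
Xt = B.T @ X                                         # X-tilde: cluster sums
M = np.eye(n) - X @ np.linalg.solve(XtX, X.T)
eps_hat = M @ rng.standard_normal(n)
et = B.T @ eps_hat                                   # cluster sums of residuals

G = np.linalg.solve(XtX, Xt.T @ Xt)                  # (X'X)^{-1} X~'X~
s, s_dot = np.trace(G), np.trace(G @ G)
s_breve = np.trace(np.linalg.solve(XtX, Xt.T @ (n_c[:, None] * Xt)))
n_ddot = (n_c ** 2).sum()
Psi = np.array([[n - k, n - s], [n - s, n_ddot - 2 * s_breve + s_dot]])

c1, c2 = np.linalg.solve(Psi, np.array([eps_hat @ eps_hat, et @ et]))
S = c1 * XtX + c2 * (Xt.T @ Xt)
V_hat = np.linalg.solve(XtX, np.linalg.solve(XtX, S).T)  # (X'X)^{-1} S (X'X)^{-1}
```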
Two remarks are in order here. The first one concerns symmetry. The \(k\times k\) covariance matrix \({\hat{{\textbf{V}}}}\), obtained by rearranging \({\hat{{\textbf{v}}}}\) into a matrix, should be symmetric. The derivation of (5) did not take this requirement into consideration. However, it is easy to show that \({\hat{{\textbf{V}}}}\) is symmetric, by employing the commutation matrix \({\textbf{K}}_k\), with properties \({\textbf{K}}_k({\textbf{A}}\otimes {\textbf{B}})=({\textbf{B}}\otimes {\textbf{A}}){\textbf{K}}_k\) for any \(k\times k\) matrices \({\textbf{A}}\) and \({\textbf{B}}\) and \({\textbf{K}}_k\text{ vec }{\textbf{C}}=\text{ vec }{\textbf{C}}\) for any symmetric \(k\times k\) matrix \({\textbf{C}}\). Symmetry of \({\hat{{\textbf{V}}}}\) is equivalent to \({\textbf{K}}_k{\hat{{\textbf{v}}}}={\hat{{\textbf{v}}}}\). By using \({\textbf{K}}_k={\textbf{K}}_k^{-1}\), this readily follows. The same holds for the other two variance estimators derived below.
The second remark concerns the role played by the regressors. Had they been neglected in the derivation, that is, had \({\textbf{v}}\) been estimated by (1) instead of by (2), we would have obtained
$$\begin{aligned} {\varvec{\Psi }}=\left( \begin{array}{cc}n&{}n\\ n&{}\ddot{n}\end{array}\right) \qquad \text{ so }\qquad {\varvec{\Psi }}^{-1}=\frac{1}{n(\ddot{n}-n)}\left( \begin{array}{rr}\ddot{n}&{}-n\\ -n &{} n\end{array}\right) . \end{aligned}$$
(6)
We can then write
$$\begin{aligned} {\hat{{\textbf{v}}}}=({\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}})^{-1}\left( \text{ vec }\;{\textbf{X}}'{\textbf{X}},\text{ vec }\; \tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}\right) (\hat{\sigma }^2,\hat{\tau }^2)' \end{aligned}$$
or
$$\begin{aligned} {\hat{{\textbf{V}}}}=({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'\hat{{\varvec{\Sigma }}}{\textbf{X}}({\textbf{X}}'{\textbf{X}})^{-1}, \end{aligned}$$
(7)
with \({\hat{{\varvec{\Sigma }}}}=\hat{\sigma }^2{\textbf{I}}_n+\hat{\tau }^2{\textbf{B}}{\textbf{B}}'\), where
$$\begin{aligned} \hat{\sigma }^2= & {} \frac{1}{n}\hat{{\varvec{\varepsilon }}}'\hat{{\varvec{\varepsilon }}}-\hat{\tau }^2 \end{aligned}$$
(8)
$$\begin{aligned} \hat{\tau }^2= & {} \frac{1}{\ddot{n}-n}(\tilde{\hat{{\varvec{\varepsilon }}}}'\tilde{\hat{{\varvec{\varepsilon }}}}-\hat{{\varvec{\varepsilon }}}'\hat{{\varvec{\varepsilon }}}). \end{aligned}$$
(9)
In this form, \({\hat{{\varvec{\Sigma }}}}\) is the estimator for \({\varvec{\Sigma }}\) used by Imbens and Kolesár (2016) in their d.f. derivation, to be discussed in the following Sect. 4.
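A sketch of the regressor-neglecting moment estimators (8)-(9) and the resulting plug-in estimator (7), again with toy data of our own:
```python
import numpy as np

# Moment estimators (8)-(9) and the plug-in covariance estimator (7).
rng = np.random.default_rng(3)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum()
cl = np.repeat(np.arange(C), n_c)
B = (cl[:, None] == np.arange(C)[None, :]).astype(float)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtX = X.T @ X
eps_hat = (np.eye(n) - X @ np.linalg.solve(XtX, X.T)) @ rng.standard_normal(n)
et = B.T @ eps_hat
n_ddot = (n_c ** 2).sum()

tau2 = (et @ et - eps_hat @ eps_hat) / (n_ddot - n)       # (9)
sig2 = eps_hat @ eps_hat / n - tau2                       # (8)
Sigma_hat = sig2 * np.eye(n) + tau2 * (B @ B.T)
V_hat = np.linalg.solve(XtX, (X.T @ Sigma_hat @ X) @ np.linalg.inv(XtX))  # (7)
```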

3.2 Cluster-specific parameters

We next let \(\sigma ^2\) and \(\tau ^2\) vary over clusters, so now
$$\begin{aligned} {\varvec{\Sigma }}=\sum _c(\sigma _c^2{\textbf{G}}_c{\textbf{G}}_c'+\tau _c^2{\textbf{b}}_c{\textbf{b}}_c'). \end{aligned}$$
Let
$$\begin{aligned} {\varvec{\Phi }}=\left( \begin{array}{cc}{\varvec{\Delta }}_n-2{\varvec{\Delta }}_s+{\textbf{A}}&{}{\varvec{\Delta }}_n-2{\varvec{\Delta }}_{\tilde{s}}+{\textbf{L}}\\ {\varvec{\Delta }}_n-2{\varvec{\Delta }}_{\tilde{s}}+{\textbf{L}}'&{}{\varvec{\Delta }}_n^2-2{\varvec{\Delta }}_n{\varvec{\Delta }}_{\tilde{s}}+{\textbf{Q}}\end{array}\right) , \end{aligned}$$
with
$$\begin{aligned} {\varvec{\Delta }}_s= & {} \text{ diag }\;\text{ tr }({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}_c'{\textbf{X}}_c\\ {\varvec{\Delta }}_{\tilde{s}}= & {} \text{ diag }\;\tilde{{\textbf{x}}}_c'({\textbf{X}}'{\textbf{X}})^{-1}\tilde{{\textbf{x}}}_c \end{aligned}$$
while \({\textbf{A}}, {\textbf{L}}\) and \({\textbf{Q}}\) are matrices of order \(C\times C\) with typical elements
$$\begin{aligned} a_{cd}\equiv & {} \text{ tr }({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}_c'{\textbf{X}}_c({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}_d'{\textbf{X}}_d\\ \ell _{cd}\equiv & {} \tilde{{\textbf{x}}}_d'({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}_c'{\textbf{X}}_c({\textbf{X}}'{\textbf{X}})^{-1}\tilde{{\textbf{x}}}_d\\ q_{cd}\equiv & {} \left( \tilde{{\textbf{x}}}_c'({\textbf{X}}'{\textbf{X}})^{-1}\tilde{{\textbf{x}}}_d\right) ^2. \end{aligned}$$
Then,
$$\begin{aligned} {\hat{{\textbf{v}}}}=({\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}})^{-1}\sum _c\left[ (\text{ vec }{\textbf{X}}_c'{\textbf{X}}_c){\textbf{e}}_c',(\tilde{{\textbf{x}}}_c\otimes \tilde{{\textbf{x}}}_c){\textbf{e}}_c'\right] {\varvec{\Phi }}^{-1}\sum _c\left( \begin{array}{c}\hat{{\varvec{\varepsilon }}}_c'\hat{{\varvec{\varepsilon }}}_c{\textbf{e}}_c\\ \tilde{\hat{{\varvec{\varepsilon }}}}^2_c{\textbf{e}}_c\end{array}\right) \end{aligned}$$
(10)
is the unbiased estimator for the variance \({\textbf{v}}\).
Here too, we present the simpler result obtained when the regressors are neglected. Then,
$$\begin{aligned} {\varvec{\Phi }}=\left( \begin{array}{cc}{\varvec{\Delta }}_n&{}{\varvec{\Delta }}_n\\ {\varvec{\Delta }}_n&{}{\varvec{\Delta }}_n^2\end{array}\right) \qquad \text{ so }\qquad {\varvec{\Phi }}^{-1}=\left( \begin{array}{cc}{\varvec{\Delta }}_n{\textbf{W}}&{}-{\textbf{W}}\\ -{\textbf{W}}&{}{\textbf{W}}\end{array}\right) , \end{aligned}$$
with \({\textbf{W}}\equiv ({\varvec{\Delta }}_n^2-{\varvec{\Delta }}_n)^{-1}\). Then,
$$\begin{aligned} {\varvec{\Phi }}^{-1}\sum _c\left( \begin{array}{c}\hat{{\varvec{\varepsilon }}}_c'\hat{{\varvec{\varepsilon }}}_c{\textbf{e}}_c\\ \tilde{\hat{{\varvec{\varepsilon }}}}^2_c {\textbf{e}}_c\end{array}\right) = \sum _c\frac{1}{n_c(n_c-1)}\left( \begin{array}{c} (n_c\hat{{\varvec{\varepsilon }}}_c'\hat{{\varvec{\varepsilon }}}_c-\tilde{\hat{{\varvec{\varepsilon }}}}^2_c){\textbf{e}}_c\\ (\tilde{\hat{{\varvec{\varepsilon }}}}^2_c-\hat{{\varvec{\varepsilon }}}_c'\hat{{\varvec{\varepsilon }}}_c){\textbf{e}}_c\end{array}\right) \equiv \sum _c\left( \begin{array}{c}{\hat{\sigma }}^2_c{\textbf{e}}_c\\ {\hat{\tau }}^2_c{\textbf{e}}_c\end{array}\right) , \end{aligned}$$
with \({\hat{\sigma }}^2_c\) and \({\hat{\tau }}^2_c\) implicitly defined. Then,
$$\begin{aligned} {\hat{{\textbf{V}}}}=({\textbf{X}}'{\textbf{X}})^{-1}\left[ \sum _c\left( {\hat{\sigma }}^2_c{\textbf{X}}_c'{\textbf{X}}_c+{\hat{\tau }}^2_c{\tilde{{\textbf{x}}}}_c{\tilde{{\textbf{x}}}}_c' \right) \right] ({\textbf{X}}'{\textbf{X}})^{-1}, \end{aligned}$$
which is the obvious extension to the case of cluster-specific parameters from the one where the parameters are the same over clusters, as discussed in Sect. 3.1.
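A toy sketch of this regressor-neglecting version, with the implicitly defined \(\hat{\sigma }_c^2\) and \(\hat{\tau }_c^2\) made explicit (illustrative code):
```python
import numpy as np

# Cluster-specific sigma_c^2, tau_c^2 and the simple V-hat of Sect. 3.2.
rng = np.random.default_rng(4)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum(); k = 2
cl = np.repeat(np.arange(C), n_c)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtX = X.T @ X
eps_hat = (np.eye(n) - X @ np.linalg.solve(XtX, X.T)) @ rng.standard_normal(n)

S = np.zeros((k, k))
for c in range(C):
    Xc, ec = X[cl == c], eps_hat[cl == c]
    nc = len(ec)
    a, b = ec @ ec, ec.sum() ** 2                 # eps_c'eps_c and (i_c'eps_c)^2
    sig2_c = (nc * a - b) / (nc * (nc - 1))
    tau2_c = (b - a) / (nc * (nc - 1))
    xt = Xc.sum(axis=0)                           # cluster sum x-tilde_c
    S += sig2_c * (Xc.T @ Xc) + tau2_c * np.outer(xt, xt)

V_hat = np.linalg.solve(XtX, np.linalg.solve(XtX, S).T)
```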

3.3 Unrestricted error correlation within clusters

The third case we consider has errors that correlate freely within clusters, in a way that differs over clusters. Thus,
$$\begin{aligned} {\varvec{\Sigma }}=\text{ diag }\;{\varvec{\Lambda }}_c, \end{aligned}$$
(11)
where the \({\varvec{\Lambda }}_c\) are \(n_c\times n_c\) matrices of parameters. With
$$\begin{aligned} {\textbf{S}}_c\equiv {\textbf{I}}_{k^2}-{\textbf{I}}_k\otimes {\textbf{X}}_c'{\textbf{X}}_c({\textbf{X}}'{\textbf{X}})^{-1}-{\textbf{X}}_c'{\textbf{X}}_c({\textbf{X}}'{\textbf{X}})^{-1}\otimes {\textbf{I}}_k, \end{aligned}$$
we now obtain
$$\begin{aligned} {\hat{{\textbf{v}}}}=\left( {\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}}+\sum _c{\textbf{S}}_c^{-1}\left( {\textbf{X}}_c' {\textbf{X}}_c\otimes {\textbf{X}}_c'{\textbf{X}}_c\right) \right) ^{-1}\sum _c{\textbf{S}}_c^{-1}\left( {\textbf{X}}_c'\hat{{\varvec{\varepsilon }}}_c\otimes {\textbf{X}}_c'\hat{{\varvec{\varepsilon }}}_c\right) \end{aligned}$$
(12)
as the unbiased estimator of \({\textbf{v}}\) for this case.
As regards the computability of \({\hat{{\textbf{v}}}}\), notice the expression includes the inverse of a \(k^2\times k^2\) matrix with no exploitable structure. However, its inversion should not be problematic computationally for a typical value of k. When cluster-specific dummies are added to the model, the matrix to be inverted increases in size to \((k+C-1)^2\times (k+C-1)^2\). Any computational problem that might arise is easily averted by eliminating the dummies through the within-transformation (subtract the cluster mean), which can be performed in O(n). The results remain the same, but now in transformed variables, and without the intercept, which becomes zero after the within-transformation.
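The following toy sketch (illustrative code; small cluster sizes so that all matrices can be formed explicitly) evaluates (12) directly:
```python
import numpy as np

# Unbiased estimator (12) for unrestricted within-cluster correlation.
rng = np.random.default_rng(5)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum(); k = 2
cl = np.repeat(np.arange(C), n_c)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtX = X.T @ X; XtXi = np.linalg.inv(XtX)
eps_hat = (np.eye(n) - X @ XtXi @ X.T) @ rng.standard_normal(n)

lhs = np.kron(XtX, XtX)
rhs = np.zeros(k * k)
for c in range(C):
    Xc, ec = X[cl == c], eps_hat[cl == c]
    XcXc = Xc.T @ Xc
    Sc = (np.eye(k * k)
          - np.kron(np.eye(k), XcXc @ XtXi)
          - np.kron(XcXc @ XtXi, np.eye(k)))       # S_c as defined above
    Sci = np.linalg.inv(Sc)
    lhs += Sci @ np.kron(XcXc, XcXc)
    u = Xc.T @ ec
    rhs += Sci @ np.kron(u, u)

V_hat = np.linalg.solve(lhs, rhs).reshape(k, k, order="F")
```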
Here too, we consider the version of \({\hat{{\textbf{v}}}}\) that neglects the regressors. Rearranged into matrix format, it becomes
$$\begin{aligned} {\hat{{\textbf{V}}}}=({\textbf{X}}'{\textbf{X}})^{-1}\sum _c{\textbf{X}}_c'\hat{{\varvec{\varepsilon }}}_c\hat{{\varvec{\varepsilon }}}_c'{\textbf{X}}_c({\textbf{X}}'{\textbf{X}})^{-1}. \end{aligned}$$
(13)
This estimator directly generalizes the White (1980) estimator for cross-sections to clusters and was introduced in the context of panel data analysis by Liang and Zeger (1986), where it underlies the widely used panel-robust standard errors allowing for both heteroskedasticity and correlation over time, see, e.g., Cameron and Trivedi (2005).
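For reference, (13) in unstacked form is the familiar few-line computation (toy data again):
```python
import numpy as np

# The Liang-Zeger estimator (13): a cluster-level sandwich.
rng = np.random.default_rng(6)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum()
cl = np.repeat(np.arange(C), n_c)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtX = X.T @ X
eps_hat = (np.eye(n) - X @ np.linalg.solve(XtX, X.T)) @ rng.standard_normal(n)

meat = sum(np.outer(X[cl == c].T @ eps_hat[cl == c],
                    X[cl == c].T @ eps_hat[cl == c]) for c in range(C))
V_LZ = np.linalg.solve(XtX, np.linalg.solve(XtX, meat).T)
```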

4 Degrees of freedom

The various expressions for \({\hat{{\textbf{V}}}}\) or \({\hat{{\textbf{v}}}}\) may be of interest by themselves but their main use will be in inference on one particular regression coefficient, \(\beta _\ell \), say. For large C, the critical values from a standard normal distribution can be used. However, in practice, C is often small, and using a t-distribution is to be preferred. For instance, Stata uses a \(t(C-1)\)-distribution after the command regress y x, vce(cluster clustvar).
Following Satterthwaite (1946), Bell and McCaffrey (2002) proposed a refinement by making the d.f. in the t-distribution data-dependent. The idea is as follows. Let \(v^2_\ell \) be the variance of the OLS estimator \(\hat{\beta }_\ell \) and \(\hat{v}^2_\ell \) an estimator of \(v^2_\ell \). Let
$$\begin{aligned} T=\frac{\hat{\beta }_\ell }{v_\ell }/{\frac{\hat{v}_\ell }{v_\ell }}. \end{aligned}$$
Under normality of the regression errors, the numerator is N(0, 1) when \(\beta _\ell =0\). Letting \(\hat{v}^2_\ell \) be the usual OLS-based estimator of \(v^2_\ell \), the denominator is distributed according to
$$\begin{aligned} (n-k)\frac{\hat{v}^2_\ell }{v^2_\ell }\sim \chi ^2_{n-k}, \end{aligned}$$
(14)
leading to the \(t(n-k)\)-distribution for T. This classical result is lost when we employ an estimator \(\hat{v}^2_\ell \) other than the usual one, such as one of the cluster-robust estimators discussed in Sect. 3. The proposal of Bell and McCaffrey (2002) is to stay close to (14), by setting the d.f. \(d_\ell \) such that
$$\begin{aligned} d_\ell \;\frac{\hat{v}^2_\ell }{v^2_\ell }{\mathop {\sim }\limits ^{\tiny {\text{ app }}}}\chi ^2_{d_\ell }, \end{aligned}$$
where “app” stands for “approximately” in the sense that the first two moments of \(d_\ell \hat{v}^2_\ell /v^2_\ell \) match those of a \(\chi ^2\)-distribution with \(d_\ell \) d.f. Using unbiased estimators of the variance as derived in the preceding section proves its usefulness here, since then the first moments on the two sides match automatically. Letting the second moments match means \(\text{ var }(d_\ell \hat{v}^2_\ell /v^2_\ell )=2d_\ell \) or
$$\begin{aligned} d_\ell =2\frac{(v_\ell ^2)^2}{\text{ var }(\hat{v}^2_\ell )}. \end{aligned}$$
(15)
Obviously, \(d_\ell \) is not known and needs to be estimated. There are two issues with this. One is that \(d_\ell \) may depend on parameters, which have to be estimated. A second issue is that evaluating \(v^2_\ell \) and \(\text{ var }(\hat{v}^2_\ell )\) requires the distribution of \({\varvec{\varepsilon }}\). As a practical solution to obtain a reasonable value of \(\hat{d}_\ell \), Bell and McCaffrey (2002) propose to simply take \({\varvec{\varepsilon }}\sim N({\varvec{0}},\sigma ^2{\textbf{I}}_n)\) as the “reference distribution.” Imbens and Kolesár (2016) suggested taking the RE model as the reference distribution, \({\varvec{\varepsilon }}\sim N({\varvec{0}},\sigma ^2{\textbf{I}}_n+\tau ^2{\textbf{B}}{\textbf{B}}')\), with \({\textbf{B}}\) as defined in (4). We will now derive expressions for \(d_\ell \) for both cases. Given our focus on unbiased estimation, we extend previous results by using an unbiased estimator for \(\text{ var }(\hat{v}^2_\ell )\) and by using an unbiased estimator of any parameter that we meet in \(d_\ell \).
So, first following Bell and McCaffrey (2002), we let \({\varvec{\varepsilon }}\sim N({\varvec{0}},\sigma ^2{\textbf{I}}_n)\). As \(\hat{v}^2_\ell \) is quadratic in \({\hat{{\varvec{\varepsilon }}}}\), we can write \({\hat{v}}^2_\ell ={\hat{{\varvec{\varepsilon }}}}'{\textbf{A}}_\ell {\hat{{\varvec{\varepsilon }}}}\) for some symmetric \(n\times n\) matrix \({\textbf{A}}_\ell \) whose particular form follows from (5), (10) or (12), depending on the case under consideration. For notational simplicity, we will omit the subscript \(\ell \) to \({\textbf{A}}\) from now on and denote \({\textbf{a}}\equiv \text{ vec }{\textbf{A}}\), so
$$\begin{aligned} {\hat{v}}^2_\ell= & {} {\hat{{\varvec{\varepsilon }}}}'{\textbf{A}}{\hat{{\varvec{\varepsilon }}}}\\= & {} {\textbf{a}}'({\hat{{\varvec{\varepsilon }}}}\otimes {\hat{{\varvec{\varepsilon }}}})\\= & {} {\textbf{a}}'({\textbf{M}}\otimes {\textbf{M}})({\varvec{\varepsilon }}\otimes {\varvec{\varepsilon }}) \end{aligned}$$
hence,
$$\begin{aligned} \text{ var }(\hat{v}^2_\ell )= & {} 2\sigma ^4{\textbf{a}}'({\textbf{M}}\otimes {\textbf{M}}){\textbf{a}}\nonumber \\= & {} 2\sigma ^4\text{ tr }{\textbf{A}}{\textbf{M}}{\textbf{A}}{\textbf{M}}. \end{aligned}$$
(16)
From (5), (10) and (12), \({\textbf{A}}\) readily appears to be block-diagonal, with cth block \({\textbf{A}}_c\) given by
$$\begin{aligned} {\textbf{A}}_{c}&= r_1{\textbf{I}}_{c} + r_2{\textbf{i}}_{c}{\textbf{i}}_{c}',&(r_1,r_2) =&\; {\textbf{f}}_\ell '({\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}})^{-1}(\text{ vec }{\textbf{X}}'{\textbf{X}},\text{ vec }\tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}){\varvec{\Psi }}^{-1}\\ {\textbf{A}}_{c}&= r_{1c}{\textbf{I}}_{c} + r_{2c}{\textbf{i}}_{c}{\textbf{i}}_{c}',&({\textbf{r}}_{1}',{\textbf{r}}_{2}')=&\; {\textbf{f}}_\ell '({\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}})^{-1}\sum _c\left( (\text{ vec }{\textbf{X}}_c'{\textbf{X}}_c){\textbf{e}}_c',(\tilde{{\textbf{x}}}_c\otimes \tilde{{\textbf{x}}}_c){\textbf{e}}_c'\right) {\varvec{\Phi }}^{-1}\\ {\textbf{A}}_{c}&= {\textbf{X}}_{c}{\textbf{Q}}_{c}{\textbf{X}}_{c}',&(\text{ vec }{\textbf{Q}}_{c})' =&\; {\textbf{f}}_\ell '\left( {\textbf{X}}'{\textbf{X}}\otimes {\textbf{X}}'{\textbf{X}}+\sum _c{\textbf{S}}_c^{-1}({\textbf{X}}_c'{\textbf{X}}_c\otimes {\textbf{X}}_c'{\textbf{X}}_c)\right) ^{-1}{\textbf{S}}_c^{-1}, \end{aligned}$$
respectively, with \({\textbf{f}}_\ell \equiv {\textbf{e}}_\ell \otimes {\textbf{e}}_\ell \) and \({\textbf{r}}_1\equiv (r_{11},\ldots ,r_{1C})'\) and likewise for \({\textbf{r}}_2\). Since \(v_\ell ^2=\sigma ^2{\textbf{e}}_\ell '({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{e}}_\ell \), we obtain
$$\begin{aligned} d_\ell =\frac{\left( {\textbf{e}}_\ell '({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{e}}_\ell \right) ^2}{\text{ tr }{\textbf{A}}{\textbf{M}}{\textbf{A}}{\textbf{M}}}, \end{aligned}$$
(17)
with
$$\begin{aligned} \text {tr}{\textbf{A}}{\textbf{M}}{\textbf{A}}{\textbf{M}}= & {} \text{ tr }\sum _{c,d}{\textbf{G}}_c{\textbf{A}}_c{\textbf{G}}_c'({\textbf{I}}-{\textbf{P}}){\textbf{G}}_d{\textbf{A}}_d{\textbf{G}}_d'({\textbf{I}}-{\textbf{P}})\nonumber \\= & {} \text {tr}\sum _{c}{\textbf{A}}_{c}^2-2\text {tr} ({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}'{\textbf{A}}^2{\textbf{X}}+\text {tr}(({\textbf{X}}'{\textbf{X}})^{-1} {\textbf{X}}'{\textbf{A}}{\textbf{X}})^2. \end{aligned}$$
(18)
Computational gains can be had by exploiting the structure of \({\textbf{A}}_c\). Notice that the expression for \(d_\ell \) does not depend on unknown parameters since the factors \(\sigma ^4\) in the numerator and the denominator cancel out.
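A brute-force sketch of (17) for the equicorrelated estimator UV1, forming \({\textbf{A}}\) and \({\textbf{M}}\) explicitly (our illustration, toy sizes only):
```python
import numpy as np

# Degrees of freedom (17) under the i.i.d. reference distribution, for UV1.
rng = np.random.default_rng(7)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum(); k = 2; ell = 1
cl = np.repeat(np.arange(C), n_c)
B = (cl[:, None] == np.arange(C)[None, :]).astype(float)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtX = X.T @ X; XtXi = np.linalg.inv(XtX)
Xt = B.T @ X
M = np.eye(n) - X @ XtXi @ X.T

G = XtXi @ Xt.T @ Xt                                  # Psi as in Sect. 3.1
s, s_dot = np.trace(G), np.trace(G @ G)
s_breve = np.trace(XtXi @ Xt.T @ (n_c[:, None] * Xt))
n_ddot = (n_c ** 2).sum()
Psi = np.array([[n - k, n - s], [n - s, n_ddot - 2 * s_breve + s_dot]])

g = np.array([XtXi[ell, ell], (XtXi @ Xt.T @ Xt @ XtXi)[ell, ell]])
r1, r2 = np.linalg.solve(Psi, g)                      # Psi is symmetric
A = r1 * np.eye(n) + r2 * (B @ B.T)                   # v-hat_ell = eps-hat'A eps-hat
d_ell = XtXi[ell, ell] ** 2 / np.trace(A @ M @ A @ M)
```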
Next, following Imbens and Kolesár (2016), we let \({\varvec{\Sigma }}= \sigma ^2{\textbf{I}}_{n} +\tau ^2{\textbf{B}}{\textbf{B}}'\), with \({\textbf{B}}\) as defined in (4). Instead of (16), we now have \(\text{ var }(\hat{v}^2_\ell )=2\text{ tr }{\textbf{A}}{\textbf{M}}{\varvec{\Sigma }}{\textbf{M}}{\textbf{A}}{\textbf{M}}{\varvec{\Sigma }}{\textbf{M}}\), and (17) generalizes to
$$\begin{aligned} d_\ell =\frac{\left( {\textbf{e}}_\ell '(\sigma ^2({\textbf{X}}'{\textbf{X}})^{-1}+\tau ^2({\textbf{X}}'{\textbf{X}})^{-1} \tilde{{\textbf{X}}}'\tilde{{\textbf{X}}}({\textbf{X}}'{\textbf{X}})^{-1}){\textbf{e}}_\ell \right) ^2}{\text{ tr }{\textbf{A}}{\textbf{M}}{\varvec{\Sigma }}{\textbf{M}}{\textbf{A}}{\textbf{M}}{\varvec{\Sigma }}{\textbf{M}}}. \end{aligned}$$
(19)
Here, both numerator and denominator depend on the parameters \(\sigma ^4, \tau ^4\) and \(\sigma ^2\tau ^2\), which do not cancel out and hence have to be replaced by estimators. The lengthy expression in the denominator poses another complication. Both complications are addressed in “Online Appendix B”. Our simulation results indicate that this more general procedure to estimate the degrees of freedom is particularly useful when the clusters are of unequal size.
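Continuing the sketch above, a brute-force analogue of (19) only swaps in the RE reference variance; \(\sigma ^2\) and \(\tau ^2\) are assumed values for illustration (in practice they are replaced by estimators, as in “Online Appendix B”):
```python
# RE reference distribution: recompute numerator and denominator of (19),
# reusing n, B, Xt, XtXi, A, M, ell from the previous sketch.
sig2, tau2 = 1.0, 0.1                     # assumed values, for illustration only
Sigma = sig2 * np.eye(n) + tau2 * (B @ B.T)
num = (sig2 * XtXi + tau2 * XtXi @ Xt.T @ Xt @ XtXi)[ell, ell] ** 2
den = np.trace(A @ M @ Sigma @ M @ A @ M @ Sigma @ M)
d_ell_re = num / den
```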

5 Simulation design

We take the simulation design of MacKinnon and Webb (2018) as our point of departure. The data generating process includes a treatment dummy and a continuous variable. For \(c=1,\ldots ,C\), it is
$$\begin{aligned} {\textbf{y}}_c = {\textbf{i}}_c\alpha + {\textbf{d}}_{c}\beta + {\textbf{x}}_{c}\gamma + {\varvec{\varepsilon }}_{c}, \end{aligned}$$
(20)
with \({\textbf{i}}_c\) the intercept, \({\textbf{d}}_{c}\) the treatment dummy equal to 1 in clusters \(1,\ldots ,C_{1}\), which we will vary from 1 to \(C-1\), and \({\textbf{x}}_c\) the continuous regressor, whose elements are independent N(0, 1). The regression errors \({\varvec{\varepsilon }}_c\) within cluster c are normally distributed with their covariance matrix \({\varvec{\Sigma }}_{c}\) specified below. The errors are independent across clusters. We set the parameters \(\alpha =\beta =\gamma =0\), the number of clusters \(C=14\), and the total number of observations \(n=2800\). The results below are based on 200,000 draws of (20). We draw the continuous variable \({\textbf{x}}_c\) only once.
Error covariance matrix To generate the data, we consider three increasingly complicated designs for the covariance matrix of the \({\varvec{\varepsilon }}_{c}\).
1.
Homogeneous design as in Sect. 3.1,
$$\begin{aligned} {\varvec{\Sigma }}_{c} = \sigma ^2{\textbf{I}}_{c} + \tau ^2{\textbf{i}}_{c}{\textbf{i}}_{c}', \end{aligned}$$
(21)
with \(\sigma ^2 = 1\) and \(\tau ^2 = 0.1\).
2.
Restricted heterogeneous design as in Sect. 3.2,
$$\begin{aligned} {\varvec{\Sigma }}_{c} = \sigma _{c}^2{\textbf{I}}_{c} + \tau _{c}^2{\textbf{i}}_{c}{\textbf{i}}_{c}'\qquad \sigma _{c}^2 = \exp \left( 2\delta \frac{C-c}{C-1}\right) \qquad \tau _{c}^2 = \rho \sigma _{c}^2. \end{aligned}$$
(22)
This way of including heterogeneity across clusters is borrowed from MacKinnon and Webb (2018). We set \(\rho =0.1\) and \(\delta =\text{ ln }(2)/2\), which means that \(\sigma _c^2\) ranges from 1 to 2.
3.
Unrestricted heterogeneous design as in Sect. 3.3,
$$\begin{aligned} {\varvec{\Sigma }}_{c} = \sigma ^2 {\textbf{I}}_{c} + \tau ^2{\textbf{i}}_{c}{\textbf{i}}_{c}' + \text {diag}({\textbf{x}}_{c})^2/2, \end{aligned}$$
(23)
with \(\sigma ^2\) and \(\tau ^2\) as in the homogeneous design.
Balance An important design choice is the number of observations per cluster. We first consider a balanced design, where the number of observations per cluster is equal to \(n/C=200\), and next an unbalanced design, where the number of observations depends on the cluster index according to
$$\begin{aligned} n_{c}=\text{ int }\left( n\frac{\exp (\gamma c/C)}{\sum _c\exp (\gamma c/C)}\right) , \quad c = 1,\ldots , C-1,\quad n_C= n-\sum _cn_c. \end{aligned}$$
(24)
We set \(\gamma = 2\), which implies cluster sizes ranging from 67 to 438 observations.
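In code, (24) with the stated values reproduces the reported range (our sketch, assuming the denominator sums over all C clusters):
```python
import numpy as np

# Cluster sizes from (24) with n = 2800, C = 14, gamma = 2.
n, C, gamma = 2800, 14, 2.0
w = np.exp(gamma * np.arange(1, C + 1) / C)
n_c = np.floor(n * w[:-1] / w.sum()).astype(int)     # clusters 1, ..., C-1
n_c = np.append(n_c, n - n_c.sum())                  # cluster C takes the remainder
print(n_c.min(), n_c.max())                          # 67 and 438, as in the text
```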
Variance estimators and reference distributions We consider the following methods to obtain t-values for the OLS estimate for \(\beta \) in (20).
1.
The first benchmark t-values are based on the cluster extension of White’s standard errors due to Liang and Zeger (1986) as already introduced in (13), but with a finite-sample correction as implemented in Stata,
$$\begin{aligned} \hat{{\textbf{V}}}_{\text {LZ1}} = \frac{C}{C-1}\frac{n-1}{n-k} ({\textbf{X}}'{\textbf{X}})^{-1}\sum _{c}{\textbf{X}}_{c}'\hat{{\varvec{\varepsilon }}}_{c}\hat{{\varvec{\varepsilon }}}_{c}'{\textbf{X}}_{c}({\textbf{X}}'{\textbf{X}})^{-1}. \end{aligned}$$
Following Stata, we compare the resulting t-statistic against the critical values of a \(t(C-1)\) distribution. We denote this benchmark method by STATA.
2.
The second benchmark t-values implement the Liang and Zeger (1986) standard errors with an HC2 correction as in Bell and McCaffrey (2002),
$$\begin{aligned} \hat{{\textbf{V}}}_{\text {LZ2}} = ({\textbf{X}}'{\textbf{X}})^{-1}\sum _{c}{\textbf{X}}_{c}'({\textbf{I}}_{c}-{\textbf{P}}_{cc})^{-1/2}\hat{{\varvec{\varepsilon }}}_{c}\hat{{\varvec{\varepsilon }}}_{c}'({\textbf{I}}_{c}-{\textbf{P}}_{cc})^{-1/2}{\textbf{X}}_{c}({\textbf{X}}'{\textbf{X}})^{-1}, \end{aligned}$$
where \({\textbf{P}}_{cc}\equiv {\textbf{X}}_{c}({\textbf{X}}'{\textbf{X}})^{-1}{\textbf{X}}_{c}'\). Computation of \(\hat{{\textbf{V}}}_{\text {LZ2}}\) involves the inverse of the square root of the \(n_c\times n_c\) matrices \({\textbf{I}}_c-{\textbf{P}}_{cc}\), which can be problematic for large \(n_c\). However, Niccodemi et al. (2020) and Kolesár (2022) show how efficient computation can be achieved, that is, in \(O(n_c)\). We compare the t-statistic that follows from using \(\hat{{\textbf{V}}}_{\text {LZ2}}\) against the critical values of a \(t(d_{\tiny {\text{ IK }}})\) distribution, with \(d_{\tiny {\text{ IK }}}\) the d.f. suggested by Imbens and Kolesár (2016). We denote this benchmark method by LZIK. A computational sketch of \(\hat{{\textbf{V}}}_{\text {LZ1}}\) and \(\hat{{\textbf{V}}}_{\text {LZ2}}\) follows after this list.
3.
The third benchmark is the wild cluster bootstrap proposed by Cameron et al. (2008). Its asymptotic validity under a diverging number of clusters was shown by Djogbenou et al. (2019). We implement the restricted version as described in Sect. 3.2 of MacKinnon (2022), where we set the number of bootstrap draws at 999. This version calculates the distribution of t-statistics that are based on the variance estimator \(\hat{{\textbf{V}}}_{\text {LZ1}}\) defined above. Since our simulation setting is close to the one analyzed by MacKinnon and Webb (2018), the results for the wild bootstrap coincide with their findings.
4.
We use the three unbiased variance estimators from Sects. 3.1–3.3, denoted by UV1, UV2, and UV3, respectively, and compare the resulting t-statistics against the critical values of a t-distribution for both reference distributions considered (indicated by RV0 and RV1, respectively), so with d.f. \(d_\ell \) from (17) and from (19). This yields six cases, UV1(RV0), UV1(RV1), UV2(RV0), UV2(RV1), UV3(RV0), and UV3(RV1).
Notice that LZ2 does not exist when the number of (un)treated clusters is smaller than two, and that UV2(\(\cdot \)), UV3(\(\cdot \)) do not exist when the number of (un)treated clusters is smaller than three. We then set the size to zero.
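As announced in item 2, here is a toy sketch of the two benchmark variance estimators (illustrative code; the \(O(n_c)\) computation of the matrix square root in Niccodemi et al. (2020) and Kolesár (2022) is not reproduced, a plain eigendecomposition is used instead):
```python
import numpy as np

# The two benchmarks: LZ1 with Stata's scalar correction, and HC2-corrected LZ2.
def inv_sqrt_sym(S):
    lam, Q = np.linalg.eigh(S)          # S symmetric positive definite here
    return Q @ np.diag(lam ** -0.5) @ Q.T

rng = np.random.default_rng(8)
n_c = np.array([4, 5, 6]); C = len(n_c); n = n_c.sum(); k = 2
cl = np.repeat(np.arange(C), n_c)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
XtXi = np.linalg.inv(X.T @ X)
eps_hat = (np.eye(n) - X @ XtXi @ X.T) @ rng.standard_normal(n)

meat1 = np.zeros((k, k)); meat2 = np.zeros((k, k))
for c in range(C):
    Xc, ec = X[cl == c], eps_hat[cl == c]
    u1 = Xc.T @ ec
    u2 = Xc.T @ inv_sqrt_sym(np.eye(len(ec)) - Xc @ XtXi @ Xc.T) @ ec
    meat1 += np.outer(u1, u1); meat2 += np.outer(u2, u2)

V_LZ1 = C/(C - 1) * (n - 1)/(n - k) * XtXi @ meat1 @ XtXi   # Stata correction
V_LZ2 = XtXi @ meat2 @ XtXi                                  # HC2 (Bell-McCaffrey)
```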

6 Simulation results

The main results of the simulations are presented in Figs. 1, 2 and 3, based on data simulated with error covariance matrix as in (21), (22) and (23), respectively. They show the size of the t test for \(H_0:\beta =0\), with \(\beta \) the coefficient of the dummy variable in (20). The number of treated clusters is on the horizontal axis. The upper panel of each figure is for the balanced case and the lower panel for the unbalanced case as described in (24). Each figure shows seven curves. The first four are STATA, LZIK, UV1(RV0), UV1(RV1). When we analyze the t-test on the treatment variable, the differences between UV2(RV0) and UV3(RV0), as well as those between UV2(RV1) and UV3(RV1), are not visible, so we report those as UV2/3(RV0) and UV2/3(RV1). Finally, we report the results for the restricted wild bootstrap. Notice that three variances are involved: the reference variance to obtain \(d_\ell \); the variance whose unbiased estimator was used; and the variance used in the simulation. For clarity, Table 1 summarizes.
The most relevant curves in all three figures are the ones labeled UV1(RV1) in Fig. 1, upper and lower panels. The homogeneous RE design can be considered the more or less generic case in the clustered-error literature and, as is apparent from Table 1, this particular curve is maximally based on this design as it underlies the data generation SV1, the variance estimator UV1, and \(d_\ell \) based on RV1.
SV1 Inspecting Fig. 1 we see, for the balanced design in the upper panel, excellent size control for UV1(\(\cdot \)). This holds even when there is only a single treated cluster. It does not appear to matter whether the d.f. are calculated under the more restrictive i.i.d. assumption, UV1(RV0), or the RE structure, UV1(RV1). By contrast, UV2(\(\cdot \)), UV3(\(\cdot \)) and LZIK are slightly conservative when we have a small or large number of treated clusters. The STATA variance estimator performs quite poorly, especially when the number of treated clusters is small or large. The bootstrap with t-statistics based on the STATA variance estimator performs much better.
Moving to the unbalanced setup in the lower panel of Fig. 1, we see that UV1(RV0) no longer provides accurate size control. However, UV1(RV1), the most relevant case as argued above, still exhibits excellent performance. The additional computational complexity of this approach appears to pay off. We also see that, unlike in the balanced case, the results for UV2(RV0) and UV3(RV0) differ from the benchmark variance estimator LZIK. The unbiased variance estimators are more conservative for a small number of treated clusters, while becoming slightly oversized for 9–11 treated clusters. UV2(RV1) and UV3(RV1) are again very close to LZIK. The STATA variance estimator again is found not to accurately control size. For the bootstrap, we find that it is undersized for a small number of treated clusters and oversized for a large number of treated clusters.
SV2 In Fig. 2, we show the size for t tests based on the various variance estimators under the restricted heterogeneous design where each cluster has its own variance and covariance parameter. This setup is more general than the homogeneous design in which each cluster has the same variance and covariance parameter. As expected, the performance of UV1(RV0) and UV1(RV1) somewhat deteriorates in this setup, with size slightly below 0.10 for the case of a single treated cluster and balanced design. The same is observed for an unbalanced design, with the size obtained under UV1(RV1) being just over 0.10.
For UV2 and UV3, under both d.f., and LZIK, we see the test slightly overrejects for a small number of treated clusters. When the number of treated clusters increases, the tests become progressively more conservative. Again, a difference emerges between UV2, UV3 and LZIK in the unbalanced case presented in the lower panel of Fig. 2. Here, size control is more accurate for UV2 and UV3 compared to LZIK. Especially UV2(RV0) and UV3(RV0) perform well in this setup, providing accurate size control up to roughly eight treated clusters. With more treated clusters, they tend to be conservative, although not as much as LZIK. The bootstrap performance is similar to that in SV1, although in a balanced design with a small number of treated clusters, it is slightly oversized.
SV3 The results for the unrestricted heterogeneous design are nearly identical to those in the homogeneous design for the STATA variance, the bootstrap, LZIK and UV2 and UV3 under both d.f. corrections. For UV1, we find reasonable performance when clusters are balanced. When the clusters are unbalanced, UV1(RV0) becomes oversized for a small number of treated clusters and undersized when the number of treated clusters is large. The more general d.f. correction in UV1(RV1) partly corrects these size distortions.
So much for the test on \(\beta \), the coefficient of the cluster-specific dummy variable. We can be much more concise as to \(\gamma \), the coefficient of the continuous variable. For SV1 and SV2, the size control is almost perfect. This no longer holds for SV3, where the size is still almost perfect for STATA, LZIK, UV3(\(\cdot \)) but appears to be double the nominal size for UV1(\(\cdot \)) and UV2(\(\cdot \)); the latter methods are apparently sensitive when the data are generated according to the more general scheme SV3.
Degrees of freedom in SV1 Given the notable differences in performance when using degrees of freedom based on RV0 or RV1, we analyze the degrees of freedom under SV1 in Fig. 4. For a balanced design, we see that the degrees of freedom for UV1 are equal to \(C-2\). Donald and Lang (2007) show that if the design is balanced and if all regressors are invariant within clusters, the t-statistic is \(t(C-k)\) distributed, where k is the number of regressors in the model. We can expect the same result to apply here since the continuous variable is uncorrelated with the treatment dummy.
Under a balanced design, the degrees of freedom for the other methods are nearly identical. They are low when the number of treated clusters is low and increase to their maximum when half of the clusters are treated. This maximum appears to coincide numerically with \(C-k\) as well.
When the design is unbalanced, the degrees of freedom under RV1 deviate strongly from those under RV0. This is especially true for UV1 and a small number of treated clusters. For the remaining variance estimators, we see that under RV0 the degrees of freedom are asymmetric in the number of treated clusters, while those under RV1 are symmetric.
Theoretical explanation of the differences The simulations highlight that UV1 can offer accurate size control even with only a single treated cluster, while UV2 and UV3 require a somewhat larger number of treated clusters. To explain these results from a theoretical perspective, we derive in “Online Appendix C” the required conditions for the consistency of the variance estimators in a simple model. There is a single treatment variable that is equal to one in \(t_{C}\) out of C clusters. The design is balanced, so that each cluster has n/C observations. The errors are \(N({\varvec{0}},{\varvec{\Sigma }})\) with \({\varvec{\Sigma }}\) as in Sect. 3.1. “Online Appendix C” shows that UV1 is consistent if \(n^2/C^3\rightarrow 0\). This requires the number of clusters to grow sufficiently fast, but does not impose any restriction on the number of treated clusters. For UV2 and UV3 on the other hand, we find that the number of treated clusters should diverge sufficiently fast so that \(n^2/(C^2\cdot t_{C})\rightarrow 0\). In contrast to UV1, we now only achieve consistency when the number of treated clusters goes to infinity. These results explain the difference in performance of the variance estimators when the number of treated clusters is small.

7 A placebo-regression experiment

To analyze the performance of the unbiased variance estimators in an empirical setting, we consider a placebo-regression experiment. Placebo regressions were originally proposed by Bertrand et al. (2004) to analyze the validity of commonly used standard errors for difference-in-difference estimators. We consider an application similar to that in Cameron and Miller (2015).
We use the Current Population Survey (CPS) 2012 data set that can be obtained from https://cps.ipums.org/cps/. The data consist of 51 clusters: the fifty American states and the District of Columbia. The number of observations in each cluster varies from 519 (Montana) to 5866 (California). For observation h in cluster \(i=1,\dots ,C\), we define the model
$$\begin{aligned} {\text{ ln }}(\text{ wage})_{hi} = \beta _0 + \beta _1 \text{ educ}_{hi} + \beta _2 \text{ age}_{hi}+\beta _3 \text{ age}^{2}_{hi} + \beta _4 \text{ policy}_{i} + \varepsilon _{hi}. \end{aligned}$$
(25)
Here, \(\text{ policy } \) is a fake policy variable that is randomly assigned to \(C_{1}=1,\ldots ,C-1\) sampled clusters and constant within each cluster. Since the policy variable is fake, we expect 5% rejections across the replications when we test the hypothesis \({\text{ H }}_0:\beta _4=0\) at the 5% level.
In line with the simulations in the previous section, we sample a subset of \(C=14\) clusters from the 51 available clusters. We consider two different ways of sampling this subset. In the first, we randomly sample clusters with replacement. To test the methods in an unbalanced setup, we also consider using the 3 states with the most observations and the 11 states with the fewest observations. To preserve the relative share of observations in each cluster, we randomly sample with replacement 20% of the observations within each sampled cluster.
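Schematically, one replication of the experiment looks as follows; the CPS variables are replaced by synthetic stand-ins, errors are redrawn rather than resampled for simplicity, and all names and parameter values are illustrative only:
```python
import numpy as np

# Schematic placebo replication for (25): a fake policy dummy, beta_4 = 0,
# tested with the Stata-style LZ1 variance and t(C-1) critical values.
rng = np.random.default_rng(9)
C, C1, reps, t_crit = 14, 3, 2000, 2.160             # t(13), 5% two-sided
n_c = rng.integers(100, 500, size=C); n = n_c.sum()
cl = np.repeat(np.arange(C), n_c)
educ = rng.normal(13, 2, n); age = rng.uniform(20, 60, n)

rej = 0
for _ in range(reps):
    treated = rng.choice(C, size=C1, replace=False)
    policy = np.isin(cl, treated).astype(float)       # fake state-wide policy
    eps = rng.standard_normal(n) + 0.3 * rng.standard_normal(C)[cl]
    y = 1.0 + 0.1 * educ + 0.05 * age - 5e-4 * age**2 + eps
    X = np.column_stack([np.ones(n), educ, age, age**2, policy])
    XtXi = np.linalg.inv(X.T @ X)
    b = XtXi @ X.T @ y
    e = y - X @ b
    meat = sum(np.outer(X[cl == c].T @ e[cl == c], X[cl == c].T @ e[cl == c])
               for c in range(C))
    V = C/(C-1) * (n-1)/(n-5) * XtXi @ meat @ XtXi    # Stata-style LZ1
    rej += abs(b[-1]) / np.sqrt(V[-1, -1]) > t_crit
print(rej / reps)   # rejection rate; ideally near 0.05
```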
Figures 5 and 6 show the empirical size (upper panel) and the degrees of freedom (lower panel) averaged over 10,000 replications for the four different designs. The x-axis again depicts the number of treated clusters.
In line with the Monte Carlo results from the previous section, we see that the Stata variance estimator with \(C-1\) degrees of freedom is severely oversized. This effect is largely mitigated by using the bootstrap, although it is consistently oversized for a moderate number of treated clusters. This is especially the case in the “3–11” setting. In contrast, we find remarkably good size control for UV1(RV1) across the designs. The degrees of freedom drop considerably when moving from RV0 to RV1. This shows that the use of RV1 is of empirical relevance, especially in the settings with higher imbalance and a small number of (un)treated clusters. The LZIK variance estimator also performs well, although it is oversized in the highly unbalanced “3–11” setting. There the unbiased variance matrix estimators control size more accurately.

8 Concluding remarks

The point of departure in this paper has been to derive unbiased estimators of the covariance matrix of the OLS estimator when the data are clustered. We considered three cases, the leading one being the RE model. This led to our main research question, which is to assess the performance of these estimators in the t test for a particular regression coefficient, both among each other and vis-à-vis two oft-used alternatives.
We addressed this question by simulation, in a regression model with two regressors, one being continuous and distributed equally in all clusters, while the other regressor represented a cluster-specific treatment dummy. The main finding of the simulation study was the excellent behavior of the t test based on the unbiased estimator for the RE model, for the case that the data actually have been generated according to this model and the degrees of freedom have been based on it. So the three variances that play a role are aligned. This result holds for the coefficient of the cluster-specific dummy variable; there is hardly a noticeable difference in performance between the other variance estimators underlying the t test.
The random-effects model considered in Sect. 3.1 suggests an issue worthy of investigation. Throughout the paper, we considered the OLS estimator and the t-values related to it under various specifications. However, we can also consider the feasible GLS estimator. If the random-effects specification were the correct one (and the random-effects parameters were known exactly), the model would have no clustered error terms anymore after the usual transformation well known from the panel data literature. Unlike the transformation corresponding with fixed effects, the transformation for random effects keeps cluster-specific regressors in the model, although with little variation, thus leading to large variances of the GLS estimators. It is interesting to know how this would work out in theory and practice.
In our analysis, we have restricted ourselves to the case of a cross-sectional model. An obvious topic for future research is an extension to the case of panel data and difference-in-differences models.
A next step is to see whether the excellent behavior mentioned above also shows up in the case where the three variances are still aligned but now pertain to the more flexible RE model in which the two error-components parameters differ over clusters. While this is by itself eminently doable, the question arises how to test this heterogeneous RE structure against the homogeneous one. An obvious starting point is the score test framework proposed by Breusch and Pagan (1980). Deriving the relevant expression is straightforward, but deriving the (limiting) distribution of the test statistic is not, since the number of parameters grows with the number of clusters. We can let \(n\rightarrow \infty \) as \(C\rightarrow \infty \), but we can also consider keeping C fixed while letting the number of observations per cluster go to infinity, or any combination of the two.
The results in the paper on the quality of the unbiased estimators in the t-test are based on simulation only. We are not aware of any theory that might help give these results a theoretical basis. There is certainly a research challenge here.
Table 1 Overview of the variances used

| \({\varvec{\Sigma }}_c\) | Reference variance | Unbiased estimator | Simulation variance |
| --- | --- | --- | --- |
| \(\sigma ^2 {\textbf{I}}_c\) | RV0 | – | – |
| \(\sigma ^2 {\textbf{I}}_c+\tau ^2{\textbf{i}}_c{\textbf{i}}_c'\) | RV1 | UV1 | SV1 |
| \(\sigma _c^2{\textbf{I}}_c+\tau _c^2{\textbf{i}}_c{\textbf{i}}_c'\) | – | UV2 | SV2 |
| \({\varvec{\Lambda }}_c\) | – | UV3 | SV3 |

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix

Supplementary Information

Below is the link to the electronic supplementary material.
Literature
Abadie A, Athey S, Imbens GW, Wooldridge JM (2020) Sampling-based versus design-based uncertainty in regression analysis. Econometrica 88(1):265–296
Bell RM, McCaffrey DF (2002) Bias reduction in standard errors for linear regression with multi-stage samples. Surv Methodol 28:169–179
Bertrand M, Duflo E, Mullainathan S (2004) How much should we trust differences-in-differences estimates? Quart J Econ 119:249–275
Breusch TS, Pagan AR (1980) The Lagrange multiplier test and its applications to model specification in econometrics. Rev Econ Stud 47:239–253
Cameron AC, Miller DL (2015) A practitioner's guide to cluster-robust inference. J Hum Resour 50:317–372
Cameron AC, Trivedi PK (2005) Microeconometrics. Cambridge University Press, Cambridge
Cameron AC, Gelbach JB, Miller DL (2008) Bootstrap-based improvements for inference with clustered errors. Rev Econ Stat 90(3):414–427
Djogbenou AA, MacKinnon JG, Nielsen MØ (2019) Asymptotic theory and wild bootstrap inference with clustered errors. J Econom 212(2):393–412
Donald SG, Lang K (2007) Inference with difference-in-differences and other panel data. Rev Econ Stat 89(2):221–233
Hansen BE, Lee S (2019) Asymptotic theory for clustered samples. J Econom 210(2):268–290
Hartley H, Rao J, Kiefer G (1969) Variance estimation with one unit per stratum. J Am Stat Assoc 64:173–181
Ibragimov R, Müller UK (2016) Inference with few heterogeneous clusters. Rev Econ Stat 98(1):83–96
Imbens GW, Kolesár M (2016) Robust standard errors in small samples: some practical advice. Rev Econ Stat 98(4):701–712
Kline P, Saggio R, Sølvsten M (2020) Leave-out estimation of variance components. Econometrica 88(5):1859–1898
Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73(1):13–22
MacKinnon JG (2022) Fast cluster bootstrap methods for linear regression models. Econom Stat 21
MacKinnon JG, Webb MD (2018) The wild bootstrap for few (treated) clusters. Econom J 21(2):114–135
MacKinnon JG, White HL (1985) Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. J Econom 29(3):305–325
MacKinnon JG, Nielsen MØ, Webb MD (2023) Cluster-robust inference: a guide to empirical practice. J Econom 232(2):272–299
Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biom Bull 2:110–114
White HL (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48:817–838