1 Introduction
We consider issues of measuring treatment effects using panel data. We assume there are N cross-sectional units, each observed over T time periods. The treatment effect for the ith unit at time t is measured as the difference between the outcome under the treatment, \(y^1_{it}\), and the outcome in the absence of the treatment, \(y^0_{it}\),
$$\begin{aligned} \Delta _{it} =y^1_{it} - y^0_{it}. \end{aligned}$$
(1.1)
However, the observed data for the ith individual at time t take the form
$$\begin{aligned} y_{it} =d_{it} y^1_{it} +(1-d_{it}) y^0_{it}, \ i=1, \ldots , N; \ t=1, \ldots , T, \end{aligned}$$
(1.2)
where \(d_{it}\) denotes the treatment status dummy with \(d_{it}=1\) if the ith individual at time t is under the treatment and 0 if not. That is, the observed data are either \(y^1_{it} \) or \(y^0_{it} \), never both simultaneously. To provide an estimate of the treatment effect, \(\Delta _{it}\), one needs to substitute the missing \(y^1_{it} \) or \(y^0_{it} \) by its predicted value.
Assuming \(y^1_{it}\) is observed but not \(y^0_{it}\), the estimated treatment effect for the ith unit at time t is
$$\begin{aligned} \hat{\Delta }_{it}= y^1_{it}-\hat{y}^0_{it}, \end{aligned}$$
(1.3)
where \(\hat{y}^0_{it}\) denotes the predicted value (or counterfactual) of \(y^0_{it}\). Conditional on the observed \(y^1_{it}\), the bias and the variance of the estimated \(\hat{\Delta }_{it}\) are
$$\begin{aligned} E( \hat{\Delta }_{it} |y^1_{it})&= E (y^1_{it} - \hat{y}^0_{it} | y^1_{it}) = E [(y^1_{it}-y^0_{it}) + (y^0_{it}-\hat{y}^0_{it}) | y^1_{it}]\nonumber \\ &=E(\Delta _{it}| y^1_{it})+ E(y^0_{it}-\hat{y}^0_{it}), \end{aligned}$$
(1.4)
and
$$\begin{aligned} \text {Var} (\hat{\Delta }_{it}|y^1_{it})= E\left[ (y^0_{it}-\hat{y}^0_{it})^2 \right] . \end{aligned}$$
(1.5)
In other words, the bias and variance of \(\hat{\Delta }_{it}\) conditional on \(y^1_{it}\) depend only on the bias and variance of the prediction error of \(\hat{y}^0_{it}\) (or of \(\hat{y}^1_{it}\) if \(y^0_{it}\) is the observed outcome). That is, obtaining an accurate measurement of \(\Delta _{it}\) is fundamentally an issue of obtaining a good prediction of \(y^0_{it}\). However, there is no realized \(y^0_{it}\) against which to evaluate how close \(\hat{y}^0_{it}\) is to \(y^0_{it}\); any \(\hat{y}^0_{it}\) is a counterfactual. There is no way to say that one method of generating \(\hat{y}^0_{it}\) is better than another by comparing the difference between \(y_{it}\) and \(\hat{y}^0_{it}\). Therefore, any preference for a particular method of generating counterfactuals must be based on the compatibility of its underlying assumptions with the data-generating process of the observed \(y^0_{it}\), and the predictive accuracy of the method is conditional on the data-generating process of \(y^0_{it}\).
Panel data sets provide the possibility of simultaneously capturing inter-individual differences and intra-individual dynamics. Compared to cross-sectional \((T=1)\) or time series \((N=1)\) data sets, panel data possess several advantages:
1.
2. Information on individuals' responses to policy changes provides the possibility of identifying whether the differences in individual treatment effects can be considered as due to chance events (i.e., homogeneous) or due to some fundamental differences (i.e., heterogeneous), and thus whether it makes sense to consider the estimation of average treatment effects (ATE) (e.g., Hsiao 2022).
3. Information across individuals over time not only provides the possibility of examining whether there are 'treatment effects,' but also provides the possibility of examining whether treatment effects are evolutionary over time or stationary around a common mean (e.g., Hsiao et al. 2012; Ke and Hsiao 2022, 2023).
4. It provides the possibility of blending the advantages of the nonparametric approach to estimating treatment effects with those of the parametric approach to identifying causal factors (e.g., Ke et al. 2017).
In this paper, we selectively review the panel data approaches to measuring treatment effects in light of these advantages. We assume there are \(N_1\) units receiving the treatment and \((N-N_1)\) units not receiving it. However, for ease of illustrating the fundamental methodology, we consider using panel data to measure the treatment effects of the first unit. In other words, only the first unit is in the treatment group; the rest of the units are in the control group. We assume that up to period T, no cross-sectional unit received the treatment. From period \(T+1\) to \(T+m\), the first unit receives the treatment, i.e., \(y_{1t}= y^1_{1t}\), while \(y_{it}=y^0_{it}\) for \(i=2,\ldots ,N\), \(t=T +1, \ldots , T +m\). Sections 2 and 3 consider the causal and non-causal approaches to constructing counterfactuals for a single treated unit. In Sect. 4, we consider issues of multiple treated units. Since, as argued in (1.4) and (1.5), conditional on \(y_{1t}= y^1_{1t}\) for \(t=T+1,\ldots , T+m,\) the measurement of the treatment effects \(\Delta _{1t}\) is essentially an issue of predicting \(y^0_{1t}\), for notational ease we shall drop the superscript "0" and simply use \(y_{it}\) for \(y^0_{it}\). Concluding remarks are in Sect. 5.
2 Causal approach
The causal approach essentially assumes that the observed outcomes can be decomposed as the sum of the impact of some observed covariates \(\varvec{x}_{it}\) and the impact of unobserved factors represented by the error terms,
$$\begin{aligned} y^1_{it} = g_1(\varvec{x}_{it})+ \varepsilon ^1_{it}, \end{aligned}$$
(2.1)
and
$$\begin{aligned} y^0_{it} = g_0(\varvec{x}_{it})+ \varepsilon ^0_{it}, \end{aligned}$$
(2.2)
where the error terms, \(\varepsilon ^1_{it}\) and \(\varepsilon ^0_{it}\), are typically assumed to be uncorrelated with \(\varvec{x}_{it}\),
$$\begin{aligned} E(\varepsilon ^1_{it}|\varvec{x}_{it})= E(\varepsilon ^1_{it})=0 \end{aligned}$$
(2.3)
and
$$\begin{aligned} E(\varepsilon ^0_{it}|\varvec{x}_{it})= E(\varepsilon ^0_{it})=0. \end{aligned}$$
(2.4)
Then the average treatment effects conditional on \(\varvec{x}\), ATE\((\varvec{x})\), are just
$$\begin{aligned} \text {ATE} (\varvec{x})= g_1(\varvec{x}) -g_0 (\varvec{x}). \end{aligned}$$
(2.5)
However, since the observed data take the form (1.2), participation in the treatment \((d_{it})\) could be correlated with the outcome (e.g., Heckman and Vytlacil 2001). Suppose the treatment status dummy, or participation decision, for \(d_{it}\) can be postulated by introducing a latent response function,
$$\begin{aligned} d^*_{it}=h(\varvec{z}_{it})+ v_{it}, \ E(v_{it}| \varvec{x}_{it}, \varvec{z}_{it})=0, \end{aligned}$$
(2.6)
where
$$\begin{aligned} d_{it} =\left\{ \begin{array}{ll} 1 \quad \text {if} \quad d^{*}_{it}> 0, \\ 0 \quad \text {if} \quad d^{*}_{it} \le 0. \end{array} \right. \end{aligned}$$
(2.7)
Then, conditional on \(\varvec{x}_{it}\) and \(d_{it}\), the expected value of \(\varepsilon ^j_{it}\) could be either
$$\begin{aligned} E(\varepsilon ^j_{it}|\varvec{x}_{it}, d_{it})=0,\quad j=0,1, \end{aligned}$$
(2.8)
or
$$\begin{aligned} E(\varepsilon ^j_{it}|\varvec{x}_{it}, d_{it}) \ne 0,\quad j=0,1. \end{aligned}$$
(2.9)
When (2.8) holds, i.e., \(f(\varepsilon ^1, \varepsilon ^0, v | \varvec{x} )=f(\varepsilon ^1, \varepsilon ^0 |\varvec{x}) f(v | \varvec{x})\), models (2.1) and (2.2) are typically called the two-part model. If the conditional mean functions, \(g_1 (\varvec{x})\) and \(g_0 (\varvec{x})\), are known, regression methods can be applied to obtain consistent estimators of their parameters.1 When the conditional mean functions are unknown, nonparametric methods can be applied to identify \(g_1 (\varvec{x})\) and \(g_0 (\varvec{x})\) (e.g., Li and Racine 2007).
If (2.9) holds, then \(f(\varepsilon ^1, \varepsilon ^0 | v )\ne f(\varepsilon ^1, \varepsilon ^0)\). Models (2.1), (2.2), (2.6), and (2.7), together with \(f(\varepsilon ^1, \varepsilon ^0, v )\), are typically referred to as sample selection models (e.g., Heckman 1979), and the observed data are subject to selection on unobservables (e.g., Heckman and Vytlacil 2001).
If \(g_1(\varvec{x}), g_0(\varvec{x})\) and \(f(\varepsilon ^1, \varepsilon ^0, v)\) are known, then under the assumption that \(g_1 (\varvec{x})=\varvec{x}' \varvec{\beta }_1\) and \(g_0(\varvec{x})=\varvec{x}' \varvec{\beta }_0\), the observed \(y^1_{it}\) or \(y^0_{it}\) takes the form
$$\begin{aligned} y^1_{it}&=E(y^1_{it}|\varvec{x}_{it},d_{it}=1)+ \eta ^1_{it} \nonumber \\&=\varvec{x}'_{it} \varvec{\beta }_1+ \gamma ^1 (\varvec{z}_{it})+ \eta ^1_{it}, \end{aligned}$$
(2.10)
and
$$\begin{aligned} y^0_{it}&=E(y^0_{it}|\varvec{x}_{it},d_{it}=0)+ \eta ^0_{it} \nonumber \\&=\varvec{x}'_{it} \varvec{\beta }_0+ \gamma ^0 (\varvec{z}_{it})+ \eta ^0_{it}, \end{aligned}$$
(2.11)
where \(\gamma ^1(\varvec{z}_{it})= E(\varepsilon ^1_{it}|v_{it} > -h(\varvec{z}_{it}) )\), \(\gamma ^0(\varvec{z}_{it})= E(\varepsilon ^0_{it}|v_{it} \le -h(\varvec{z}_{it}))\), and \(\eta ^1_{it}, \eta ^0_{it}\) denote the residuals. If \(f(\varepsilon ^1, \varepsilon ^0, v)\) is known, the maximum likelihood method can be implemented to estimate \(\varvec{\beta }_1, \varvec{\beta }_0, \gamma ^1 (\cdot )\) and \( \gamma ^0 (\cdot )\) (e.g., Damrongplasit et al. 2010). If the joint distribution \(f(\varepsilon ^1, \varepsilon ^0, v)\) is unknown, \( \gamma ^1 (\cdot )\) and \( \gamma ^0 (\cdot )\) are unknown.
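To make the parametric case concrete, the following is a minimal sketch of the classic Heckman (1979) two-step estimator of (2.10), under the illustrative assumption that \((\varepsilon ^1_{it}, v_{it})\) are jointly normal, so that \(\gamma ^1(\varvec{z})\) reduces to a scaled inverse Mills ratio; the simulated design and all parameter values are ours, not taken from the cited references.

```python
# Sketch: Heckman (1979) two-step estimation of (2.10), assuming joint
# normality of (eps^1, v). Simulated design for illustration only.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = np.column_stack([np.ones(n), x, rng.normal(size=n)])
rho = 0.6                                   # corr(eps^1, v)
v = rng.normal(size=n)
eps1 = rho * v + np.sqrt(1 - rho**2) * rng.normal(size=n)
d = (z @ np.array([0.2, 1.0, 0.8]) + v > 0).astype(float)   # (2.6)-(2.7)
y1 = 1.0 + 2.0 * x + eps1                                    # x'beta_1 + eps^1

# Step 1: probit of d on z to estimate h(z) = z'a
def nll(a):
    p = norm.cdf(z @ a).clip(1e-10, 1 - 1e-10)
    return -(d * np.log(p) + (1 - d) * np.log(1 - p)).sum()
a_hat = minimize(nll, np.zeros(z.shape[1]), method="BFGS").x

# Step 2: OLS of y^1 on (1, x, inverse Mills ratio) over treated units,
# since gamma^1(z) = sigma_{1v} * phi(z'a) / Phi(z'a) under normality
mills = norm.pdf(z @ a_hat) / norm.cdf(z @ a_hat)
W = np.column_stack([np.ones(n), x, mills])[d == 1]
beta_hat = np.linalg.lstsq(W, y1[d == 1], rcond=None)[0]
print("(intercept, slope, selection coefficient):", beta_hat.round(2))
```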
When \(T=1\) (i.e., cross-sectional data), we drop the subscript t for ease of notation. Robinson (1988) notes that, conditional on \(\varvec{z}_i\),
$$\begin{aligned} E(y^j_i| \varvec{z}_i)= E(\varvec{x}_i| \varvec{z}_i)' \varvec{\beta }_j + \gamma ^j(\varvec{z}_i), \quad j=0,1. \end{aligned}$$
(2.12)
Subtracting (2.12) from (2.10) (or (2.11)) yields2
$$\begin{aligned} y^j_i-E(y^j_i| \varvec{z}_i)=(\varvec{x}_i -E(\varvec{x}_i|\varvec{z}_i))' \varvec{\beta }_j + \eta ^j_i, \quad j=0,1, \end{aligned}$$
(2.13)
where \(\eta ^j_i\) denotes the residual. A consistent estimator of \(\varvec{\beta }_j\) can then be obtained by least-squares regression of (2.13). Ahn and Powell (1993) suggest pairwise differencing between \(y^j_i\) and \(y^j_l\) conditional on \(h(\varvec{z}_i)= h(\varvec{z}_l)\) to eliminate the sample selection effect,
$$\begin{aligned} y^j_i- y^j_l=(\varvec{x}_i-\varvec{x}_l)' \varvec{\beta }_j+ (\eta ^j_i- \eta ^j_l), \quad j=0,1. \end{aligned}$$
(2.14)
All these methods can be applied straightforwardly to panel data (i.e., \(T\ge 2\)) if \(\varepsilon ^1_{it}\) and \(\varepsilon ^0_{it}\) are independently, identically distributed over i and t. However, the availability of panel data allows one to relax this assumption, and the assumption that \(E(\varepsilon ^j| \varvec{x})=0\) ((2.3) and (2.4)), by allowing \(E(\varepsilon ^j | \varvec{x}) \ne 0\) through decomposing \(\varepsilon ^j_{it}\) into the sum of two parts,
$$\begin{aligned} \varepsilon ^j_{it} = \delta ^j_{it} + u^j_{it}, \end{aligned}$$
(2.15)
such that \(E(\varvec{x}_{it}\delta ^j_{it})\ne 0\) and \(E(\varvec{x}_{it}u^j_{it})=0\) (e.g., Sickles 2005).
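As a concrete illustration of Robinson's (1988) differencing idea in (2.12)-(2.13), the sketch below replaces the conditional means \(E(y|\varvec{z})\) and \(E(\varvec{x}|\varvec{z})\) by Nadaraya-Watson kernel estimates before running OLS on the differences; the data-generating design and the bandwidth are illustrative assumptions.

```python
# Sketch: Robinson (1988) differencing estimator for (2.13), with E(y|z)
# and E(x|z) replaced by Nadaraya-Watson kernel estimates.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
z = rng.uniform(-2, 2, size=n)
x = np.sin(z) + rng.normal(scale=0.5, size=n)            # x correlated with z
y = 1.5 * x + np.cos(z) + rng.normal(scale=0.3, size=n)  # cos(z) ~ gamma^j(z)

def nw_mean(target, z, h=0.2):
    """Nadaraya-Watson estimate of E(target | z) at every sample point."""
    w = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
    return (w * target[None, :]).sum(axis=1) / w.sum(axis=1)

y_tilde = y - nw_mean(y, z)        # y_i - E(y_i | z_i)
x_tilde = x - nw_mean(x, z)        # x_i - E(x_i | z_i)
beta_hat = (x_tilde @ y_tilde) / (x_tilde @ x_tilde)   # OLS on (2.13)
print("beta estimate:", round(beta_hat, 3))            # close to 1.5
```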
When
$$\begin{aligned} \delta ^j_{it}= \alpha ^j_i \quad \text {for} \quad t=1,\ldots , T, \end{aligned}$$
(2.16)
and \(u^j_{it}\) is i.i.d. over i and t, Honore (1992) and Honore and Kyriazidou (2000a) suggested first taking the difference over time to eliminate the individual-specific effects \((\alpha ^j_i, \ j=0,1)\), then applying the Ahn and Powell (1993) pairwise difference method to get rid of the sample selection effects.
When the part of the error term correlated with \(\varvec{x} _{it}\), \(\delta ^j_{it}\), takes the interactive form,
$$\begin{aligned} \delta ^j_{it} = \varvec{\lambda }'_{i} \varvec{f}_t,\end{aligned}$$
(2.17)
where \(\varvec{f}_t\) is an r-dimensional vector of common factors that stay the same across individuals but vary over time and \(\varvec{\lambda }_i\) is an r-dimensional factor loading vector that stays constant over time but varies across i, Kong and Hsiao (2025) suggest first following Robinson's procedure to remove the sample selection effects, then applying the Pesaran (2006) common correlated effects approach or the Hsiao et al. (2022a) transformation approach to estimate \(\varvec{\beta }\). When neither the conditional mean functions \(g_1 (\varvec{x})\) and \(g_0 (\varvec{x})\) nor the joint distribution function \(f(\varepsilon ^1, \varepsilon ^0, v)\) is known, nonparametric methods can be used under the unconfoundedness assumption (i.e., no sample selection effects) to identify \(E(y|\varvec{x})\) (e.g., Li and Racine 2007). However, nonparametric methods suffer from the curse of dimensionality; Stone (1980), Imbens and Angrist (1994), Imbens and Lemieux (2008), Rosenbaum and Rubin (1983, 1985), etc., have suggested various methods to get around the curse of dimensionality.
Table 1
Advantages and disadvantages—parametric, semiparametric, and nonparametric approaches

| | Advantage | Disadvantage |
|---|---|---|
| Parametric approach | Simultaneously controls the selection-on-observables and selection-on-unobservables issues. Estimates the average treatment effects (ATE) and the impact of each factor. Can obtain efficient estimates of the parameters of the conditional mean function | Requires specification of the conditional mean function and of the probability distribution of the impact of omitted factors |
| Semiparametric approach | Simultaneously controls the selection-on-observables and selection-on-unobservables issues. Estimates the impact of most (or some) factors on the outcomes and the ATE (in some cases). No need to specify the probability distribution of the impact of omitted factors | Requires specification of the conditional mean function. Estimates, although they may achieve the same speed of convergence as the parametric approach, are less efficient |
| Nonparametric approach | No need to specify the conditional mean function or the probability distribution of the impact of omitted factors | Unconfoundedness is the maintained hypothesis. Curse of dimensionality |
Table 1 summarizes the advantages and disadvantages of the parametric, semiparametric, and nonparametric methods of constructing the counterfactuals. Essentially, the advantage of the parametric and semiparametric approaches to estimating the treatment effect is that they can simultaneously take account of selection on observables and selection on unobservables. The disadvantage is that the conditional mean functions \( E(y^1 | \varvec{x})\) and \( E(y^0|\varvec{x})\) are assumed known. The advantage of the nonparametric approach is that there is no need to make any assumption about the conditional mean function or the joint distribution of the random error terms. The disadvantage is that some sort of unconfoundedness assumption has to be made, which is a maintained hypothesis, not a testable one. In other words, the advantages of the parametric or semiparametric approach are the disadvantages of the nonparametric approach, and vice versa. Unfortunately, without precise knowledge of how observables interact with unobservables, it is hard to choose between the two.
3 Non-causal approach
As discussed in the Introduction, measuring the treatment effects is essentially a prediction issue. The focus of the causal approach is to identify the parameters of the data-generating process of \(y_{it}\). Knowing the data-generating processes of \(y^1_{it}\) and \(y^0_{it}\) provides useful information for generating a good prediction. However, it is not essential. Anything that is correlated with \(y^1_{it}\) or \(y^0_{it}\) could help prediction, even if it is not causal. In that sense, the non-causal approach is less restrictive: no assumptions about the data-generating process of \(y_{it}\) need to be made. In this section, we consider non-causal approaches to generating predictions, very much in the spirit of modeling time series data based on autocorrelations and partial autocorrelations, etc. (e.g., Box and Jenkins 1970). However, no lagged variables of treated units will be taken into consideration because they could be subject to the impact of treatment.
The non-causal approach to generating predictions only cares about how to generate good predictions; it is not concerned with identifying the parameters of the true data-generating process. We shall therefore assume that the nontreated outcomes are unconfounded, i.e.,
$$\begin{aligned} f(y_{it}| d_{1t}) =f(y_{it}), \ \text {for} \ i=1,\ldots , N; \ t=1,\ldots , T. \end{aligned}$$
(3.1)
We consider two data-driven modeling approaches to generating predictions, the factor approach and the linear projection approach, under the assumption that the model parameters stay constant over time.3 Namely, our focus is only on measurement.4
(a) Factor Modeling
We assume the strong cross-sectional and time dependence of \(y_{it}\) across the N cross-sectional units over the T time periods can be captured by a factor model of the form
$$\begin{aligned} y_{it}=\varvec{\lambda }_{i}^{\prime }\varvec{f}_{t}+u_{it},\quad i=1,\ldots ,N;t=1,\ldots ,T, \end{aligned}$$
(3.2)
where \(\varvec{f}_t\) is an r-dimensional vector of common factors that are the same across i but vary over t, \(\varvec{\lambda }_i\) is an r-dimensional vector of factor loadings that stays constant over t but varies across i to represent the innate differences or endowments between individuals, and \(u_{it}\) is a random error term that, conditional on \(\varvec{\lambda }_i\) and \(\varvec{f}_t\), has mean zero, \(E(u_{it}|\varvec{\lambda }_i, \varvec{f}_t)=0\), but could be weakly cross-correlated.5
Stacking all N cross-sectional units one after another at time t, \(\varvec{y}_{t}=(y_{1t},y_{2t},\ldots ,y_{Nt})^{\prime }=(y_{1t},\varvec{\tilde{y}}_{t}^{\prime })^{\prime }\), we have
$$\begin{aligned} \textbf{y}_{t}=\Lambda \varvec{f}_{t}+\textbf{u}_{t},\quad \ \ t=1,\ldots ,T, \end{aligned}$$
(3.3)
where \(\Lambda =(\varvec{\lambda }_1, \varvec{\lambda }_2, \ldots , \varvec{\lambda }_{N})'=(\varvec{\lambda }_1, \tilde{\Lambda }')'\) and \(\varvec{u}_t = ( u_{1t}, \ldots , u_{Nt})'= (u_{1t},\tilde{\varvec{u}}'_t )'\). Alternatively, we can stack the ith individual's T time series observations as \(\textbf{y}_{i}=(y_{i1},\ldots ,y_{iT})^{\prime }\),
$$\begin{aligned} \textbf{y}_{i}=F \varvec{\lambda }_{i}+\textbf{u}_{i},\quad \ \ i=1,\ldots ,N, \end{aligned}$$
(3.4)
where \(F=(\varvec{f}_1, \ldots , \varvec{f}_T)'\) and \(\varvec{u}_i=(u_{i1}, \ldots , u_{iT})'\).
The common assumptions for \(\varvec{f}_t, \varvec{\lambda }_i\) and \(u_{it}\) are:
Assumption 1
The factor process satisfies \(E\left\| \varvec{f} _{t}\right\| ^{4}\le M<\infty \) and \(\frac{1}{T}\sum _{t=1}^{T}\varvec{f} _{t}\varvec{f}_{t}^{\prime }\rightarrow _{p}\Sigma _{f},\) where \(\Sigma _{f}\) is an \(r\times r\) non-singular constant matrix.
Assumption 2
The loading \(\varvec{\lambda }_{i}\) is either a fixed constant or stochastic with \(E\left\| \varvec{\lambda }_{i}\right\| ^{4}\le M<\infty .\) In either case, \(\frac{1}{N} \sum _{i=1}^{N}\varvec{\lambda }_{i}\varvec{\lambda }_{i}^{\prime }\rightarrow _{p}\Sigma _{\lambda },\) where \( \Sigma _{\lambda }\) is an \(r\times r\) non-singular constant matrix.
We merge the impact of those common components that only exert influence over a finite number of cross-sectional units into \(u_{it}\) by allowing the \(N \times 1\) vector \(\varvec{u}_{t}\) to be weakly cross-dependent.
Assumption 3
The random error terms \(\textbf{u}_{t}=(u_{1t}, \ldots , u_{Nt})'\) are independently, identically distributed over t with non-singular covariance matrix
$$\begin{aligned} E(\varvec{u}_{t}\varvec{u}_{t}^{\prime })=\tilde{\Omega }=\left( \begin{array}{cc} \sigma _{1}^{2} & \varvec{c}^{\prime } \\ \varvec{c} & \Omega \end{array} \right) , \end{aligned}$$
where \(\sigma _{1}^{2}=E\left( u_{1t}^{2}\right) \), \(\Omega =E(\tilde{\textbf{u}}_{t}\tilde{\textbf{u}}_{t}^{\prime })\), and \(\textbf{c}=E(\tilde{\textbf{u}}_{t}u_{1t})\). Moreover, all N nonzero eigenvalues of \(\tilde{\Omega } \) are O(1).
Modeling panel data by a factor model is a very useful dimension-reduction approach to summarizing the variation across individuals (i) over time (t) (e.g., Anderson and Rubin 1956; Lawley and Maxwell 1971). Factor models are widely applied in macro- and financial economics (e.g., Chamberlain and Rothschild 1983; Connor and Korajczyk 1986; Forni et al. 1998; Ross 1976; Sargent and Sims 1977) and are also used to generate parsimonious predictive models for high-dimensional time series data (e.g., Stock and Watson 1989, 2002). Factor models are also used as the basis for the panel approach to constructing counterfactuals to measure the treatment effects of a social program (e.g., Hsiao et al. 2012).
Under Assumptions 1 and 2, \(\varvec{\lambda }_1\) and \(\varvec{f}_{t}\) are not separately identified. Since our focus is on prediction, not on identifying the parameters of the data-generating process of (3.2), there is no loss of generality in following Anderson and Rubin (1956), Bai (2003, 2009), etc., and using the normalization conditions \( \Sigma _\lambda =I_r\) and \(\Sigma _f\) diagonal. Then, conditional on \(Y^T =(\varvec{y}_1, \ldots , \varvec{y}_T)\), \(\Lambda \) can be estimated as \(\sqrt{N} \) times the eigenvectors corresponding to the r largest eigenvalues of the determinant equation
$$\begin{aligned} \left| \frac{1}{T}\sum \limits _{t=1}^{T}\varvec{y}_{t}\varvec{y} _{t}^{\prime }-\delta \ \varvec{I}_{N}\right| =0. \end{aligned}$$
(3.5)
Conditional on \(\hat{\Lambda } = (\varvec{\hat{\lambda }}_1,\ldots ,\varvec{\hat{\lambda }}_N)' = (\varvec{\hat{\lambda }}_1,\hat{\tilde{\Lambda }}')'\), \(\varvec{f}_t\) can be estimated by
$$\begin{aligned} \varvec{\hat{f}}_t=(\hat{\tilde{\Lambda }}' \hat{\tilde{\Lambda }})^{-1} \hat{\tilde{\Lambda }}' \varvec{\tilde{y}}_t. \end{aligned}$$
(3.6)
(b) Linear Projection Modeling6
Alternatively, we can express \(y_{1t}\) as a function of \(\varvec{\tilde{y}}_t=(y_{2t},\ldots ,y_{Nt})'\),
$$\begin{aligned} y_{1t}=E^*(y_{1t}|\tilde{\varvec{y}} _t)+ \eta _{t} = \varvec{w}' \tilde{\varvec{y}} _t + \eta _t, \end{aligned}$$
(3.7)
where \(E^*(y_{1t}|\tilde{\varvec{y}} _t)\) denotes the linear projection of \(y_{1t}\) on \(\tilde{\varvec{y}}_t\), or the conditional mean of \(y_{1t}\) given \(\tilde{\varvec{y}}_t\) if the conditional mean is linear in \(\tilde{\varvec{y}} _t\) (e.g., if \((y_{1t}, \tilde{\varvec{y}}'_t)\) is Gaussian). Under Assumptions 1-3, the coefficients \(\varvec{w}\) are related to the underlying factor model as
$$\begin{aligned} \varvec{w} = [ E (\tilde{\varvec{y}} _t \tilde{\varvec{y}}'_t)]^{-1} E(\tilde{\varvec{y}}_t y_{1t})=( \tilde{\Lambda }\Sigma _f \tilde{\Lambda }'+ \Omega )^{-1} (\tilde{\Lambda } \Sigma _f \varvec{\lambda }_1+ \varvec{c}), \end{aligned}$$
(3.8)
with \(\tilde{\Lambda }\) denoting the factor loading matrix of the control units \(\tilde{\varvec{y}}_t\). The coefficients \(\varvec{w}\) can be estimated from \(Y^T\) by
$$\begin{aligned} \varvec{\hat{w}}=\left( \sum ^T_{t=1}\varvec{\tilde{y}}_t \varvec{\tilde{y}'}_t \right) ^{-1}\left( \sum ^T_{t=1}\varvec{\tilde{y}}_t y_{1t}\right) . \end{aligned}$$
(3.9)
The error term \(\eta _t\) is, by construction, orthogonal to \(\varvec{\tilde{y}}_t\) with mean square error \(\sigma ^2_\eta =E(\eta ^2_t)\), where
$$\begin{aligned} \sigma _{\eta }^{2} =\sigma _{1}^{2} +\varvec{\lambda }_{1}^{\prime }\Sigma _{f}\varvec{\lambda } _{1} -( \varvec{\lambda }_{1}'\Sigma _{f}\tilde{\Lambda }'+ \varvec{c}') (\tilde{\Lambda } \Sigma _{f}\tilde{\Lambda }^{\prime }+\Omega )^{-1} (\tilde{\Lambda }\Sigma _{f}\varvec{\lambda }_1+\varvec{c}). \end{aligned}$$
(3.10)
The LP model is closely related to the panel data approach (PDA) and the synthetic control method (SCM).
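The LP predictor is just an OLS regression of the treated unit's pre-treatment outcomes on the control units' outcomes. A minimal sketch, under an illustrative factor design of the form (3.2), is:

```python
# Sketch: the LP predictor (3.7)-(3.9). Estimate w by OLS on the
# pre-treatment sample, then predict the post-treatment counterfactual
# of the treated unit as w_hat' y_tilde_{T+h}. Illustrative design only.
import numpy as np

rng = np.random.default_rng(2)
N, T, m, r = 20, 100, 10, 2
Lam = rng.normal(size=(N, r))                      # loadings lambda_i
F = rng.normal(size=(T + m, r))                    # common factors f_t
Y = F @ Lam.T + 0.3 * rng.normal(size=(T + m, N))  # (3.2), no treatment

w_hat = np.linalg.lstsq(Y[:T, 1:], Y[:T, 0], rcond=None)[0]  # (3.9)
y1_hat = Y[T:, 1:] @ w_hat                    # counterfactual prediction
delta_hat = Y[T:, 0] - y1_hat                 # estimated treatment effects
print("mean 'effect' under no treatment (near 0):", round(delta_hat.mean(), 3))
```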
(i) The PDA Approach
Let \( \varvec{z}_t\) denote all observed variables that are independent of \(d_{1t}\) at time t. For simplicity, we let \( \varvec{z}_t= (y_{2t},\ldots , y_{Nt},\varvec{x}'_{1t}, \ldots , \varvec{x}'_{Nt})'\). Under the assumption that
$$\begin{aligned} \varvec{z}_{t} \perp d_{1t}, \end{aligned}$$
(3.11)
Hsiao et al. (2012) (HCW) suggest approximating \(y_{1t}\) through
$$\begin{aligned} y_{1t}=E^*(y_{1t}|\varvec{z}_t)+ \eta _{1t}, \ t=1,\ldots , T, \end{aligned}$$
(3.12)
where \(E(\eta _{1t}| \varvec{z}_t)=0\). Then \(E^*(y_{1t}|\varvec{z}_t)\) is an unbiased predictor for \(y_{1t}\) conditional on \(\varvec{z}_t\), as shown in (3.7). HCW suggest approximating (3.12) by a subset of \(\varvec{z}_t\),
$$\begin{aligned} y_{1t}= \ a+ \varvec{c}' \varvec{z}^*_t + \eta _{1t}, \end{aligned}$$
(3.13)
where \(\varvec{z}^*_{t}\) is selected through a model selection procedure, while Li and Bell (2017) propose to use LASSO (Tibshirani 1996).
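When N is large, the subset \(\varvec{z}^*_t\) can be selected by penalized regression. The following is a minimal sketch of the LASSO selection idea of Li and Bell (2017) using scikit-learn; the penalty level and the simulated design are illustrative assumptions, not the authors' settings.

```python
# Sketch: selecting z*_t in (3.13) with LASSO (Tibshirani 1996).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
T, N = 80, 40
Z = rng.normal(size=(T, N - 1))              # candidate control outcomes
w_true = np.zeros(N - 1)
w_true[:4] = [0.5, 0.3, -0.4, 0.6]           # only a few controls matter
y1 = Z @ w_true + 0.2 * rng.normal(size=T)

lasso = Lasso(alpha=0.05).fit(Z, y1)         # shrinks irrelevant coefficients
print("selected control units:", np.flatnonzero(lasso.coef_))
```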
When \(\varvec{z}_t = \tilde{\varvec{y}}_t\), the PDA approach is identical to the LP approach. The process of selecting a subset \(\varvec{z}^*_t\) of \(\varvec{z}_t\) in (3.13) is equivalent to considering a subset of \(\tilde{\varvec{y}}_t\), say \(\tilde{\varvec{y}}^*_t\). As long as the dimension of \(\tilde{\varvec{y}}^*_t\) is greater than r,
$$\begin{aligned} \left( \begin{array}{cc} y_{1t} \\ \varvec{\tilde{y}}^*_t \end{array} \right) = \left( \begin{array}{cc} \varvec{\lambda }'_1 \\ \tilde{\Lambda }^* \end{array} \right) \varvec{f}_t + \left( \begin{array}{cc} u_{1t} \\ \varvec{\tilde{u}}^*_t \end{array} \right) , \end{aligned}$$
(3.14)
and the LP of \(y_{1t}\) on \(\tilde{\varvec{y}}^*_t\) is equal to
$$\begin{aligned} y_{1t}= \varvec{w}^{*'} \varvec{\tilde{y}^*_t} + \eta ^*_t, \end{aligned}$$
(3.15)
with
$$\begin{aligned} & \varvec{w}^* = (\tilde{\Lambda }^* \Sigma _f \tilde{\Lambda }^{*'}+ \Omega ^*)^{-1}(\tilde{\Lambda }^{*} \Sigma _f \varvec{\lambda }_1 + \varvec{c}^*), \end{aligned}$$
(3.16)
and
$$\begin{aligned} & \sigma _{\eta ^*}^{2} =\sigma _{1}^{2} +\varvec{\lambda }_{1}^{\prime }\Sigma _{f}\varvec{\lambda } _{1} - (\varvec{\lambda }_{1}'\Sigma _{f} \tilde{\Lambda }^{*'}+ \varvec{c}^{*'}) (\tilde{\Lambda }^* \Sigma _{f}\tilde{\Lambda }^{*\prime }+\Omega ^*)^{-1} (\tilde{\Lambda }^*\Sigma _{f}\varvec{\lambda }_1+\varvec{c}^*), \end{aligned}$$
(3.17)
where \( \Omega ^* = E(\varvec{\tilde{u}}^*_t \varvec{\tilde{u}}^{*'}_t)\) and \(\varvec{c}^*=E( \varvec{\tilde{u}}^*_t u_{1t})\).
(ii) Synthetic Control Method
Abadie et al. (2010, 2015) proposed the synthetic control method (SCM), which predicts \(y_{1t}\) by
$$\begin{aligned} y_{1t}= \sum _{i=2}^{N} b_{i}y_{it}, \ \ t=T +1, \ldots , T+m, \end{aligned}$$
(3.18)
where the \(y_{it}\) are selected from those control units whose data-generating processes are similar to that of \(y_{1t}\). The \(b_i\) are obtained by minimizing
$$\begin{aligned} \left[ \left( \begin{array}{cc} \varvec{y}_{1} \\ \varvec{\tilde{x}}_1 \end{array} \right) - \left( \begin{array}{cc} Y \\ \tilde{X} \end{array} \right) \varvec{b} \right] ' V \left[ \left( \begin{array}{cc} \varvec{y}_{1} \\ \varvec{\tilde{x}}_1 \end{array} \right) - \left( \begin{array}{cc} Y \\ \tilde{X} \end{array} \right) \varvec{b} \right] , \end{aligned}$$
(3.19)
subject to the constraints
$$\begin{aligned} b_i \ge 0, \ \text {and} \ \sum ^N_{i=2} b_i=1, \end{aligned}$$
(3.20)
where \(\varvec{y}_1\) and Y denote the \(T \times 1\) vector and the \(T \times (N-1)\) matrix of pre-treatment \(y_{1t}\) and \(y_{jt}, j=2, \ldots , N\), respectively, \(\tilde{\varvec{x}}_1\) and \(\tilde{X}\) denote the pre-treatment time series averages of \(\varvec{x}_{1t}\) and \(\varvec{x}_{jt}, j=2, \ldots , N\), respectively, and V is a positive definite matrix.
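The SCM weights can be computed by constrained least squares. A minimal sketch with \(V = I\) and no covariates, an illustrative simplification of (3.19)-(3.20), is:

```python
# Sketch: SCM weights of (3.19)-(3.20) with V = I and no covariates.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
T, N = 50, 8
Y = rng.normal(size=(T, N - 1)).cumsum(axis=0)      # pre-treatment controls
b_true = np.array([0.5, 0.3, 0.2, 0, 0, 0, 0])
y1 = Y @ b_true + 0.1 * rng.normal(size=T)          # pre-treatment treated

def loss(b):                                        # (3.19) with V = I
    e = y1 - Y @ b
    return e @ e

res = minimize(loss, np.full(N - 1, 1.0 / (N - 1)), method="SLSQP",
               bounds=[(0, None)] * (N - 1),        # b_i >= 0
               constraints={"type": "eq", "fun": lambda b: b.sum() - 1.0})
print("SCM weights:", res.x.round(3))
```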
Conditional on \((\varvec{y},\tilde{X})\) being independent of \(d_{1t}\), the difference between PDA and SCM is that the former is an unconstrained regression, while the latter restricts the regression model (3.18) to have intercept \(a=0\) and \(\varvec{c}\) satisfying (3.20). If the restrictions are correct, then the SCM is more efficient. If the restrictions are not correct, SCM could lead to biased predictions of the counterfactuals while the PDA remains unbiased. For a general discussion of LP, PDA, and SCM, see Gardeazabal and Vega-Bayo (2016) and Wan et al. (2018).
(c) Prediction Error Comparison
Under model (3.2) or (3.7), the best predictor for \(y_{1,T+h}\) is
$$\begin{aligned} \tilde{y}_{1,T+h}= \varvec{\lambda }'_1 \varvec{f}_{T+h} \end{aligned}$$
(3.21)
or
$$\begin{aligned} \hat{y}_{1,T+h}= \varvec{w}' \tilde{\varvec{y}} _{T+h}, \end{aligned}$$
(3.22)
respectively. However, \(\Lambda \), F, and \(\varvec{w}\) are unknown. With \(\Lambda \) estimated by (3.5) and conditional on \(\hat{\Lambda } = (\varvec{\hat{\lambda }}_1,\ldots ,\varvec{\hat{\lambda }}_N)' = (\varvec{\hat{\lambda }}_1,\hat{\tilde{\Lambda }}')'\), \(\varvec{f}_{T+h}\) can be estimated by
$$\begin{aligned} \varvec{\hat{f}}_{T+h}=(\hat{\tilde{\Lambda }}' \hat{\tilde{\Lambda }})^{-1} \hat{\tilde{\Lambda }}' \varvec{\tilde{y}}_{T+h}, \quad h=1, \ldots , m. \end{aligned}$$
(3.23)
Substituting \(\hat{\varvec{\lambda }}_1\) and \(\hat{\varvec{f}}_{T+h}\) into (3.21), the prediction error of \(\hat{\tilde{y}}_{1,T+h} = \varvec{\hat{\lambda }}'_{1} \varvec{\hat{f}}_{T+h}\) is
$$\begin{aligned} \tilde{\varphi }_{1,T+h}= y_{1,T+h}-\hat{\tilde{y}}_{1,T+h}=u_{1,T+h} +(\varvec{\lambda }'_1 \varvec{f}_{T+h} - \varvec{\hat{\lambda }}'_1 \varvec{\hat{f}}_{T+h} ). \end{aligned}$$
(3.24)
When N is fixed and \(T\rightarrow \infty \), \(\hat{\Lambda }\) is \(\sqrt{T}\) consistent, but \(\varvec{\hat{f}}_{T+h} - \varvec{f}_{T+h}= O(\frac{1}{\sqrt{N}})\). When \((N,T) \rightarrow \infty \), Bai (2003) showed that the asymptotic variance of \(\tilde{\varphi }_{1,T+h}\) is
$$\begin{aligned} \text {Var} (\tilde{\varphi }_{1,T+h})= \sigma ^2_1 + \frac{1}{N} \varvec{\lambda }'_1 \ \Sigma ^{-1}_\lambda \left( \frac{1}{N}\Lambda ' \Omega \Lambda \right) \Sigma ^{-1}_{\lambda } \varvec{\lambda }_1 + \frac{\sigma ^2_1}{T} \varvec{f}'_{T+h} \Sigma ^{-1}_f \varvec{f}_{T+h} + o(1). \end{aligned}$$
(3.25)
The linear projection coefficients \(\varvec{w}\) based on \(Y^T =(\varvec{y}_1,\ldots , \varvec{y}_T)\) are estimated by (3.9). Substituting (3.9) into (3.22), the error of predicting \(y_{1,T+h}\) by \(\hat{\hat{y}}_{1,T+h}=\varvec{\hat{w}'}\varvec{\tilde{y}}_{T+h} \) is
$$\begin{aligned} \hat{\varphi }_{1,T+h}= y_{1,T+h} - \varvec{\hat{w} }' \tilde{\varvec{y}}_{T+h}, \quad h=1,\ldots , m, \end{aligned}$$
(3.26)
with prediction error variance
$$\begin{aligned} \text {Var} \left( \hat{\varphi }_{1,T+h}\right) = \sigma ^2_\eta \left[ 1+\varvec{\tilde{y}}'_{T+h} \left( \sum \limits _{t=1}^{T}\tilde{\textbf{y}}_{t} \tilde{\textbf{y}} _{t}^{\prime }\right) ^{-1} \varvec{\tilde{y}}_{T+h} \right] . \end{aligned}$$
(3.27)
We note that a good estimate of \(\varvec{\lambda }_1\) (or \(\hat{\Lambda }\)) requires T to go to infinity, and a good estimate of \(\varvec{f}_{T+h}\) requires N to go to infinity. On the other hand, a good estimate of \(\varvec{w}\) only requires T to go to infinity. When N or T is finite, the predictor \(\varvec{\hat{\lambda }'}_1 \varvec{\hat{f}}_{T+h}\) may be a biased predictor of \(y_{1,T+h}\), and the mean square prediction error of (3.24) depends on the configuration of N and T. On the other hand, the LP predictor \(\hat{\hat{y}}_{1,T+h}\) is always unbiased conditional on \((\varvec{\tilde{y}}_1, \ldots , \varvec{\tilde{y}}_T,\varvec{\tilde{y}}_{T+h})\) because \(\varvec{\hat{w}}\) is an unbiased estimator of \(\varvec{w}\).
(i) Case 1: \((N,T)\rightarrow \infty \). When \(\frac{N}{T}\rightarrow a\), \(0<a<\infty ,\) the asymptotic mean square prediction errors (MSPE) of (3.24) and (3.26) are identical, \(MSPE\left( \hat{\tilde{y}}_{1,T+h}\right) =MSPE\left( \hat{\hat{y}}_{1,T+h}\right) \), if \(u_{1t}\) is independent of \(u_{jt}\) for \(j\ne 1.\) If \(u_{1t}\) is correlated with \(\tilde{\textbf{u}}_{t},\) then (3.26) has smaller MSPE than (3.24).
The reason the LP approach \((\hat{\hat{y}}_{1,T+h})\) is in general more efficient than the factor-based (FB) approach \((\hat{\tilde{y}}_{1,T+h})\) in terms of MSPE is that the LP approach is able to take into account the correlation between \(u_{1t}\) and \(\tilde{\textbf{u}}_{t}\) while the FB approach does not. Should one replace the predictor \(\varvec{\hat{\lambda }}'_1 \varvec{\hat{f}}_{T+h} \) by \(\varvec{\hat{\lambda }}'_1 \varvec{\hat{f}}_{T+h} + E(u_{1,T+h}| \varvec{\tilde{u}}_{T+h}) \) (see, e.g., Hsiao and Zhou 2019), we would expect it to have the same prediction error as the LP predictor. However, the LP approach is computationally more convenient because it only requires an OLS regression of \(y_{1t}\) on \(\tilde{\textbf{y}} _{t}\), while the FB approach requires identifying the number of factors, and the principal component estimation of the latent factor structure is more laborious.
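This efficiency comparison can be seen in a small simulation. The sketch below contrasts the FB predictor, estimated by principal components with the normalization of (3.5), with the LP predictor of (3.9), under a factor design in which \(u_{1t}\) is correlated with the control units' errors; all design parameters are illustrative.

```python
# Sketch: Monte Carlo comparison of the FB predictor (3.21) and the LP
# predictor (3.22) when u_1t is correlated with the controls' errors.
import numpy as np

rng = np.random.default_rng(5)
N, T, r, reps = 30, 120, 2, 200
mspe_fb = mspe_lp = 0.0
for _ in range(reps):
    Lam = rng.normal(size=(N, r))
    F = rng.normal(size=(T + 1, r))
    u = rng.normal(size=(T + 1, N))
    u[:, 0] += 0.8 * u[:, 1]              # c != 0: u_1t correlated with u_2t
    Y = F @ Lam.T + 0.5 * u
    # FB: principal components on the T pre-treatment periods
    evals, evecs = np.linalg.eigh(Y[:T].T @ Y[:T] / T)
    Lam_hat = np.sqrt(N) * evecs[:, -r:]  # sqrt(N) x top-r eigenvectors
    f_hat = np.linalg.lstsq(Lam_hat[1:], Y[T, 1:], rcond=None)[0]  # (3.23)
    y_fb = Lam_hat[0] @ f_hat
    # LP: OLS of y_1t on the control units, as in (3.9)
    w_hat = np.linalg.lstsq(Y[:T, 1:], Y[:T, 0], rcond=None)[0]
    y_lp = Y[T, 1:] @ w_hat
    mspe_fb += (Y[T, 0] - y_fb) ** 2 / reps
    mspe_lp += (Y[T, 0] - y_lp) ** 2 / reps
print(f"MSPE FB: {mspe_fb:.3f}   MSPE LP: {mspe_lp:.3f}")
```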
(ii) Case 2: N fixed, \(T\rightarrow \infty \). As long as \(N>r,\) when \(T\rightarrow \infty ,\) the LP predictor \(\varvec{\hat{w}}' \varvec{\tilde{y}} _{T+h}\) has smaller mean square prediction error (MSPE) than the FB predictor \(\varvec{\hat{\lambda }}'_1 \varvec{\hat{f}}_{T+h}\).7
(iii) Case 3: T fixed, \(N\rightarrow \infty \).
It is not feasible to directly implement the LP approach (3.9) when \(N> T\) because \(\frac{1}{T}\sum \limits _{t=1}^{T}\tilde{\textbf{y}}_{t}\tilde{\textbf{y}} _{t}^{\prime }\) could be a near-singular matrix. Therefore, with finite T, \(\varvec{\hat{w}}\) is not likely to be a good estimator of \(\varvec{w}\). However, when N is large, it is not unreasonable to assume that the N cross-sectional units are randomly drawn at time t. Hsiao and Zhou (2024) have suggested randomly breaking up the \(\left( N-1\right) \) control units into G subgroups, each consisting of \(N_{g}\) cross-sectional units with \(N_{g}\) less than T, and then using LP to generate the gth group's predicted value of \( y_{1,T+h}\) as
$$\begin{aligned} \hat{y}_{1,T+h}^{g}=\hat{\textbf{w}}_{g}^{\prime }\tilde{\textbf{y}} _{T+h}^{g}, \ \ g=1,\ldots ,G, \end{aligned}$$
(3.28)
where
$$\begin{aligned} \hat{\textbf{w}}_{g}=\left( \sum \limits _{t=1}^{T}\tilde{\textbf{y}}_{t}^{g} \tilde{\textbf{y}}_{t}^{g\prime }\right) ^{-1}\sum \limits _{t=1}^{T}\tilde{\textbf{y}}_{t}^{g}y_{1t}, \end{aligned}$$
(3.29)
and \(\tilde{\textbf{y}}_{t}^{g}\) is an \(N_{g}\times 1\) vector consisting of the \( N_{g}\) cross-sectional units that belong to the gth subgroup, i.e., \( \tilde{\textbf{y}}_{t}^{g}=\left( 1_{\left( i\in g\right) }y_{it}\right) ,\) \( g=1,\ldots ,G.\) The predicted value of \(y_{1,T+h}\) is then generated as the average of the G predictors \(\left\{ \hat{y}_{1,T+h}^{g}:g=1,\ldots ,G\right\} \),
$$\begin{aligned} \widehat{\bar{y}}_{1,T+h}^{G}=\frac{1}{G}\sum \limits _{g=1}^{G}\hat{y} _{1,T+h}^{g}. \end{aligned}$$
(3.30)
Hsiao and Zhou (2024) have shown that when T is fixed and \( N\rightarrow \infty ,\) the mean square prediction error (MSPE) of \(\varvec{\hat{\lambda }}'_{1} \varvec{\hat{f}}_{T+h}\) is greater than that of (3.30).
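A minimal sketch of this group-averaged LP predictor, with group size and design chosen purely for illustration, is:

```python
# Sketch: Hsiao-Zhou (2024) group-averaged LP predictor (3.28)-(3.30)
# for the case N > T.
import numpy as np

rng = np.random.default_rng(6)
N, T, r, Ng = 200, 40, 2, 10                   # many controls, short panel
Lam = rng.normal(size=(N, r))
F = rng.normal(size=(T + 1, r))
Y = F @ Lam.T + 0.5 * rng.normal(size=(T + 1, N))

controls = rng.permutation(np.arange(1, N))    # shuffle the N-1 controls
groups = np.array_split(controls, len(controls) // Ng)  # G subgroups, N_g < T
preds = []
for g in groups:
    w_g = np.linalg.lstsq(Y[:T, g], Y[:T, 0], rcond=None)[0]   # (3.29)
    preds.append(Y[T, g] @ w_g)                                # (3.28)
y1_hat = np.mean(preds)                                        # (3.30)
print("group-averaged LP prediction:", round(y1_hat, 3),
      " actual:", round(Y[T, 0], 3))
```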
(iv) Case 4: Both N and T are finite.
It is hard to compare the mean square prediction error between the FB and LP predictor. However, Hsiao and Zhou (2024) have argued that if post-treatment outcomes in the absence of treatment are similar to the outcomes before the treatment, the LP method is likely to be more accurate.
(v) Case 5: Combination of causal factors and factor modeling.
Xu (2017) and Hsiao and Zhou (2019) consider
$$\begin{aligned} & y_{it} =\textbf{x}_{it}^{\prime }\varvec{\beta }+\varepsilon _{it}, \end{aligned}$$
(3.31)
$$\begin{aligned} & \varepsilon _{it} =\varvec{\lambda }_{i}^{\prime }\textbf{f}_{t}+u_{it}, { \ \ }i=1,\ldots ,N; \ t=1,\ldots ,T, \end{aligned}$$
(3.32)
where \(\varvec{\lambda }_{i},\) \(\textbf{f}_{t}\) and \(u_{it}\) satisfy Assumptions 1-3, \(\textbf{x}_{it}\) could be correlated with \(\varvec{\lambda }_{i}\) and \(\varvec{f}_{t}\), but \(E(u_{it}|\varvec{x}_{it},\varvec{\lambda }'_{i}\varvec{f}_{t})=0\) (e.g., Bai 2009; Hsiao and Zhou 2019; Hsiao et al. 2022a; Pesaran 2006; Xu 2017). The error of predicting \(y_{it}\) then takes the form
$$\begin{aligned} \tilde{\varepsilon }_{it}=y_{it}-\hat{y}_{it}=\textbf{x}_{it}^{\prime }\left( \varvec{\beta }-\varvec{\hat{\beta }}\right) +\left( \varepsilon _{it}-\hat{ \varepsilon }_{it}\right) , \end{aligned}$$
(3.33)
i.e., the prediction error consists of two parts: the part due to the error of estimating \(\varvec{\beta }\) and the part due to \(\left( \varepsilon _{it}-\hat{\varepsilon }_{it}\right) .\) If the same method is applied to estimate \(\varvec{\beta },\) the part due to \(\textbf{x} _{it}^{\prime }\left( \varvec{\beta }-\varvec{\hat{\beta }}\right) \) in (3.33) is identical regardless of which method is used to predict \(\varepsilon _{it}\). Thus, the analysis of the relative merits of the FB and LP methods continues to hold for the model (3.31)-(3.32).
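Since the text cites Pesaran's (2006) common correlated effects approach for estimating \(\varvec{\beta }\) in this setting, the following is a bare-bones pooled CCE sketch: the common factors are proxied by cross-sectional averages of y and x, which the regression is projected off. This is a simplification for illustration, not the full CCE estimator, and the design is ours.

```python
# Sketch: pooled common correlated effects (CCE) estimation of beta in
# (3.31)-(3.32), proxying f_t by cross-sectional averages of y and x.
import numpy as np

rng = np.random.default_rng(7)
N, T, r, beta = 50, 60, 1, 2.0
F = rng.normal(size=(T, r))
Lam = rng.normal(size=(N, r))
X = F @ rng.normal(size=(r, N)) + rng.normal(size=(T, N))  # x correlated with f
Y = beta * X + F @ Lam.T + 0.5 * rng.normal(size=(T, N))   # (3.31)-(3.32)

ybar, xbar = Y.mean(axis=1), X.mean(axis=1)   # cross-sectional averages
H = np.column_stack([np.ones(T), ybar, xbar]) # factor proxies
M = np.eye(T) - H @ np.linalg.pinv(H)         # annihilator of the proxies
beta_hat = np.sum((M @ X) * (M @ Y)) / np.sum((M @ X) * (M @ X))
print("pooled CCE estimate of beta:", round(beta_hat, 3))   # near 2.0
```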
Table 2
Mean square prediction error (MSPE)-linear projection (LP) versus factor models (FB)

| | N fixed | \(N\rightarrow \infty \) |
|---|---|---|
| T fixed | No definite conclusion. The MSPE depends on the realized \(y_{it}\) and \(\textbf{x}_{it}\) | MSPE (modified LP) \(\le \) MSPE (FB) |
| \(T\rightarrow \infty \) | MSPE (LP) \(\le \) MSPE (FB) | MSPE (LP) \(\le \) MSPE (FB) |
We summarize the prediction error comparison between the FB and LP methods in terms of N and T in Table 2. The limited Monte Carlo studies conducted by Hsiao and Zhou (2024) show that the above analytical results based on large-sample analysis also hold when N and T are finite. Their empirical analysis of the effects of the 1990 German reunification based on the LP approach appears to show steady and plausible results, while the factor analysis appears to be erratic.
4 Multiple treated units
In Sects. 2 and 3, we considered the accuracy of measuring treatment effects under the assumption that a single unit receives the treatment. When there are multiple units receiving the treatment, in principle we can still apply the single equation approach one unit at a time and then aggregate the micro-predictions to obtain the aggregate predictions. Based on the criterion of \(|| \varvec{y} - \hat{\varvec{y}}||\) and the transversality argument, as long as one method is likely to generate a more accurate prediction for any treated unit, aggregating more accurately predicted units is likely to generate more accurate aggregate predictions whatever linear aggregation method is used (e.g., Hsiao et al. 2022b). However, predicting each treated unit one by one could be computationally laborious if there are many treated units.
(i) Distribution of Treatment Effects
An alternative approach could be first to aggregate the multiple treated units into a single unit, then use the single equation approach to generate predictions for the aggregated unit. However, aggregation could raise complicated issues, as discussed in Hsiao (2022). Moreover, there could be issues of whether summarizing the outcomes of multiple units in terms of some moment conditions (say, the ATE) or in terms of the distribution of individual outcomes yields more information. For instance, Maasoumi and Wang (2022) consider the measurement of the treatment effects of closing the earnings gender gap under policy option 1, where women and men are paid on the same scale conditional on human capital characteristics (structural effect), and policy option 2, where the current pay scale between men and women remains the same but women's human capital characteristics become the same as men's (composition effect). Based on US Current Population Survey data from 1976-2013, they found that for some quantiles of the distributions of treatment effects, option 1 could be preferred, but for other quantiles, option 2 could be preferred. However, it is difficult to summarize the information from such a decomposition-of-distributions analysis. To obtain a unique ranking, Maasoumi and Wang (2019, 2022) suggest a stochastic dominance ranking criterion within the class of weakly increasing utility functions u(y): when F(y) and G(y) are the distributions of treatment effects for options 1 and 2, F(y) first-order stochastically dominates G(y) if and only if, for every weakly increasing utility function u(y),
$$\begin{aligned} \int u(y) \ d F(y) \ge \int u(y) \ d G(y). \end{aligned}$$
(4.1)
The criterion (4.1) not only provides an unambiguous ranking when there exists dispersion of treatment effects between different policy options, but also allows checking the robustness of treatment effects comparisons through tests of stochastic dominance (e.g., Linton et al. 2005).
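For empirical distributions, first-order stochastic dominance can be checked directly through the CDFs, since (4.1) holding for all weakly increasing u is equivalent to \(F(y) \le G(y)\) for all y. A minimal sketch, with simulated treatment-effect draws standing in for the two policy options:

```python
# Sketch: checking first-order stochastic dominance (4.1) by comparing
# empirical CDFs: F dominates G iff F(y) <= G(y) for all y.
import numpy as np

rng = np.random.default_rng(8)
effects_1 = rng.normal(1.0, 1.0, size=5000)   # option 1 treatment effects
effects_2 = rng.normal(0.5, 1.0, size=5000)   # option 2 treatment effects

grid = np.linspace(-4.0, 6.0, 500)
F = np.searchsorted(np.sort(effects_1), grid, side="right") / effects_1.size
G = np.searchsorted(np.sort(effects_2), grid, side="right") / effects_2.size
print("option 1 first-order dominates option 2:", bool(np.all(F <= G)))
```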
(ii) Identifying Causal Factors
The prediction approach to measuring treatment effects is a non-causal approach. If the treatment effects \(\varvec{\hat{\Delta }}_{it}\) are different for different i (i.e., heterogeneous), this provides a way to link the non-causal approach with the causal approach. Consider
$$\begin{aligned} \hat{\Delta }_{it}=a+ \varvec{b}' \varvec{x}_{it} + \varepsilon _{it}.\end{aligned}$$
(4.2)
One may regress the estimated \(\hat{\Delta }_{it}\) on \(\varvec{x}_{it}\) to find the impact of changes in \(\varvec{x}\) on \(\Delta \). For example, Ke et al. (2017) showed that the impacts of China's high-speed rail projects are different for different localities. They then regressed the estimated treatment effects \(\hat{\Delta }_{it}\) on causal factors such as the industrial share of employment, the service share of employment, the size of state enterprise employment, university enrollment per 10,000 people, and the number of star hotels (an approximation of tourist attractiveness), and showed that these are important causal factors for the differences in treatment effects.8
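A minimal sketch of this second-stage regression, with simulated effects and covariates standing in for estimated treatment effects and candidate causal factors:

```python
# Sketch: the second-stage regression (4.2) of estimated treatment
# effects on candidate causal factors.
import numpy as np

rng = np.random.default_rng(9)
n = 300                                       # treated unit-period cells
X = rng.normal(size=(n, 3))                   # candidate causal factors
delta_hat = 0.5 + X @ np.array([1.0, -0.5, 0.0]) \
    + rng.normal(scale=0.3, size=n)           # stand-in for estimated effects

Z = np.column_stack([np.ones(n), X])          # add the intercept a in (4.2)
coef = np.linalg.lstsq(Z, delta_hat, rcond=None)[0]
print("a and b estimates:", coef.round(2))    # near [0.5, 1.0, -0.5, 0.0]
```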
5 Concluding remarks
Under the assumption that the panel data contain both pre-treatment information and post-treatment control unit information, we review both the causal and the non-causal panel approaches to constructing the counterfactuals for the measurement of treatment effects. We argue that if the emphasis is on the measurement of treatment effects, then it is just an issue of prediction; there is no need to consider the identification and estimation of the parameters of the data-generating process. For models under the unconfoundedness assumption, we mainly review the data-driven LP and FB approaches to generating counterfactuals. In general, the linear projection approach can be applied to a wide variety of data-generating processes and is likely to yield more accurate predictions whatever the sample configuration of N and T, since the equation
$$\begin{aligned} y_{it}=a+E\left( y_{it}|\textbf{X}_{t}\right) +\eta _{it}, \end{aligned}$$
(5.1)
always holds, where \(\textbf{X}_{t}\) can include \(\tilde{\textbf{y}}_{t},\) lagged \(\tilde{\textbf{y}}_{t}\), or any covariates that satisfy \(f\left( \textbf{X}_{t}|d_{1t}\right) =f\left( \textbf{X}_{t}\right) .\) Chen (2023) has shown that LP is an attractive choice against a wide class of matching or difference-in-differences estimators.9
We did not consider another widely applied approach, difference-in-differences (DID, e.g., Cameron and Trivedi 2005). The DID approach is mainly suggested for the analysis of repeated cross-sectional data. Although one can apply the methodology to panel data, in the parametric case the application of DID to panel data is just the conventional dummy variable approach (e.g., Damrongplasit 2009). In the nonparametric case, it requires a nonparametric estimation of \(E(y_{1,T+h}| \varvec{z}_{T+h})\) period by period, \(h=1, \ldots , m\), which is computationally more cumbersome than the LP or FB approach. Moreover, as shown by Stone (1980), the convergence rate of nonparametric estimates is \(N^{-\frac{2\alpha }{2 \alpha +r}}\), where \(\alpha \) denotes the degree of smoothness of \(E(y|\varvec{z})\) (e.g., Chen 2007; Newey 1997) and r denotes the dimension of the conditioning covariates \(\varvec{z}\), while the LP and FB approaches are computationally straightforward and have the faster convergence rate of either \(T^{-1/2}\) or \(N^{-1/2}\).
However, it should be noted that although the data-driven approaches to constructing counterfactuals are reasonably simple to implement, they make it difficult to simulate treatment outcomes under different policy scenarios because they do not involve the identification and estimation of the parameters of the data-generating process. Under the assumption that the policy change does not change the decision rules (i.e., the Lucas (1976) critique does not apply), a typical procedure for considering outcomes under different policy scenarios is through the following steps:
Step 1: Construct a theoretical model for the outcomes of interest.
Step 2: Estimate the parameters of the theoretical model from observed data.
Step 3: Simulate the potential outcomes under different scenarios by manipulating the conditional covariates of the theoretical model.
For instance, Pesaran and Yang (2022) use a stochastic model of epidemics on networks to consider COVID-19 infection rate outcomes under different policy scenarios. But without data showing the outcomes under different policy scenarios, it is not possible to duplicate what they did through a non-causal approach.
It should also be noted that our review is based on the assumption that the observed data are either subject to the treatment or in the absence of the treatment. In many cases, the observed data could be subject to multiple treatments (e.g., Fujiki and Hsiao 2015; Ke and Hsiao 2023). For instance, the data observed during a pandemic, say COVID-19, are either the outcomes of both the pandemic and the specific disease control policy adopted under the pandemic, or of neither. A single equation approach is, in general, incapable of separating the impact of the pandemic from the effectiveness of specific control policies; a combination of different approaches may be needed to provide separate estimates of each specific treatment (e.g., Ke and Hsiao 2022, 2023). Moreover, our review is based on a single equation (i.e., ceteris paribus) approach. When a policy has macro (or global) impacts, one needs to take a system-of-equations (or mutatis mutandis) approach (e.g., Hood and Koopmans 1953; Hsiao 1983; Theil 1958). In particular, when policy changes lead to changes in decision rules (e.g., Lucas 1976), the values of the conditional covariates could also change due to the policy change. Without a model capturing how other variables would change due to policy changes, predictions based only on a specific policy change variable while holding the values of other variables constant are bound to be misleading. Furthermore, incorporating machine learning algorithms to convert text or sentiment data into digital form may also provide a real-time possibility of capturing underlying nonlinearities that could not be captured with a linear functional form (e.g., Hsiao 2024; Hsiao and Zhao 2000).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.