Published in: Empirical Economics 6/2023

Open Access 06-04-2023

Generalized kernel regularized least squares estimator with parametric error covariance

Authors: Justin Dang, Aman Ullah

Abstract

A two-step estimator of a nonparametric regression function via kernel regularized least squares (KRLS) with a parametric error covariance is proposed. The KRLS estimator, which ignores any information in the error covariance, is improved by incorporating a parametric error covariance, allowing for both heteroskedasticity and autocorrelation, into the estimation of the regression function. A two-step procedure is used: in the first step, a parametric error covariance is estimated from the KRLS residuals, and in the second step, a model transformed using the error covariance is estimated by KRLS. Theoretical results, including bias, variance, and asymptotics, are derived. Simulation results show that the proposed estimator outperforms KRLS under both heteroskedastic and autocorrelated errors. An empirical example is illustrated by estimating an airline cost function under a random effects model with heteroskedastic and correlated errors. The derivatives are evaluated, and the average partial effects of the inputs are determined in the application.
Notes
The authors are thankful to the two referees for their helpful comments and suggestions on the paper.


1 Introduction

Peter Schmidt has made many seminal contributions in advancing statistical inference methods and their applications in time series, cross section, and panel data econometrics in general (Schmidt 1976a) and, in particular, in the areas of dynamic econometric models, estimation and testing of cross-sectional and panel data models, crime and justice models (Schmidt and Witte 1984), and survival models (Schmidt and Witte 1988). His fundamental and innovative contributions to the econometrics of stochastic frontier production/cost models have had a significant impact on generations of econometricians (e.g., Schmidt 1976b, Aigner et al. 1977, Amsler et al. 2017, Amsler et al. 2019). He has also contributed many influential papers on developing efficient procedures involving the generalized least squares (GLS) method (see Guilkey and Schmidt 1973, Schmidt 1977, Arabmazar and Schmidt 1981, Ahn and Schmidt 1995), among others. These were for parametric models, whereas here we consider nonparametric models.
Nonparametric regression function estimators are useful econometric tools. Common kernel-based methods for estimating a regression function include Kernel Regularized Least Squares (KRLS), Support Vector Machines (SVM), and Local Polynomial Regression. However, in order to avoid overfitting the data, some type of regularization, such as lasso or ridge, is generally used. In this paper, we focus on KRLS; this method is also known as Kernel Ridge Regression (KRR) in the machine learning literature and is the kernelized version of simple ridge regression that allows for nonlinearities in the model.
In this paper, we establish a procedure for fitting a nonparametric regression function via KRLS under a general parametric error covariance. Some theoretical results on KRLS, including pointwise marginal effects, unbiasedness, consistency, and asymptotic normality, are found in Hainmueller and Hazlett (2014). However, Hainmueller and Hazlett (2014) only consider homoskedastic errors, and their estimator is unbiased only for the post-penalization function, not for the true underlying function. Confidence interval estimates for the Least Squares Support Vector Machine (LSSVM) are discussed in De Brabanter et al. (2011), allowing for heteroskedastic errors. Although not directly stated, the LSSVM estimator in De Brabanter et al. (2011) is equivalent to KRR/KRLS when an intercept term is included in the model. Following Hainmueller and Hazlett (2014), we will use KRLS without an intercept. Although De Brabanter et al. (2011) allow for heteroskedastic errors, none of the papers mentioned thus far discuss incorporating the error covariance into the estimation of the regression function itself, making these types of estimators inefficient. In this paper, we focus on making KRLS more efficient by incorporating a parametric error covariance, allowing for both heteroskedasticity and autocorrelation, into the estimation of the regression function. We use a two-step procedure: in the first step, we estimate the parametric error covariance from the residuals obtained by KRLS, and in the second step, we estimate the model by KRLS based on variables transformed using the error covariance. We also provide derivative estimators based on the two-step procedure, allowing us to determine the partial effects of the regressors on the dependent variable.
The structure of this paper is as follows: Sect. 2 discusses the model framework and the GKRLS estimator; Sects. 3, 4, and 5 present the finite sample properties, asymptotic properties, and partial effects and derivatives of the GKRLS estimator, respectively; Sect. 6 presents a simulation example; Sect. 7 illustrates an empirical example for a random effects model with heteroskedastic and correlated errors; and Sect. 8 concludes the paper.

2 Generalized KRLS estimator

Consider the nonparametric regression model:
$$\begin{aligned} Y_i = m(X_i) + U_i, \quad i=1,\ldots ,n, \end{aligned}$$
(1)
where \(X_i\) is a \(q\times 1\) vector of exogenous regressors, and \(U_i\) is the error term such that \(\mathbb {E}[U_i|X_{1},\ldots ,X_{n}] = \mathbb {E}[U_i|\textbf{X}]=0\), where \(\textbf{X}=(X_1,\ldots ,X_n)^\top \) and
$$\begin{aligned} \mathbb {E}[U_iU_j|\textbf{X}] = \omega _{ij}(\theta _0) \text { for some }\theta _0 \in \mathbb {R}^p, i, j = 1,\ldots ,n. \end{aligned}$$
(2)
In this framework, we allow the error covariance to be parametric, where the errors can be autocorrelated or non-identically distributed across observations.

2.1 KRLS estimator

For KRLS, the function \(m(\cdot )\) can be approximated by some function in the space of functions constituted by
$$\begin{aligned} m(\textbf{x}_0) = \sum _{i=1}^n {c}_i K_{\sigma }(\textbf{x}_i,\textbf{x}_0), \end{aligned}$$
(3)
for some test observation \(\textbf{x}_0\) and where \({c}_i,\; i=1,\ldots ,n\) are the parameters of interest, which can be thought of as the weights of the kernel functions \(K_{\sigma }(\cdot )\). The subscript of the kernel function, \(K_{\sigma }(\cdot )\), indicates that the kernel depends on the bandwidth parameter, \(\sigma \).
We will use the Radial Basis Function (RBF) kernel,
$$\begin{aligned} K_\sigma (\textbf{x}_i,\textbf{x}_0) = {\text {e}}^{-\frac{1}{\sigma ^2} || \textbf{x}_i - \textbf{x}_0||^2}. \end{aligned}$$
(4)
Notice that the RBF kernel is very similar to the Gaussian kernel: it does not have the normalizing term in front, and \(\sigma \) is proportional to the bandwidth h of the Gaussian kernel often used in nonparametric local polynomial regression. This functional form is justified by a regularized least squares problem with a feature mapping function that maps \(\textbf{x}\) into a higher dimension (Hainmueller and Hazlett 2014), where this derivation of KRLS is also known as Kernel Ridge Regression (KRR). Overall, KRLS uses a quadratic loss with a weighted \(L_2\)-regularization. Then, in matrix notation, the minimization problem is
$$\begin{aligned} \underset{\textbf{c}}{\arg \min } \; (\textbf{y} - \textbf{K}_{\sigma } \textbf{c})^\top (\textbf{y} - \textbf{K}_{\sigma } \textbf{c}) + \lambda \textbf{c}^\top \textbf{K}_{\sigma }\textbf{c}, \end{aligned}$$
(5)
where \(\textbf{y}\) is the vector of training data corresponding to the dependent variable, \(\textbf{K}_{\sigma }\) is the kernel matrix, with \(K_{\sigma ,i,j} = K_{\sigma }(\textbf{x}_i,\textbf{x}_j)\) for \(i,j=1,\ldots ,n\), and \(\textbf{c}\) is the vector of coefficients that is optimized over. The solution to this minimization problem is
$$\begin{aligned} \widehat{\textbf{c}}_1 = (\textbf{K}_{\sigma _1}+\lambda _1 \textbf{I})^{-1}\textbf{y}. \end{aligned}$$
(6)
The kernel function can be user specified, but in this paper we only consider the RBF kernel in Eq. (4). The kernel function’s hyperparameter \(\sigma \) and the regularization parameter \(\lambda \) can also be user specified or can be found via cross validation. The subscript 1 denotes the KRLS estimator, i.e., the first-stage estimation. Finally, predictions for KRLS can be made by
$$\begin{aligned} \widehat{m}_{1}(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{1,i} K_{\sigma _1}(\textbf{x}_i,\textbf{x}_0). \end{aligned}$$
(7)
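To make the first-stage estimator concrete, the following is a minimal NumPy sketch of the kernel in Eq. (4), the coefficients in Eq. (6), and the predictions in Eq. (7). The helper names (rbf_kernel, krls_fit, krls_predict) are ours and do not refer to any existing package.

```python
import numpy as np

def rbf_kernel(X, Z, sigma2):
    """RBF kernel matrix with entries exp(-||x_i - z_j||^2 / sigma2), Eq. (4)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma2)

def krls_fit(X, y, sigma2, lam):
    """First-stage KRLS coefficients c_hat_1 = (K + lam*I)^{-1} y, Eq. (6)."""
    K = rbf_kernel(X, X, sigma2)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krls_predict(X_train, c_hat, X0, sigma2):
    """Predictions m_hat_1(x0) = sum_i c_hat_i K(x_i, x0), Eq. (7)."""
    return rbf_kernel(X0, X_train, sigma2) @ c_hat
```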

2.2 An efficient KRLS estimator

The KRLS estimator, \(\widehat{m}_{1}(\cdot )\), does not take into account any information in the error covariance structure and is therefore inefficient. Consider the \(n\times n\) error covariance matrix, \(\Omega (\theta )\), whose (ij)th element is \(\omega _{ij}(\theta )\). Assume that \(\Omega (\theta )=P(\theta )P(\theta )'\) for some square matrix \(P(\theta )\), and let \(p_{ij}(\theta )\) and \(v_{ij}(\theta )\) denote the (ij)th elements of \(P(\theta )\) and \(P(\theta )^{-1}\), respectively. Let \(\textbf{m}\equiv (m(X_1), \ldots , m(X_n))^\prime \) and \(\textbf{U} \equiv (U_1, \ldots , U_n)^\prime \). Now, premultiply the model in Eq. (1) by \(P^{-1}\), where we write \(P^{-1}=P^{-1}(\theta )\) to condense the notation, so the dependence on \(\theta \) is implied:
$$\begin{aligned} P^{-1}\textbf{y} = P^{-1}\textbf{m}+P^{-1}\textbf{U}. \end{aligned}$$
(8)
The transformed error term, \(P^{-1}\textbf{U}\), has mean \(\varvec{0}\) and the identity matrix as its covariance matrix. Therefore, we consider a regression of \(P^{-1}\textbf{y}\) on \(P^{-1}\textbf{m}\), which simply re-scales the variables by the inverse square root of the error covariance. Since \(\textbf{m}=\textbf{K}_{\sigma }\textbf{c}\), the quadratic loss function with \(L_2\) regularization under the transformed variables is
$$\begin{aligned} \underset{\textbf{c}}{\arg \min } (\textbf{y}-\textbf{K}_{\sigma }\textbf{c})^{\top }\Omega ^{-1} (\textbf{y}-\textbf{K}_{\sigma }\textbf{c}) + \lambda \textbf{c}^\top \textbf{K}_{\sigma }\textbf{c}. \end{aligned}$$
(9)
The solution for the coefficient vector is
$$\begin{aligned} \hat{\textbf{c}}_2 =(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\textbf{y} \end{aligned}$$
(10)
Note that this solution depends on the bandwidth parameter \(\sigma _2\) and ridge parameter \(\lambda _2\), which can differ from the hyperparameters used in the first-stage KRLS estimator. In practice, cross validation can be used to obtain estimates of both hyperparameters. Here, it is assumed that \(\Omega \) is known if \(\theta \) is known. However, if \(\theta \) is unknown, it can be estimated consistently and \(\Omega \) can be replaced by \(\widehat{\Omega }=\widehat{\Omega }(\hat{\theta })\).1
Furthermore, predictions for the generalized KRLS estimator can be made by
$$\begin{aligned} \widehat{m}_2(\textbf{x}_0) = \sum _{i=1}^n \widehat{c}_{2,i} K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0) \end{aligned}$$
(11)
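Putting Eqs. (6), (10), and (11) together, the two-step estimator can be sketched as follows. This is a minimal NumPy illustration that reuses the rbf_kernel helper from the sketch in Sect. 2.1; estimate_omega is a hypothetical, user-supplied function standing in for whichever parametric covariance model (e.g., an AR(1) process or a variance function fitted to the first-step residuals) is assumed in the application.

```python
import numpy as np

def gkrls_fit(X, y, sigma2_1, lam1, sigma2_2, lam2, estimate_omega):
    """Two-step GKRLS: (i) KRLS residuals -> Omega_hat, (ii) c_hat_2 from Eq. (10)."""
    n = len(y)
    K1 = rbf_kernel(X, X, sigma2_1)                  # first-step kernel matrix
    c1 = np.linalg.solve(K1 + lam1 * np.eye(n), y)   # Eq. (6)
    resid = y - K1 @ c1                              # first-step residuals
    Omega_hat = estimate_omega(resid, X)             # parametric covariance estimate
    Omega_inv = np.linalg.inv(Omega_hat)
    K2 = rbf_kernel(X, X, sigma2_2)
    c2 = np.linalg.solve(Omega_inv @ K2 + lam2 * np.eye(n), Omega_inv @ y)  # Eq. (10)
    return c2, Omega_hat

# Predictions then follow Eq. (11): m_hat_2(x0) = rbf_kernel(x0, X, sigma2_2) @ c2
```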
The two-step procedure is outlined below:
1.
Estimate Eq. (1) by KRLS from Eq. (7) with bandwidth parameter, \(\sigma _1\) and ridge parameter, \(\lambda _1\). Obtain the residuals which can then be used to get a consistent estimate for \(\Omega \).
 
2.
Estimate Eq. (8) by KRLS under the transformed variables as in Eqs. (9)–(11). Denote these estimates as GKRLS.
 

2.3 Selection of hyperparameters

Throughout this paper, we focus on the RBF kernel in Eq. (4), which contains the hyperparameter \(\sigma _1\) (and \(\sigma _2\)). Since these parameters are squared in the RBF kernel in Eq. (4), we can instead search for the hyperparameters \(\sigma _1^2\) and \(\sigma _2^2\). The hyperparameters \(\lambda _1,\lambda _2,\sigma _1^2\), and \(\sigma _2^2\) are selected via leave-one-out cross validation (LOOCV). However, prior to cross validation, it is common in penalized methods to scale the data to have mean 0 and standard deviation 1. This way, the penalty parameters \(\lambda _1\) and \(\lambda _2\) do not depend on the scale of the data or the magnitude of the coefficients. Note that the scaling of the data does not affect the interpretation of predictions and marginal effects, since the estimates can be translated back to their original scale and location.
For the hyperparameters \(\sigma _1^2\) and \(\sigma _2^2\), Hainmueller and Hazlett (2014) suggest setting \(\sigma ^2=q\), the number of regressors. Therefore, in items 1 and 2 of the two-step procedure, \(\sigma _1^2=q\) and \(\sigma _2^2=q\). Then, only the penalty hyperparameters \(\lambda _1\) and \(\lambda _2\) need to be chosen. \(\lambda _1\) is chosen via LOOCV in item 1 of the two-step procedure using Eq. (5), and \(\lambda _2\) is then chosen via LOOCV in item 2 using Eq. (9). If one wishes to also search for \(\sigma _1^2\) and \(\sigma _2^2\), one would perform LOOCV to find \(\lambda _1\) and \(\sigma _1^2\) simultaneously in item 1 using Eq. (5), and then perform another LOOCV to find \(\lambda _2\) and \(\sigma _2^2\) simultaneously in item 2 using Eq. (9).
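As an illustration of the selection of \(\lambda _1\), the sketch below performs a naive grid-search LOOCV that refits the first-stage KRLS problem n times per candidate value, reusing the rbf_kernel helper from the sketch in Sect. 2.1; \(\lambda _2\) would be chosen analogously from the transformed objective in Eq. (9). The grid and the brute-force refitting are our choices; the paper's implementation may use a faster closed-form leave-one-out shortcut.

```python
import numpy as np

def loocv_lambda1(X, y, sigma2, lam_grid):
    """Naive leave-one-out CV for the first-stage ridge parameter lambda_1."""
    n = len(y)
    cv_scores = []
    for lam in lam_grid:
        sq_errs = []
        for i in range(n):
            keep = np.delete(np.arange(n), i)                         # drop observation i
            K = rbf_kernel(X[keep], X[keep], sigma2)
            c = np.linalg.solve(K + lam * np.eye(n - 1), y[keep])     # Eq. (6) on n-1 obs
            k_i = rbf_kernel(X[i:i + 1], X[keep], sigma2)
            sq_errs.append((y[i] - (k_i @ c)[0]) ** 2)                # squared error at i
        cv_scores.append(np.mean(sq_errs))
    return lam_grid[int(np.argmin(cv_scores))]

# Example usage (sigma_1^2 = q as suggested above):
# lam1 = loocv_lambda1(X, y, sigma2=X.shape[1], lam_grid=np.logspace(-3, 2, 20))
```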

3 Finite sample properties

In this section, finite sample properties of both KRLS and GKRLS estimators, including the estimation procedures of bias and variance, are discussed in detail.

3.1 Estimation of bias and variance

In this subsection, we estimate the bias and variance of the two-step estimator. Following De Brabanter et al. (2011), notice that the GKRLS estimator is a linear smoother.
Definition 1
An estimator \(\widehat{m}\) of m is a linear smoother if, for each \(\textbf{x}_0\in \mathbb {R}^q\), there exists a vector \(L(\textbf{x}_0)=(l_1(\textbf{x}_0),\ldots ,l_n(\textbf{x}_0))^\top \in \mathbb {R}^n\) such that
$$\begin{aligned} \widehat{m}(\textbf{x}_0) = \sum _{i=1}^n l_i(\textbf{x}_0)Y_i, \end{aligned}$$
(12)
where \(\widehat{m}(\cdot ):\mathbb {R}^{q}\rightarrow \mathbb {R}\).
For in sample data, Eq. (12) can be written in matrix form as \(\widehat{\textbf{m}}=\textbf{Ly}\), where \(\widehat{\textbf{m}}=(\widehat{m}(X_1),\ldots ,\widehat{m}(X_n))^\top \in \mathbb {R}^n\) and \(\textbf{L} = (l({X_1})^\top ,\ldots ,l({X_n})^\top )^\top \in \mathbb {R}^{n\times n}\), where \(\textbf{L}_{ij}=l_j(X_i)\). The ith row of \(\textbf{L}\) shows the weights given to each \(Y_j\), \(j=1,\ldots ,n\), in estimating \(\widehat{m}(X_i)\). For the rest of the paper, we will denote by \(\widehat{m}_2(\cdot )\) the prediction made by GKRLS for a single observation and by \(\widehat{\textbf{m}}_2\) the \(n\times 1\) vector of predictions made for the training data.
To obtain the bias and variance of the GKRLS estimator, we assume the following:
Assumption 1
The regression function \(m(\cdot )\) to be estimated falls in the space of functions represented by \(m(\textbf{x}_0) = \sum _{i=1}^n {c}_i K_\sigma (\textbf{x}_i,\textbf{x}_0)\) and assume the model in Eq. (1).
Assumption 2
\(\mathbb {E}[U_i| \textbf{X}] = 0\) and \(\mathbb {E}[U_iU_j|\textbf{X}] = \omega _{ij}(\theta ) \text { for some }\theta \in \mathbb {R}^p, i, j = 1,\ldots ,n \)
Using Definition 1, Assumption 1, and Assumption 2, the conditional mean and variance can be obtained by the following theorem.
Theorem 1
The GKRLS estimator in Eq. (11) is
$$\begin{aligned} \begin{aligned} \widehat{m}_{2}(\textbf{x}_0)&= \sum _{i=1}^n l_i(\textbf{x}_0)Y_i\\&= {L(\textbf{x}_0)}^\top \textbf{y}, \end{aligned} \end{aligned}$$
(13)
and \(L(\textbf{x}_0)=(l_1(\textbf{x}_0),\ldots , l_n(\textbf{x}_0))^\top \) is the smoother vector,
$$\begin{aligned} L(\textbf{x}_0) = \left[ K_{\sigma _2,\textbf{x}_0}^{*\top } (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top , \end{aligned}$$
(14)
with \(K_{\sigma _2,\textbf{x}_0}^{*}= (K_{\sigma _2}(\textbf{x}_1,\textbf{x}_0),\ldots ,K_{\sigma _2}(\textbf{x}_n,\textbf{x}_0))^\top \) the kernel vector evaluated at point \(\textbf{x}_0\).
Then, the estimator, under model Eq. (1), has conditional mean
$$\begin{aligned} \mathbb {E}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]=L(\textbf{x}_0)^\top \textbf{m} \end{aligned}$$
(15)
and conditional variance
$$\begin{aligned} {\text {Var}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \Omega L(\textbf{x}_0). \end{aligned}$$
(16)
Proof
see Appendix A. \(\square \)
From Theorem 1, the conditional bias can be written as
$$\begin{aligned} \begin{aligned} {\text {Bias}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]&=\mathbb {E}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]-m(\textbf{x}_0)\\&=L(\textbf{x}_0)^\top \textbf{m} - m(\textbf{x}_0) \end{aligned} \end{aligned}$$
(17)
Following De Brabanter et al. (2011), we will estimate the conditional bias and variance by the following:
Theorem 2
Let \(L(\textbf{x}_0)\) be the smoother vector evaluated at \(\textbf{x}_0\) and let \(\widehat{\textbf{m}}_2 = (\widehat{m}_2(\textbf{x}_1), \ldots , \widehat{m}_2(\textbf{x}_n))^\top \) be the in sample GKRLS predictions. For a consistent estimator of the covariance matrix such that \(\widehat{\Omega }\rightarrow \Omega \), the estimated conditional bias and variance for GKRLS are obtained by
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}_{2}(\textbf{x}_0) \end{aligned}$$
(18)
and
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \widehat{\Omega } L(\textbf{x}_0). \end{aligned}$$
(19)
Proof
See Appendix B. \(\square \)
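As an illustration, the smoother vector in Eq. (14) and the estimates in Eqs. (18) and (19) can be computed directly with linear algebra. The sketch below reuses the rbf_kernel helper from the sketch in Sect. 2.1; m_hat_train is assumed to hold the in-sample GKRLS fits \(\widehat{\textbf{m}}_2\) and Omega_hat a consistent covariance estimate.

```python
import numpy as np

def gkrls_bias_var(X_train, y, x0, m_hat_train, sigma2_2, lam2, Omega_hat):
    """Estimated conditional bias and variance of m_hat_2(x0), Eqs. (18)-(19)."""
    n = len(y)
    Omega_inv = np.linalg.inv(Omega_hat)
    K2 = rbf_kernel(X_train, X_train, sigma2_2)
    # M = (Omega^{-1} K + lam I)^{-1} Omega^{-1}, so that L(x0)^T = k(x0)^T M, Eq. (14)
    M = np.linalg.solve(Omega_inv @ K2 + lam2 * np.eye(n), Omega_inv)
    k0 = rbf_kernel(x0[None, :], X_train, sigma2_2)   # kernel vector at x0
    L_row = (k0 @ M).ravel()                          # smoother vector L(x0)
    m_hat_x0 = L_row @ y                              # prediction, Eq. (13)
    bias_hat = L_row @ m_hat_train - m_hat_x0         # Eq. (18)
    var_hat = L_row @ Omega_hat @ L_row               # Eq. (19)
    return bias_hat, var_hat
```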

3.2 Bias and variance of KRLS

First, note that the KRLS estimator is also a linear smoother, so the bias and the variance take the same form as in Eqs. (18) and (19), except that the linear smoother vector \(L(\textbf{x}_0)\) will be different. Let
$$\begin{aligned} L_{1}(\textbf{x}_0)&=\left[ K_{\sigma _1,\textbf{x}_0}^{*\top } (\textbf{K}_{\sigma _1}+\lambda _1 \textbf{I})^{-1} \right] ^\top \end{aligned}$$
(20)
be the smoother vector for KRLS. Then, Eq. (7) can be rewritten as
$$\begin{aligned} \widehat{m}_{1}(\textbf{x}_0) = L_{1}(\textbf{x}_0)^\top \textbf{y}. \end{aligned}$$
(21)
Using Theorem 1 and Theorem 2 and applying them to the KRLS estimator, the estimated conditional bias and variance of KRLS are
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{1}(\textbf{x}_0)|X=\textbf{x}_0]&= L_{1}(\textbf{x}_0)^\top \widehat{\textbf{m}}_{1} - \widehat{m}_{1}(\textbf{x}_0) \end{aligned}$$
(22)
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{1}(\textbf{x}_0)|X=\textbf{x}_0]&= L_{1}(\textbf{x}_0)^\top \widehat{\Omega } L_{1}(\textbf{x}_0), \end{aligned}$$
(23)
where \(\widehat{\textbf{m}}_{1}\) is the \(n\times 1\) vector of fitted values for KRLS. Note that the estimate of the covariance matrix, \(\Omega \), will be the same for both KRLS and GKRLS.

4 Asymptotic properties

The asymptotic properties of GKRLS, including consistency, asymptotic normality, and bias corrected confidence intervals are covered in this section. To obtain consistency of the GKRLS estimator, we also assume:
Assumption 3
Let \(\lambda _1,\lambda _2,\sigma _1,\sigma _2>0\) and as \(n\rightarrow \infty \), for singular values of \(\textbf{L}P\) given by \(d_i\), \(\sum _{i=1}^n d_i^2\) grows slower than n once \(n>M\) for some \(M<\infty \).
Theorem 3
Under Assumptions 1–3, let the bias corrected fitted values be denoted by
$$\begin{aligned} \widehat{\textbf{m}}_{2,c}=\widehat{\textbf{m}}_{2}-{\text {Bias}}[\widehat{\textbf{m}}_{2}|\textbf{X}], \end{aligned}$$
(24)
then
$$\begin{aligned} \underset{n\rightarrow \infty }{{\text {lim}}} {\text {Var}}[\widehat{\textbf{m}}_{2,c}|\textbf{X}]=0 \end{aligned}$$
(25)
and the bias corrected GKRLS estimator is \(\sqrt{n}\)-consistent with \(\underset{n\rightarrow \infty }{{\text {plim}}} \; \widehat{m}_{c,n}(\textbf{x}_{i})=m(\textbf{x}_i)\) for all i.
Proof
See Appendix C. \(\square \)
The estimated conditional bias from Eq. (18) and conditional variance from Eq. (19) can be used to construct pointwise confidence intervals. Asymptotic normality of the proposed estimator is given via the central limit theorem.
Theorem 4
Under Assumptions 1–3, \(\widehat{\textbf{m}}_2\) is asymptotically normal by the central limit theorem:
$$\begin{aligned} \sqrt{n}(\widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m})\overset{d}{\rightarrow } N(\varvec{0},{\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]), \end{aligned}$$
(26)
where \({\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}] = \textbf{Lm}-\textbf{m}\) and \({\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}] = \textbf{L}\Omega \textbf{L}^\top \).
Proof
See Appendix D. \(\square \)
Since GKRLS is a biased estimator of m, we need to adjust the pointwise confidence intervals to allow for the bias. Since the exact conditional bias and variance are unknown, we use Eqs. (18) and (19) as estimates and construct approximate bias corrected \(100(1-\alpha )\%\) pointwise confidence intervals from Theorem 4 as
$$\begin{aligned} \widehat{{m}}_{2}(\textbf{x}_i)-\widehat{{\text {Bias}}}[\widehat{{m}}_{2}(\textbf{x}_i)|X =\textbf{x}_i] \pm z_{1-\alpha /2}\sqrt{\widehat{{\text {Var}}}[\widehat{{m}}_{2}(\textbf{x}_i)|X=\textbf{x}_i]} \end{aligned}$$
(27)
for all i. Furthermore, to test the significance of the estimated regression function at an observation point, we can use the bias corrected confidence interval to see if 0 is in the interval.
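A minimal sketch of the bias corrected interval in Eq. (27), taking the estimates from Eqs. (18) and (19) as inputs; SciPy is assumed only for the standard normal quantile.

```python
import numpy as np
from scipy.stats import norm

def bias_corrected_ci(m_hat_x0, bias_hat, var_hat, alpha=0.05):
    """Approximate bias corrected 100(1-alpha)% pointwise interval, Eq. (27)."""
    z = norm.ppf(1 - alpha / 2)          # z_{1-alpha/2}
    center = m_hat_x0 - bias_hat         # bias corrected point estimate
    half_width = z * np.sqrt(var_hat)
    return center - half_width, center + half_width
```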

5 Partial effects and derivatives

We also derive an estimator for pointwise partial derivatives with respect to a certain variable \(\textbf{x}^{(r)}\). The partial derivative of the GKRLS estimator, \(\widehat{m}_{2}(\textbf{x}_0)\) with respect to the rth variable is
$$\begin{aligned} \begin{aligned} \widehat{m}_{2,r}^{(1)}(\textbf{x}_0)&= \sum _{i=1}^n\frac{\partial K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0)}{\partial \textbf{x}_0^{(r)}} \widehat{c}_{2,i}\\&=\frac{2}{\sigma _2^2} \sum _{i=1}^n {\text {e}}^{-\frac{1}{\sigma _2^2} || \textbf{x}_i - \textbf{x}_0||^2} \big (\textbf{x}_i^{(r)}-\textbf{x}_0^{(r)}\big ) \widehat{c}_{2,i}, \end{aligned} \end{aligned}$$
(28)
using the RBF kernel in Eq. (4) and where \(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)\equiv \frac{\partial \widehat{m}_{2}(\textbf{x}_0)}{\partial \textbf{x}^{(r)}}\). To find the conditional bias and variance of the derivative estimator, we use the following:
Theorem 5
The GKRLS derivative estimator in Eq. (28) with the RBF kernel in Eq. (4) can be rewritten as
$$\begin{aligned} \widehat{m}_{2,r}^{(1)}(\textbf{x}_0)&=S_r(\textbf{x}_0)^\top \textbf{y}, \end{aligned}$$
(29)
where \(\Delta _r \equiv \frac{2}{\sigma _2^2}{\text {diag}} (\textbf{x}_1^{(r)}-\textbf{x}_0^{(r)},\ldots ,\textbf{x}_n^{(r)}-\textbf{x}_0^{(r)})\) is a \(n\times n\) diagonal matrix, and
$$\begin{aligned} S_r(\textbf{x}_0)=\left[ K_{\sigma _2,\textbf{x}_0}^{*\top } \Delta _r (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top \end{aligned}$$
(30)
is the smoother vector for the first partial derivative with respect to the rth variable. Then, the conditional mean of the GKRLS derivative estimator is
$$\begin{aligned} \mathbb {E}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0]=S_r(\textbf{x}_0)^\top \textbf{m} \end{aligned}$$
(31)
and conditional variance is
$$\begin{aligned} {\text {Var}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0] = S_r(\textbf{x}_0)^\top \Omega S_r(\textbf{x}_0). \end{aligned}$$
(32)
Proof
see Appendix E. \(\square \)
Using Theorem 5, the conditional bias and variance can be estimated as follows
Theorem 6
Let \(S_r(\textbf{x}_0)\) be the smoother vector for the partial derivative evaluated at \(\textbf{x}_0\) and let \(\widehat{\textbf{m}}_2 = (\widehat{m}_2(\textbf{x}_1), \ldots , \widehat{m}_2(\textbf{x}_n))^\top \) be the in sample GKRLS predictions. For a consistent estimator of the covariance matrix such that \(\widehat{\Omega }\rightarrow \Omega \), the estimated conditional bias and variance for GKRLS derivative estimator in Eq. (28) are obtained by
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0] = S_r(\textbf{x}_0)^\top \widehat{\textbf{m}}_2 - \widehat{m}^{(1)}_{2,r}(\textbf{x}_0) \end{aligned}$$
(33)
and
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0] = S_r(\textbf{x}_0)^\top \widehat{\Omega } S_r(\textbf{x}_0). \end{aligned}$$
(34)
Proof
See Appendix F. \(\square \)
The average partial derivative with respect to the rth variable is
$$\begin{aligned} \widehat{m}_{avg,r}^{(1)}=\frac{1}{n^\prime } \sum _{j=1}^{n^\prime } \widehat{m}^{(1)}_{2,r}(\textbf{x}_{0,j}) \end{aligned}$$
(35)
The bias and variance of the average partial derivative estimator are given by
$$\begin{aligned} {\text {Bias}}[ \widehat{m}_{avg,r}^{(1)}|X]=\frac{1}{n^\prime } \varvec{\iota }_{n^\prime }^\top \textbf{S}_{0,r} \textbf{m}- \frac{1}{n^\prime }\varvec{\iota }_{n^\prime }^\top \textbf{m}_{0,r}^{(1)} \end{aligned}$$
(36)
and
$$\begin{aligned} {\text {Var}}[\widehat{m}_{avg,r}^{(1)}|X] = \frac{1}{n^{\prime ^2}} \varvec{\iota }^\top _{n^\prime } \textbf{S}_{0,r} \Omega \textbf{S}_{0,r}^\top \varvec{\iota }_{n^\prime } , \end{aligned}$$
(37)
where \(n^\prime \) is the number of observations in the testing set, \(\varvec{\iota }_{n^\prime }\) is a \(n^\prime \times 1\) vector of ones, \(\textbf{S}_{0,r}\) is the \(n^\prime \times n\) smoother matrix with the jth row as \(S_r(\textbf{x}_{0,j}), j=1,\ldots ,n^\prime \), and \(\textbf{m}_{0,r}^{(1)}\) is the \(n^\prime \times 1\) vector of derivatives evaluated at each \(\textbf{x}_{0,j},j=1,\ldots ,n^\prime \).
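A sketch of the pointwise derivative in Eq. (28) and the average partial derivative in Eq. (35); here c2_hat denotes the second-stage coefficients from Eq. (10) and X_test the \(n^\prime \) evaluation points, with the RBF kernel computed inline.

```python
import numpy as np

def gkrls_partial_derivative(X_train, c2_hat, x0, r, sigma2_2):
    """Pointwise partial derivative of the GKRLS fit w.r.t. regressor r, Eq. (28)."""
    diffs = X_train - x0                                    # rows are x_i - x0
    weights = np.exp(-(diffs ** 2).sum(axis=1) / sigma2_2)  # RBF kernel values, Eq. (4)
    return (2.0 / sigma2_2) * np.sum(weights * diffs[:, r] * c2_hat)

def gkrls_average_partial_derivative(X_train, c2_hat, X_test, r, sigma2_2):
    """Average partial derivative over the test points, Eq. (35)."""
    return np.mean([gkrls_partial_derivative(X_train, c2_hat, x0, r, sigma2_2)
                    for x0 in X_test])
```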

5.1 First differences for binary independent variables

Unlike for the continuous case, partial effects for binary independent variables should be interpreted as and estimated by first differences. That is, the estimated effect of going from \(x^{(b)}=0\) to \(x^{(b)}=1\) can be determined by
$$\begin{aligned} \begin{aligned} \widehat{m}_{FD_b}(\textbf{x}_0)&=\widehat{m}(x^{(b)}=1,\textbf{x}_0) - \widehat{m}(x^{(b)}=0,\textbf{x}_0)\\&=L_{FD_b}(\textbf{x}_0)^\top \textbf{y} \end{aligned} \end{aligned}$$
(38)
where \(\widehat{m}_{FD_b}(\cdot )\) is the first difference estimator for the bth binary independent variable, \(x^{(b)}\) is a binary variable that takes the values 0 or 1, \(\textbf{x}_0\) is the \((q-1)\times 1\) vector of the other independent variables evaluated at some test observation, and \(L_{FD_b}(\textbf{x}_0) \equiv L(x^{(b)}=1,\textbf{x}_0)-L(x^{(b)}=0,\textbf{x}_0)\) is the first difference smoother vector. The conditional bias and variance of the first difference GKRLS estimator in Eq. (38) are shown in the following theorem.
Theorem 7
Using Theorems 1 and 2, the conditional bias and variance for the GKRLS first difference estimator in Eq. (38) are obtained by
$$\begin{aligned} {{\text {Bias}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0] = L_{FD_b}(\textbf{x}_0)^\top {\textbf{m}} - m_{FD_b}(\textbf{x}_0) \end{aligned}$$
(39)
and
$$\begin{aligned} {{\text {Var}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0] = L_{FD_b}(\textbf{x}_0)^\top {\Omega } L_{FD_b}(\textbf{x}_0), \end{aligned}$$
(40)
where \(m_{FD_b}(\textbf{x}_0)={m}(x^{(b)}=1,\textbf{x}_0) - {m}(x^{(b)}=0,\textbf{x}_0)\).
Proof
See Appendix G. \(\square \)
Then, the conditional bias and variance can be estimated as follows:
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0]= & {} L_{FD_b}(\textbf{x}_0)^\top \widehat{\textbf{m}} - \widehat{m}_{FD_b}(\textbf{x}_0) \end{aligned}$$
(41)
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0]= & {} L_{FD_b}(\textbf{x}_0)^\top \widehat{\Omega } L_{FD_b}(\textbf{x}_0). \end{aligned}$$
(42)
Note that Eq. (38) provides the pointwise first difference estimates. If one is interested in the average partial effect of going from \(x^{(b)}=0\) to \(x^{(b)}=1\), the following average first difference GKRLS estimator would be used.
$$\begin{aligned} \widehat{m}_{\overline{FD},b} = \frac{1}{n^\prime } \sum _{j=1}^{n^\prime } \widehat{m}_{FD_b}(\textbf{x}_{0,j}). \end{aligned}$$
(43)
This average partial effect of a discrete variable is similar to the continuous case and can be compared to traditional parametric partial effects as in the case of least squares coefficients. The conditional bias and variance of the average first difference GKRLS estimator in Eq. (43) are:
$$\begin{aligned} {{\text {Bias}}}[\widehat{m}_{\overline{FD}_b}(\textbf{x}_0)|X=\textbf{x}_0]= & {} \frac{1}{n^\prime }\varvec{\iota }^{\top }_{n^\prime } \textbf{L}_{FD_{0,b}} {\textbf{m}} - \frac{1}{n^\prime }\varvec{\iota }^{\top }_{n^\prime } \textbf{m}_{FD_{0,b}} \end{aligned}$$
(44)
$$\begin{aligned} {{\text {Var}}}[\widehat{m}_{\overline{FD}_b}|X=\textbf{x}_0]= & {} \frac{1}{n^{\prime ^2}}\varvec{\iota }^{\top }_{n^\prime } \textbf{L}_{FD_{0,b}} {\Omega } \textbf{L}_{FD_{0,b}}^\top , \end{aligned}$$
(45)
where \(\textbf{L}_{FD_{0,b}}\) is the \(n^\prime \times n\) smoother matrix with the jth row as \(L_{FD_b}(\textbf{x}_{0,j}), j=1,\ldots ,n^\prime \), and \(\textbf{m}_{FD_{0,b}}\) is the \(n^\prime \times 1\) vector of first differences evaluated at each \(\textbf{x}_{0,j},j=1,\ldots ,n^\prime \). The conditional bias and variance of the average first difference estimator can be estimated using Eqs. (41) and (42).
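A sketch of the first difference effect in Eq. (38) for a binary regressor, reusing the rbf_kernel helper from the sketch in Sect. 2.1; x0 is passed as the full q-vector of covariates, and only its bth entry is switched between 0 and 1.

```python
import numpy as np

def gkrls_first_difference(X_train, c2_hat, x0, b, sigma2_2):
    """Estimated effect of moving the binary regressor b from 0 to 1, Eq. (38)."""
    x_high = np.array(x0, dtype=float)
    x_low = np.array(x0, dtype=float)
    x_high[b], x_low[b] = 1.0, 0.0
    k_high = rbf_kernel(x_high[None, :], X_train, sigma2_2)
    k_low = rbf_kernel(x_low[None, :], X_train, sigma2_2)
    return float((k_high - k_low) @ c2_hat)   # m_hat(x^(b)=1, x0) - m_hat(x^(b)=0, x0)
```

The average first difference in Eq. (43) is then simply the mean of these pointwise effects over the test observations.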

6 Simulations

We conduct simulations to examine the efficiency gains of the proposed generalized KRLS estimator. Consider the data generating process from Eq. (1):
$$\begin{aligned} Y_i = m(X_i)+U_i, \quad i=1,\ldots ,n. \end{aligned}$$
(1)
We consider a sample size of \(n=200\) and three independent variables generated from
$$\begin{aligned} \begin{aligned} X_1&\sim {Bern}(0.5)\\ X_2&\sim N(0,1)\\ X_3&\sim U(-1,1). \end{aligned} \end{aligned}$$
(46)
The specification for m is:
$$\begin{aligned} m(X_i)=5-2X_{i,1}+\sin (X_{i,2})+3X_{i,3} \end{aligned}$$
(47)
and the partial derivatives with respect to each independent variable are given by
$$\begin{aligned} \begin{aligned} m^{(1)}_1(X_i)&=-2\\ m^{(1)}_2(X_i)&= \cos (X_{i,2})\\ m^{(1)}_3(X_i)&= 3 \end{aligned} \end{aligned}$$
(48)
For the error terms, we consider two cases.
$$\begin{aligned} \begin{gathered} U_i=0.7U_{i-1}+V_i\\ V_i\sim N(0,5^2)\\ \end{gathered} \end{aligned}$$
(49)
and
$$\begin{aligned} U_i\sim N\left( 0,{\text {exp}}(X_{i,1}+0.2X_{i,2}-0.3X_{i,3})\right) \end{aligned}$$
(50)
First, in Eq. (49), \(U_i\) is generated by an AR(1) process. Second, in Eq. (50), the \(U_i\) are heteroskedastic but mutually independent, with \({\text {Var}}[U_i|\textbf{X}]={\text {exp}}(X_{i,1}+0.2X_{i,2}-0.3X_{i,3})\).
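For reference, the data generating process in Eqs. (46)–(50) can be simulated as below; the initialization of the AR(1) chain (\(U_1=V_1\)) is our own choice, since the text does not specify one.

```python
import numpy as np

def simulate_data(n=200, errors="ar1", seed=0):
    """Simulate the DGP of Eqs. (46)-(50) with AR(1) or heteroskedastic errors."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([rng.binomial(1, 0.5, n),       # X1 ~ Bern(0.5)
                         rng.normal(0.0, 1.0, n),       # X2 ~ N(0, 1)
                         rng.uniform(-1.0, 1.0, n)])    # X3 ~ U(-1, 1)
    m = 5 - 2 * X[:, 0] + np.sin(X[:, 1]) + 3 * X[:, 2]  # Eq. (47)
    if errors == "ar1":                                   # Eq. (49)
        v = rng.normal(0.0, 5.0, n)
        u = np.empty(n)
        u[0] = v[0]                                       # initialization (our choice)
        for i in range(1, n):
            u[i] = 0.7 * u[i - 1] + v[i]
    else:                                                 # Eq. (50)
        sd = np.sqrt(np.exp(X[:, 0] + 0.2 * X[:, 1] - 0.3 * X[:, 2]))
        u = rng.normal(0.0, 1.0, n) * sd
    return X, m + u, m
```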
In addition to the proposed estimator, we compare four other nonparametric estimators: the KRLS estimator (KRLS), the Local Polynomial (LP) estimator with degree zero, Random Forest (RF), and Support Vector Machine (SVM). The KRLS estimator is used as a comparison to GKRLS to show the magnitude of the efficiency loss from ignoring the information in the error covariance matrix. The KRLS, LP, RF, and SVM estimators do not utilize the covariance matrix in estimating the regression function and thus ignore heteroskedasticity or autocorrelation in the errors. For the GKRLS and KRLS estimators, we set \(\sigma _1^2=\sigma _2^2=3\), the number of independent variables in this example, and implement leave-one-out cross validation to select the hyperparameters \(\lambda _1\) and \(\lambda _2\).2 The variance function under the heteroskedastic case is estimated by least squares from the regression of the log residuals on X; taking the exponential gives the predicted variance estimates. Under the case of AR(1) errors, the covariance function is estimated from an AR(1) model. We run 200 simulations for each of the two cases, and the bias corrected results are reported below in Table 1. To evaluate the estimators, mean squared error is used as the main criterion, and we also investigate the bias and variance. To compare results, all estimators are evaluated on 300 test data points generated from Eqs. (46) and (47).
Table 1
The table reports the bias, variance, and MSE of GKRLS, KRLS, LP, RF, and SVM estimators for the regression function \(m(\textbf{x}_0)\) under the cases of heteroskedastic and AR(1) errors generated from Eqs. (46),(47),(49) and (50). The GKRLS and KRLS estimates are bias corrected. All estimates are averaged across all simulations
  
Simulation evaluation for \(m(\textbf{x}_0)\)

| Errors | Estimator | MSE | Variance | Bias |
|---|---|---|---|---|
| Autocor. errors | GKRLS | 2.8562 | 1.6311 | 0.0140 |
| | KRLS | 2.9767 | 2.3835 | \(-0.0094\) |
| | LP | 3.4623 | 3.0822 | \(-0.0112\) |
| | RF | 3.8442 | 3.5013 | 0.0205 |
| | SVM | 5.7663 | 5.6482 | 0.0263 |
| Heterosk. errors | GKRLS | 0.2287 | 0.1702 | 0.0103 |
| | KRLS | 0.2366 | 0.1766 | \(-0.0148\) |
| | LP | 0.2696 | 0.1958 | 0.0055 |
| | RF | 0.5917 | 0.1372 | 0.0178 |
| | SVM | 0.2632 | 0.2105 | \(-0.0001\) |
Table 1 displays the evaluations, including bias, variance, and MSE, of the estimators for the regression function under both error cases. Note that the GKRLS and KRLS estimates in Table 1 are bias corrected, and all estimates are averaged across all simulations. Estimates based on GKRLS exhibit finite sample bias similar to KRLS, while the variance of the proposed estimator is clearly smaller than that of KRLS. In particular, GKRLS estimation provides a 31.6% and a 3.6% decrease in the variance for estimating the regression function under autocorrelated and heteroskedastic errors, respectively, relative to KRLS. With smaller variance, GKRLS also has a smaller MSE, making GKRLS superior to KRLS. Compared to the other nonparametric estimators, LP, RF, and SVM, the GKRLS estimator outperforms the others in terms of MSE and is the preferred method in the presence of heteroskedasticity or autocorrelation.
Table 2
The table reports the bias, variance, and MSE of the bias corrected GKRLS and KRLS estimators and the cases of heteroskedastic and AR(1) errors for the derivative of the regression function \(m^{(1)}_r(\textbf{x}_0)\) generated from Eqs. (46)–(50). Each row represents the MSE, variance, and bias of the partial derivative estimates with respect to \(X_r\), \(r=1,2,3\). All estimates are averaged across all simulations
  
Simulation evaluation for \(m^{(1)}_r(\textbf{x}_0)\)

| Errors | Variable | GKRLS MSE | GKRLS Variance | GKRLS Bias | KRLS MSE | KRLS Variance | KRLS Bias |
|---|---|---|---|---|---|---|---|
| Autocor. errors | \(X_1\) | 1.1708 | 0.4092 | 0.8239 | 2.1013 | 1.7419 | 0.5017 |
| | \(X_2\) | 0.3800 | 0.0887 | \(-0.3567\) | 0.7745 | 0.5502 | \(-0.2700\) |
| | \(X_3\) | 5.2002 | 0.3361 | \(-2.0737\) | 5.5494 | 1.6599 | \(-1.7282\) |
| Heterosk. errors | \(X_1\) | 0.3290 | 0.2835 | 0.0950 | 0.3291 | 0.2922 | 0.0914 |
| | \(X_2\) | 0.2414 | 0.1695 | \(-0.0421\) | 0.2524 | 0.1718 | \(-0.0534\) |
| | \(X_3\) | 2.0529 | 0.5746 | \(-0.7904\) | 2.1461 | 0.5876 | \(-0.8218\) |
Table 3
The table reports the bias, variance, and MSE of the GKRLS estimator for both the regression function and the partial derivatives and for the cases of heteroskedastic and AR(1) errors generated from Eqs. (46)–(50) for different sample sizes, \(n=100,200,400\). All reported estimates are biased corrected and are averaged across all simulations. The kernel hyperparameters are set as \(\sigma _1^2=\sigma _2^2=3\) and the hyperparameters \(\lambda _1\) and \(\lambda _2\) are found by LOOCV
  
Simulation results for consistency of GKRLS

| Function | n | Autocor. MSE | Autocor. Variance | Autocor. \({\text {Bias}}^2\) | Heterosk. MSE | Heterosk. Variance | Heterosk. \({\text {Bias}}^2\) |
|---|---|---|---|---|---|---|---|
| \(m(\textbf{x}_0)\) | \(n=100\) | 4.9665 | 2.8562 | 1.6112 | 0.4113 | 0.2287 | 0.1309 |
| | \(n=200\) | 2.7170 | 1.6311 | 0.8786 | 0.3012 | 0.1702 | 0.0993 |
| | \(n=400\) | 2.2496 | 1.2251 | 0.7326 | 0.1101 | 0.0585 | 0.0316 |
| \(m^{(1)}_{1}(\textbf{x}_0)\) | \(n=100\) | 2.3091 | 0.5590 | 1.7501 | 0.5880 | 0.5196 | 0.0683 |
| | \(n=200\) | 1.1708 | 0.4092 | 0.7615 | 0.3290 | 0.2835 | 0.0455 |
| | \(n=400\) | 0.6992 | 0.2647 | 0.4345 | 0.1964 | 0.1695 | 0.0269 |
| \(m^{(1)}_{2}(\textbf{x}_0)\) | \(n=100\) | 0.4614 | 0.1164 | 0.3449 | 0.3751 | 0.2702 | 0.1049 |
| | \(n=200\) | 0.3800 | 0.0887 | 0.2913 | 0.2414 | 0.1695 | 0.0719 |
| | \(n=400\) | 0.2962 | 0.0715 | 0.2247 | 0.1601 | 0.1063 | 0.0539 |
| \(m^{(1)}_{3}(\textbf{x}_0)\) | \(n=100\) | 6.6704 | 0.4951 | 6.1753 | 2.8633 | 0.8853 | 1.9780 |
| | \(n=200\) | 5.2002 | 0.3361 | 4.8641 | 2.0529 | 0.5746 | 1.4783 |
| | \(n=400\) | 4.4179 | 0.2261 | 4.1918 | 1.5181 | 0.3793 | 1.1388 |
Table 2 displays the evaluations, including bias, variance, and MSE, of the bias corrected GKRLS and KRLS estimators for the partial derivatives of the regression function with respect to each of the independent variables under both error cases. Since \(X_1\) is discrete, the partial derivative is estimated by the first differences discussed in Sect. 5.1. Similar to the regression estimates, for both heteroskedastic and AR(1) errors, the variability from estimating the derivative is reduced by GKRLS estimation relative to KRLS estimation. In addition, the efficiency gain in estimating both the regression and the derivative is more evident in the AR(1) case than in the heteroskedastic case. A possible explanation for this is that the covariance matrix contains more information in the off-diagonal elements compared to the diagonal covariance matrix in the heteroskedastic case. Overall, when estimating the regression function and its derivative for this simulation example, the reduction in variance and therefore MSE is clearly evident in Tables 1 and 2, making GKRLS the preferred estimator.
Table 3 shows the simulation results for the consistency of GKRLS. The bias, variance, and MSE are reported for sample sizes of \(n=100,200,400\). In this example, we set \(\sigma _1^2=\sigma _2^2=3\), and the hyperparameters \(\lambda _1\) and \(\lambda _2\) are found by LOOCV. For the regression function and the derivatives, and for both error covariance structures, the squared bias, variance, and MSE all decrease as the sample size increases, which implies that the GKRLS estimator is consistent in this simulation exercise.

7 Application

We implement an empirical application from the U.S. airline industry with heteroskedastic and autocorrelated errors using a panel of 6 firms over 15 years.5 We set aside a portion of the data for training and the remainder for testing. We estimate the model with four methods, GKRLS, KRLS, LP, and Generalized Least Squares (GLS), and compare their results in terms of mean squared error (MSE). To evaluate the out of sample performance of each method, the predicted out of sample MSEs are computed as follows
$$\begin{aligned} MSE_e=\frac{1}{n^\prime T}\sum _{i=1}^{n^\prime }\sum _{t=1}^T \big (y_{0,it}-\widehat{m}_e(\textbf{x}_{0,it})\big )^2 \end{aligned}$$
(51)
where \(MSE_e\) is the mean squared error for the \(e^{th}\) estimator and \(n^\prime \) is the number of observations in the testing data set, \(i=1,\ldots ,n^\prime \). In this empirical exercise, \(n^\prime =1\) and \(T=15\), since we leave out the first firm as a test set. To assess the estimated average derivatives, we use the bootstrap to calculate the MSEs for the average partial effects. We report the bootstrapped MSEs for the average derivative as follows.6
$$\begin{aligned} MSE_{e,r}=\frac{1}{B} \sum _{b=1}^B \left( \widehat{m}^{(1)}_{avg,e,r,b} - \frac{1}{4}\sum _{e} \widehat{m}^{(1)}_{avg,e,r}\right) ^2 \end{aligned}$$
(52)
where B is the number of bootstraps with \(b=1,\ldots ,B\), \(\widehat{m}_{avg,e,r,b}^{(1)}(\cdot )\) is the \(b^{th}\) bootstrapped average partial first derivative with respect to the \(r^{th}\) variable for the \(e^{th}\) estimator, and \(\frac{1}{4}\sum _e\widehat{m}^{(1)}_{avg,e,r}\) is the simple average of the average partial first derivatives with respect to the \(r^{th}\) variable from the four estimators (GLS, GKRLS, KRLS, and LP):
$$\begin{aligned} \widehat{m}^{(1)}_{avg,e,r} = \frac{1}{nT}\sum _{i=1}^{n}\sum _{t=1}^{T} \widehat{m}^{(1)}_{e,r}(x_{it}), \quad e \in \left\{ {\text {GLS}},{\text {GKRLS}},{\text {KRLS}},{\text {LP}}\right\} \end{aligned}$$
(53)

7.1 U.S. airline industry

We obtain the data on the efficiency in production of airline services from Greene (2018). Since the data are a panel of 6 firms for 15 years, we consider the one way random effects model:
$$\begin{aligned} \log C_{it}&=m(\log Q_{it},\log P_{it})+\alpha _i +\varepsilon _{it}, \end{aligned}$$
(54)
where the dependent variable \(Y_{it} = \log C_{it}\) is the logarithm of total cost, the independent variables \(X_{it} = (\log Q_{it}, \log P_{it})^{\top }\) are the logarithms of output and the price of fuel, respectively, \(\alpha _i\) is the firm specific effect, and \(\varepsilon _{it}\) is the idiosyncratic error term. In this empirical setting, we assume \(\mathbb {E}[\varepsilon _{it}|\textbf{X}]=0,\; \mathbb {E}[\varepsilon _{it}^2|\textbf{X}]=\sigma ^2_ {\varepsilon _{i}},\; \mathbb {E}[\alpha _i|\textbf{X}]=0,\; \mathbb {E}[\alpha _i^2|\textbf{X}]=\sigma ^2_{\alpha _i},\; \mathbb {E}[\varepsilon _{it}\alpha _j|\textbf{X}]=0\) for all \(i, t, j\), \(\mathbb {E}[\varepsilon _{it}\varepsilon _{js}|\textbf{X}]=0\) if \(t\ne s\) or \(i\ne j\), and \(\mathbb {E}[\alpha _i\alpha _j|\textbf{X}]=0\) if \(i\ne j\). Consider the composite error term \(U_{it}\equiv \alpha _i+\varepsilon _{it}\). Then, the model in Eq. (54) can be rewritten as
$$\begin{aligned} \log C_{it}=m(\log Q_{it},\log P_{it})+U_{it}, \end{aligned}$$
(55)
In Eq. (55), the independent variables are strictly exogenous with respect to the composite error term, \(\mathbb {E}[U_{it}|\textbf{X}]=0\). The variance of the composite error term is \(\mathbb {E}[U_{it}^2|\textbf{X}]=\sigma ^2_{\alpha _i}+\sigma ^2_{\varepsilon _{i}}\). Therefore, in this empirical example, we allow for firm specific heteroskedasticity. In other words, the variances of the error terms are not constant across firms but are constant over time for each firm. Since there is a time component, we allow an individual firm to be correlated across time but not with other firms, that is, \(\mathbb {E}[U_{it}U_{is}|\textbf{X}]=\sigma ^2_{\alpha _i}, \; t\ne s\) and \(\mathbb {E}[U_{it}U_{js}|\textbf{X}]=0\) for all t and s if \(i\ne j\). Note that the correlation across time can be different for every firm. Therefore, in this empirical framework, we allow the error terms to be heteroskedastic across firms and correlated across time.
To estimate Eq. (55) by GKRLS and KRLS in the framework set up in this paper, we can write the model in matrix notation. Consider
$$\begin{aligned} \textbf{y} = \textbf{m}+\textbf{U}, \end{aligned}$$
(56)
where \(\textbf{y}\) is the \(nT\times 1\) vector of \(\log C_{it}\), \(\textbf{m}\) is the \(nT\times 1\) vector of the regression function \(m(X_{it})\), and \(\textbf{U}\) is the \(nT\times 1\) vector of \(U_{it}\), \(i=1,\ldots ,n\) and \(t=1,\ldots ,T\). Then, the \(nT\times nT\) error covariance matrix \(\Omega \) is
$$\begin{aligned} \Omega ={\text {Var}}[\textbf{U}|\textbf{X}] = {\text {diag}}(\Sigma _1, \ldots ,\Sigma _n), \end{aligned}$$
(57)
where \(\Sigma _i=\sigma ^2_{\varepsilon _{i}}\textbf{I}_T +\sigma ^2_{\alpha _i} \varvec{\iota }_T\varvec{\iota }^\top _T, i=1,\ldots ,n\) has dimension \(T\times T\), \(\textbf{I}_T\) is a \(T\times T\) identity matrix and \(\varvec{\iota }_T\) is a \(T\times 1\) vector of ones. To use the GKRLS estimator in this empirical framework, we first estimate Eqs. (55) or (56) by KRLS and obtain the residuals, denoted by \(\widehat{u}_{it}\). To estimate the error covariance matrix \(\Omega \), the variances of the firm specific error and the idiosyncratic error, \(\sigma ^2_{\alpha _i}\) and \(\sigma ^2_{\varepsilon _{i}}\) need to be estimated. Consider the following consistent estimators using time averages,
$$\begin{aligned} \widehat{\sigma }^2_{U_i}= & {} \frac{1}{T} \widehat{\textbf{u}}_i^\top \widehat{\textbf{u}}_i \end{aligned}$$
(58)
$$\begin{aligned} \widehat{\sigma }^2_{\alpha _i}= & {} \frac{1}{T(T-1)/2} \sum _{t=1}^{T-1} \sum _{s=t+1}^{T} \widehat{u}_{it}\widehat{u}_{is} \end{aligned}$$
(59)
$$\begin{aligned} \widehat{\sigma }^2_{\varepsilon _{i}}= & {} \widehat{\sigma }^2_{U_i} - \widehat{\sigma }^2_{\alpha _i}, \end{aligned}$$
(60)
where \(\widehat{\textbf{u}}_i\) is the \(T\times 1\) vector of residuals for the ith firm. Now, plugging these estimates in for \(\Omega \), the GKRLS estimator can be estimated as in the previous sections. For further details, please see Appendix H.
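A sketch of this covariance estimation step, assuming the residual vector is ordered firm by firm (all T periods of firm 1, then firm 2, and so on); block_diag is SciPy's block-diagonal constructor, and the function name is ours.

```python
import numpy as np
from scipy.linalg import block_diag

def estimate_re_covariance(resid, n_firms, T):
    """Block-diagonal error covariance of Eq. (57) built from Eqs. (58)-(60)."""
    blocks = []
    for i in range(n_firms):
        u_i = resid[i * T:(i + 1) * T]                 # residuals for firm i
        sigma2_U = u_i @ u_i / T                       # Eq. (58)
        # Eq. (59): average of u_it * u_is over all t != s (equals the distinct-pair average)
        sigma2_alpha = (np.sum(np.outer(u_i, u_i)) - u_i @ u_i) / (T * (T - 1))
        sigma2_eps = sigma2_U - sigma2_alpha           # Eq. (60)
        blocks.append(sigma2_eps * np.eye(T) + sigma2_alpha * np.ones((T, T)))  # Sigma_i
    return block_diag(*blocks)                         # Omega_hat, Eq. (57)
```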
With regards to the other comparable estimators, the KRLS and LP estimators are used to estimate Eqs. (55) or (56) ignoring the heteroskedasticity and correlation in the composite error, \(\textbf{U}\). Note that the KRLS estimator uses the error covariance matrix in the variances and standard errors but does not use the error covariance in estimating the regression function. Lastly, the GLS estimator is used as a parametric benchmark to compare to the standard random effects panel data model.7
The data contain 90 observations on 6 firms for 15 years, from 1970–1984. We split the data into two parts: the first 15 observations, which correspond to the first firm, are used as testing data, and the remaining 75 observations, which correspond to the last five firms, are used as training data, so that out of sample performance can be evaluated. Thus, the training data, \(i=1,\ldots ,5\) and \(t=1,\ldots ,15\), contain a total of 75 observations. For the GKRLS and KRLS estimators, all hyperparameters are chosen via LOOCV.8
Table 4
Bias corrected average partial derivatives and their standard errors in parentheses are reported for GLS, GKRLS, KRLS, and LP estimators. The columns represent the estimates of the average partial derivative with respect to each regressor
 
Average partial derivatives for airline data

| Estimator | log (Q) | log (P) |
|---|---|---|
| GLS | 0.8436 (0.0311) | 0.4188 (0.0181) |
| GKRLS | 0.8130 (0.0034) | 0.4247 (0.0082) |
| KRLS | 0.8248 (0.016) | 0.4581 (0.0457) |
| LP | 0.5885 (0.0276) | 0.2260 (0.0138) |
The bias corrected average partial derivatives and corresponding standard errors are reported in Table 4. These averages are calculated by training each estimator on the five firms with 75 observations in the training data set. The estimates are bias corrected, and the results from Sect. 5 are used in our calculations. All estimators display positive and significant relationships between cost and each of the regressors, output and price, with their average partial derivatives being positive. The elasticity with respect to output ranges from 0.5885 to 0.8436, and the elasticity with respect to price ranges from 0.2260 to 0.4581. More specifically, for the GKRLS estimator, a 10% increase in output would increase total cost by an average of 8.13% and a 10% increase in fuel price would increase total cost by an average of 4.25%, holding all else fixed. Comparing the GKRLS and KRLS methods, the estimates of the average partial derivatives are similar, but the standard errors are significantly reduced for GKRLS for both output and fuel price, implying a gain in efficiency. Therefore, using the information and the structure of the error covariance in Eq. (57) in estimating the regression function allows GKRLS to provide more robust estimates of the average partial effects of each independent variable compared to KRLS.
Table 4 shows that the GLS estimator slightly overestimates the elasticity with respect to output and underestimates the elasticity with respect to fuel price compared to those of GKRLS. The LP estimator appears to provide different average partial effect estimates compared to the rest of the estimators. One possible explanation is that the bandwidths may not be optimal, since data-driven bandwidth selection methods (e.g., cross validation) fail when there is correlation in the errors (De Brabanter et al. 2018). Since the data are panel structured, there is correlation across time, making bandwidth selection for LP estimators difficult. The LP estimates are from the local constant estimator; however, the local linear estimator provides similar estimates of the average partial effects. Nevertheless, the LP average partial effects of each variable are positive and significant, which is consistent with the other methods. Furthermore, GKRLS provides similar average partial effects with respect to output and price but is more efficient, with smaller standard errors, relative to the other considered estimators.
Table 5
The MSEs are reported for the GLS, GKRLS, KRLS, and LP, estimators. The first column are the out of sample MSEs calculated by Eq. (51) and the second and third columns are the bootstrapped MSEs for the average partial derivatives calculated by Eq. (52). The GKRLS and KRLS estimates are bias corrected
 
MSEs for airline data

| Estimator | \({\text {MSE}}\) | \({\text {MSE}}_{\log Q}\) | \({\text {MSE}}_{\log P}\) |
|---|---|---|---|
| GLS | 0.0106 | 0.0042 | 0.0018 |
| GKRLS | 0.0091 | 0.0030 | 0.00001 |
| KRLS | 0.0306 | 0.0031 | 0.0024 |
| LP | 0.0191 | 0.2900 | 0.0867 |
To assess the estimators in terms of out of sample performance, we calculate the MSEs using the 15 observations in the testing data set. Table 5 reports the MSEs for the four considered estimators. The first column reports the out of sample MSEs using the 15 observations from the first firm. Of all the considered estimators, the GKRLS estimator outperforms the others in terms of MSE. In other words, the GKRLS estimator can be seen as the superior method for estimating the regression function in this empirical example. The bootstrapped MSEs for the average partial derivatives, calculated by Eq. (52), are reported in the second and third columns of Table 5. For both the average partial derivatives with respect to output and price, GKRLS produces the lowest MSE, outperforming the other estimators. In addition, since GKRLS incorporates the error covariance structure, efficiency is gained and reductions in MSE are achieved relative to KRLS. Overall, GKRLS is the best method in terms of MSE for estimating both the airline cost function and the average partial effects with respect to output and price.

8 Conclusion

Overall, this paper proposes a nonparametric regression function estimator via KRLS under a general parametric error covariance. The two-step procedure allows for heteroskedastic and serially correlated errors: in the first step, KRLS is used to estimate the regression function and the parametric error covariance, and in the second step, KRLS is used to re-estimate the regression function using the information in the error covariance. The method improves the efficiency of the regression estimates as well as the partial effects estimates compared to standard KRLS. The conditional bias and variance, pointwise marginal effects, consistency, and asymptotic normality of GKRLS are provided. Simulations show reductions in variance and MSE for GKRLS relative to KRLS. An empirical example is illustrated by estimating an airline cost function under a random effects model with heteroskedastic and correlated errors. The average derivatives are evaluated, and the average partial effects of the inputs are determined in the application. In the empirical exercise, GKRLS is more efficient than KRLS and is the preferred method, in terms of MSE, for estimating the airline cost function and its average partial derivatives.

Declarations

Conflicts of interest

Justin Dang declares that he has no conflict of interest. Aman Ullah declares that he has no conflict of interest.

Human or animal rights

This article does not contain any studies with human participants performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proof of Theorem 1

First, we note that the GKRLS estimator is a linear smoother by substituting Eqs. (10) into (11)
$$\begin{aligned} \widehat{m}_{2}(\textbf{x}_0)&= \sum _{i=1}^n \widehat{c}_{2,i} K_{\sigma _2}(\textbf{x}_i,\textbf{x}_0)\\&=K_{\sigma _2,\textbf{x}_0}^{*\top } \widehat{\textbf{c}}_2\\&=K_{\sigma _2,\textbf{x}_0}^{*\top } (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\textbf{y}\\&=L(\textbf{x}_0)^\top \textbf{y}, \end{aligned}$$
where \(L(\textbf{x}_0)=\left[ K_{\sigma _2,\textbf{x}_0}^{*\top } (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top \) and \(K_{\sigma _2,\textbf{x}_0}^{*}= (K_{\sigma _2}(\textbf{x}_1,\textbf{x}_0),\ldots ,K_{\sigma _2}(\textbf{x}_n,\textbf{x}_0))^\top \) the kernel vector evaluated at point \(\textbf{x}_0\).
Then, the conditional mean and variance of GKRLS can be derived as follows
$$\begin{aligned} \mathbb {E}[\widehat{m}_{2}|X=\textbf{x}_0]&=L(\textbf{x}_0)^\top \mathbb {E}[\textbf{y}|\textbf{X}]\\&=L(\textbf{x}_0)^\top \textbf{m} \end{aligned}$$
and
$$\begin{aligned} {\text {Var}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0]&=L(\textbf{x}_0)^\top {\text {Var}}[\textbf{y}|\textbf{X}]L(\textbf{x}_0)\\&= L(\textbf{x}_0)^\top \Omega L(\textbf{x}_0). \end{aligned}$$

B Proof of Theorem 2

The exact bias for GKRLS for the training data is given by
$$\begin{aligned} \mathbb {E}[\widehat{\textbf{m}}_2|X=\textbf{x}]-\textbf{m} = (\textbf{L}-\textbf{I})\textbf{m}, \end{aligned}$$
and observe that the residuals are obtained by
$$\begin{aligned} \widehat{\textbf{u}}_2&= \textbf{y} - \widehat{\textbf{m}}_2\\&= \textbf{y} - \textbf{Ly}\\&= (\textbf{I}-\textbf{L})\textbf{y}. \end{aligned}$$
And the expectation of the residuals is given by
$$\begin{aligned} \mathbb {E}[\widehat{\textbf{u}}_2|X=\textbf{x}]&= \textbf{m}- \textbf{Lm}\\&= -{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]. \end{aligned}$$
De Brabanter et al. (2011) suggests estimating the conditional bias by smoothing the negative residuals
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{\textbf{m}}_2|\textbf{X}]&= - \textbf{L}\widehat{\textbf{u}}_2\\&= -\textbf{L}(\textbf{I} - \textbf{L})\textbf{y}\\&= (\textbf{L}-\textbf{I})\widehat{\textbf{m}}_2. \end{aligned}$$
Therefore, the conditional bias can be estimated at any point \(\textbf{x}_0\) by
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \widehat{\textbf{m}} - \widehat{m}_{2}(\textbf{x}_0) \end{aligned}$$
For the conditional variance, we assume that the error covariance matrix \(\Omega =\Omega (\theta )\) can be consistently estimated by \(\widehat{\Omega }=\widehat{\Omega }(\widehat{\theta })\). Then, using a consistent estimator of the error covariance matrix, the conditional variance of GKRLS can be estimated by
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{2}(\textbf{x}_0)|X=\textbf{x}_0] = L(\textbf{x}_0)^\top \widehat{\Omega } L(\textbf{x}_0). \end{aligned}$$

C Proof of Theorem 3

Since the bias corrected fitted values, \(\widehat{\textbf{m}}_c\), have zero conditional bias, we can focus on the conditional variance. From Theorem 1, the conditional variance of the GKRLS estimator is
$$\begin{aligned} {\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]&=\textbf{L}\Omega \textbf{L}^\top \\&=\textbf{L}PP^\top \textbf{L}^\top \\&=\textbf{L}P(\textbf{L}P)^\top \\&=\textbf{A}\textbf{A}^\top , \end{aligned}$$
where \(\textbf{A}\equiv \textbf{L}P\). Consider the singular value decomposition of \(\textbf{A}\), where \(\textbf{D}\), \(\textbf{U}\), \(\textbf{V}\) are the singular values, left singular vectors, and right singular vectors respectively.
$$\begin{aligned} {\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}]&=\textbf{A}\textbf{A}^\top \\&=\textbf{UDV}(\textbf{UDV})^\top \\&=\textbf{UD}^2\textbf{U}^\top \\&=\textbf{U}\begin{pmatrix} d_1^2 &{}\ldots &{}0\\ \vdots &{} \ddots &{}\vdots \\ 0 &{}\ldots &{} d_n^2 \end{pmatrix} \textbf{U}^\top , \end{aligned}$$
where \(d_i, i=1,\ldots ,n\) denotes the ith diagonal element of \(\textbf{D}\), i.e. the ith singular value of \(\textbf{L}P\). To examine the sum of the variances of \(\widehat{\textbf{m}}_2\), the trace of the variance matrix is evaluated.
$$\begin{aligned} {\text {tr}}({\text {Var}}[\widehat{\textbf{m}}_2|\textbf{X}])&={\text {tr}}(\textbf{UD}^2\textbf{U}^\top )\\&={\text {tr}}(\textbf{D}^2\textbf{U}^\top \textbf{U})\\&={\text {tr}}(\textbf{D}^2)\\&=\sum _i^n d_i^2. \end{aligned}$$
Recall that \(d_i^2\) is the ith squared singular value of \(\textbf{L}P\) and is proportional to the variance explained by the corresponding singular vector of \(\textbf{L}P\). Given the construction of \(\textbf{L}P\), its columns can be viewed as weights on the data, scaled by the standard deviation of the error term. The number of large singular values therefore grows with n at first, but the number of important dimensions eventually grows only slowly with n, so that for large enough n, \({\text {tr}}(\textbf{D}^2)\) slows in growth and converges to some constant M. As a result, the average variance of \(\widehat{m}(\textbf{x}_i)\), \(\frac{1}{n}\sum _{i=1}^n d_i^2\), shrinks to zero as \(n\rightarrow \infty \), and since the average variance shrinks to zero, each individual variance must also approach zero as n becomes large.
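As a purely numerical check of the trace identity used above (reusing \(\textbf{L}\) and \(\Omega \) from the earlier sketches; a Cholesky factor is one valid choice of \(P\)):
```python
P = np.linalg.cholesky(Omega)            # one valid choice of P with Omega = P P'
A = L_full @ P
d = np.linalg.svd(A, compute_uv=False)   # singular values d_1, ..., d_n of L P
trace_var = np.trace(L_full @ Omega @ L_full.T)
assert np.isclose(trace_var, np.sum(d ** 2))
avg_var = np.mean(d ** 2)                # average pointwise variance (1/n) sum d_i^2
```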
We also provide an alternative proof of consistency. Consider the GKRLS coefficient estimator of \(\textbf{c}\) in Eq. (10):
$$\begin{aligned} \widehat{c}_2&= (\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2 \textbf{I})^{-1}\Omega ^{-1}\textbf{y}\\&=(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2 \textbf{I})^{-1}\Omega ^{-1}\left( \textbf{K}_{\sigma _2}\textbf{c}+\textbf{u}\right) \\&=\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}+\frac{\lambda _2}{n} \textbf{I}\right) ^{-1}\frac{1}{n}\Omega ^{-1}\left( \textbf{K}_{\sigma _2}\textbf{c}+\textbf{u}\right) \\&=\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}+\frac{\lambda _2}{n} \textbf{I}\right) ^{-1}\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}\right) \textbf{c} +\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}+\frac{\lambda _2}{n} \textbf{I}\right) ^{-1}\left( \frac{1}{n}\Omega ^{-1}\right) \textbf{u}\\ \end{aligned}$$
Again, since we consider the bias corrected estimator, \(\widehat{\textbf{m}}_{2,c}\), we can focus on the conditional variance. However, below we also show that the non-bias corrected estimator has zero conditional bias in the limit. Taking the conditional bias of \(\widehat{\textbf{c}}_2\):
$$\begin{aligned} {\text {Bias}}[\widehat{\textbf{c}}_2|\textbf{X}]&= \left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}+\frac{\lambda _2}{n} \textbf{I}\right) ^{-1} \left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}\right) \textbf{c} - \textbf{c}, \end{aligned}$$
where the strict exogeneity assumption \(\mathbb {E}[\textbf{u}|\textbf{X}]=\varvec{0}\) is used. Furthermore, if we assume \(\lambda _2\) is fixed or does not grow as fast as n and \(\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}\right) \rightarrow \textbf{Q}\), a positive definite matrix with finite elements, when \(n\rightarrow \infty \), then \({\text {Bias}}[\widehat{\textbf{c}}_2|\textbf{X}]\rightarrow \varvec{0}\) as \(n\rightarrow \infty \).
Taking the conditional variance of \(\widehat{\textbf{c}}_2\):
$$\begin{aligned} {\text {Var}}[\widehat{\textbf{c}}_2|\textbf{X}]&=\frac{1}{n}\left( \frac{\Omega ^{-1}\textbf{K}_{\sigma _2}}{n}+\frac{\lambda _2\textbf{I}}{n} \right) ^{-1}\left( \frac{\Omega ^{-1}}{n}\right) \left[ \left( \frac{\Omega ^{-1}\textbf{K}_{\sigma _2}}{n}+\frac{\lambda _2\textbf{I}}{n} \right) ^{-1}\right] ^\top . \end{aligned}$$
Again, we assume that \(\lambda _2\) is fixed or does not grow as fast as n and \(\left( \frac{1}{n}\Omega ^{-1}\textbf{K}_{\sigma _2}\right) \rightarrow \textbf{Q}\), a positive definite matrix with finite elements. Furthermore, if we assume that \(\left( \frac{1}{n}\Omega ^{-1}\right) \rightarrow \textbf{Q}_\Omega \), a matrix with finite elements when \(n\rightarrow \infty \), then \({\text {Var}}[\widehat{\textbf{c}}_2|\textbf{X}]\rightarrow \varvec{0}\) as \(n\rightarrow \infty \). Therefore, \(\underset{n\rightarrow \infty }{{\text {plim}}} \; \widehat{\textbf{c}}_2 = \textbf{c}\).
Now, consider the GKRLS estimator \(\widehat{\textbf{m}}_2 = \textbf{K}_{\sigma _2} \widehat{\textbf{c}}_2\). Then,
$$\begin{aligned} \underset{n\rightarrow \infty }{{\text {plim}}}\; \widehat{\textbf{m}}_2&= \textbf{K}_{\sigma _2} \left( \underset{n\rightarrow \infty }{{\text {plim}}}\; \widehat{\textbf{c}}_2\right) \\&=\textbf{K}_{\sigma _2}\textbf{c}\\&=\textbf{m}, \end{aligned}$$
proving consistency of \(\widehat{\textbf{m}}_2\). Note that since the variance is O(1/n), \(\widehat{\textbf{m}}_2\) is \(\sqrt{n}\)-consistent.

D Proof of Theorem 4

Consider the difference between the bias corrected fitted values and the true values, \(\widehat{\textbf{m}}_2 - {\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m}\), where \({\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]=\textbf{Lm}-\textbf{m}\),
$$\begin{aligned} \widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m} =\textbf{Lu} \end{aligned}$$
Note that \(\mathbb {E}[\textbf{Lu}|\textbf{X}]=\varvec{0}\) and \({\text {Var}}[\textbf{Lu}|\textbf{X}]=\textbf{L}\Omega \textbf{L}^\top \). The following results are for the case of heteroskedastic errors, where observations are independent and heterogeneously distributed. Consider the individual variance for each observation,
$$\begin{aligned} {\text {Var}}[L(\textbf{x}_i)u_i|\textbf{X}] = L(\textbf{x}_i)^\top \Omega L(\textbf{x}_i) \end{aligned}$$
and let \(s_n^2\) be the sum of the variances,
$$\begin{aligned} s_n^2 = \sum _{i=1}^n L(\textbf{x}_i)^\top \Omega L(\textbf{x}_i). \end{aligned}$$
Provided that the sum is not dominated by any particular term, that the \(L(\textbf{x}_i)u_i\) are independent vectors with mean \(\varvec{0}\) and finite variance \(L(\textbf{x}_i)^\top \Omega L(\textbf{x}_i)<\infty \), and that \(s_n^2\rightarrow \infty \) as \(n\rightarrow \infty \), then
$$\begin{aligned} \sqrt{n}\textbf{Lu}\overset{d}{\rightarrow }\ N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top ), \end{aligned}$$
by the Lindeberg-Feller central limit theorem. It then follows that
$$\begin{aligned} \sqrt{n}(\widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}]-\textbf{m})\overset{d}{\rightarrow } N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top ). \end{aligned}$$
The following results are for the case of autocorrelated errors, where observations are dependent and identically distributed.9 Define \(\textbf{L}_n \equiv \textbf{K}_{\sigma _2} \left( \frac{\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2 \textbf{I}}{n}\right) ^{-1}\Omega ^{-1}\) and let \({L}_n(\textbf{X}_t)\) denote the \(t\)th row of \(\textbf{L}_n\). We assume: (i) \(Y_t = m(\textbf{X}_t) + u_t, t=1,2,\ldots \); (ii) \(\left\{ (\textbf{X}_t, u_t) \right\} \) is a stationary ergodic sequence; (iii) (a) \(\left\{ L_n(X_{thi}) u_{th}, {\mathcal {F}}_t \right\} \) is an adapted mixingale of size \(-1\), \(h=1,\ldots ,p, i=1,\ldots ,n\); (b) \(\mathbb {E}|L_n(X_{thi}) u_{th} |^2 < \infty , h=1,\ldots ,p, i=1, \ldots , n\); (c) \(\textbf{V}_n \equiv {\text {Var}}\left( \frac{1}{\sqrt{n}} \textbf{L}_n\textbf{u}\right) \) is uniformly positive definite; (iv) \(\mathbb {E}|L_n(X_{thi})|^2 < \infty , h=1,\ldots ,p, i=1,\ldots ,n\); (v) \(\underset{n\rightarrow \infty }{\lim }\ L_n(\textbf{X}_t)=L(\textbf{X}_t)\) and \(\underset{n\rightarrow \infty }{\lim }\ \textbf{L}_n=\textbf{L}\).
Consider \(n^{-1/2}\sum _{t=1}^n Z_t\), where \(Z_t \equiv \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_t) u_t\), \(\varvec{\lambda }\) is any vector satisfying \(\varvec{\lambda }^\top \varvec{\lambda }=1\), and \(\textbf{V}\) is any finite positive definite matrix. We verify the conditions of Theorem 3.35 of White (2001): \(\left\{ Z_t, {\mathcal {F}}_t \right\} \) is an adapted stochastic sequence because \(Z_t\) is measurable with respect to \({\mathcal {F}}_t\). To see that \(\mathbb {E}(Z_t^2)<\infty \), note that we can write
$$\begin{aligned} Z_t&= \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_t) u_t \\&= \sum _{h=1}^p \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_{th}) u_{th}\\&= \sum _{h=1}^p \sum _{i=1}^n \tilde{\lambda }_i L_n(X_{thi}) u_{th}, \end{aligned}$$
where \(\tilde{\lambda }_i\) is the ith element of the \(n\times 1\) vector \(\tilde{\varvec{\lambda }} \equiv \textbf{V}^{-1/2} \varvec{\lambda }\). By definition of \(\varvec{\lambda }\) and \(\textbf{V}\), there exists \(\Delta < \infty \) such that \(|\tilde{\lambda }_i|<\Delta \) for all i. It follows from Minkowski’s inequality that
$$\begin{aligned} \mathbb {E}(Z_t^2)&\le \left[ \sum _{h=1}^p \sum _{i=1}^n \left( \mathbb {E} | \tilde{\lambda }_i L_n(X_{thi}) u_{th} |^2 \right) ^{1/2} \right] ^2 \\&\le \left[ \Delta \sum _{h=1}^p \sum _{i=1}^n \left( \mathbb {E} | L_n(X_{thi}) u_{th} |^2 \right) ^{1/2} \right] ^2\\&\le [ \Delta p n \Delta ^{1/2} ]^2 < \infty , \end{aligned}$$
since for \(\Delta \) sufficiently large, \(\mathbb {E} | L_n(X_{thi}) u_{th} |^2< \Delta < \infty \) given (iii.b) and the stationarity assumption. Next, we show \(\left\{ Z_t, {\mathcal {F}}_t \right\} \) is a mixingale of size \(-1\). Using the expression for \(Z_t\) just given, we can write
$$\begin{aligned} \mathbb {E}([\mathbb {E}(Z_0|{\mathcal {F}}_{-m})]^2)&= \mathbb {E} \left( \left[ \mathbb {E}\left( \sum _{h=1}^p \sum _{i=1}^n \tilde{\lambda }_i L_n(X_{0hi}) u_{0h} \Big | {\mathcal {F}}_{-m} \right) \right] ^2 \right) \\&= \mathbb {E} \left( \left[ \sum _{h=1}^p \sum _{i=1}^n \mathbb {E} \left( \tilde{\lambda }_i L_n(X_{0hi}) u_{0h} | {\mathcal {F}}_{-m} \right) \right] ^2 \right) . \end{aligned}$$
Applying Minkowski’s inequality, it follows that
$$\begin{aligned} \mathbb {E}([\mathbb {E}(Z_0|{\mathcal {F}}_{-m})]^2)&\le \left[ \sum _{h=1}^p \sum _{i=1}^n \left( \mathbb {E}\left[ \mathbb {E}\left( \tilde{\lambda }_i L_n(X_{0hi}) u_{0h} | {\mathcal {F}}_{-m} \right) ^2 \right] \right) ^{1/2} \right] ^2\\&\le \left[ \Delta \sum _{h=1}^p \sum _{i=1}^n \left( \mathbb {E}\left[ \mathbb {E}(L_n(X_{0hi}) u_{0h} | {\mathcal {F}}_{-m})^2 \right] \right) ^{1/2} \right] ^2\\&\le \left[ \Delta \sum _{h=1}^p \sum _{i=1}^n c_{0hi} \gamma _{mhi} \right] ^2\\&\le [\Delta pn \bar{c}_0 \bar{\gamma }_m]^2, \end{aligned}$$
where \(\bar{c}_0 = \max _{h,i} c_{0hi} < \infty \) and \(\bar{\gamma }_m = \max _{h,i} \gamma _{mhi}\) is of size \(-1\). Thus, \(\lbrace Z_t, {\mathcal {F}}_t\rbrace \) is a mixingale of size \(-1\). Note that
$$\begin{aligned} {\text {Var}}(\sqrt{n}\bar{Z}_n)&= {\text {Var}}\left( \frac{1}{\sqrt{n}}\sum _{t=1}^n \varvec{\lambda }^\top \varvec{V}^{-1/2} L_n(\textbf{X}_t) u_t \right) \\&= \varvec{\lambda }^\top \textbf{V}^{-1/2} \textbf{V}_n \textbf{V}^{-1/2} \varvec{\lambda } \rightarrow \bar{\sigma }^2 < \infty . \end{aligned}$$
Hence \(\textbf{V}_n\) converges to a finite matrix. Set \(\textbf{V}=\lim _{n\rightarrow \infty } \textbf{V}_n=\textbf{L}\Omega \textbf{L}^{\top }\), which is positive definite given (iii.c). Then \(\bar{\sigma }^2 = \varvec{\lambda }^\top \textbf{V}^{-1/2} \textbf{V} \textbf{V}^{-1/2} \varvec{\lambda }=1\), and by the martingale central limit theorem, \(n^{-1/2}\sum _{t=1}^n \varvec{\lambda }^\top \textbf{V}^{-1/2} L_n(\textbf{X}_t) u_t \overset{d}{\rightarrow }\ N(0,1)\). Since this holds for every \(\varvec{\lambda }\) such that \(\varvec{\lambda }^\top \varvec{\lambda }=1\), it follows from the Cramér-Wold theorem that \(n^{-1/2}\textbf{V}^{-1/2} \sum _{t=1}^n L_n(\textbf{X}_t) u_t \overset{d}{\rightarrow }\ N(\varvec{0}, \textbf{I})\). Hence, \(\sqrt{n}\textbf{Lu}\overset{d}{\rightarrow } N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top )\), and it then follows that
$$\begin{aligned} \sqrt{n}(\widehat{\textbf{m}}_2-{\text {Bias}}[\widehat{\textbf{m}}_2|\textbf{X}] -\textbf{m})\overset{d}{\rightarrow } N(\varvec{0},\textbf{L}\Omega \textbf{L}^\top ). \end{aligned}$$

E Proof of Theorem 5

First, we note that the GKRLS derivative estimator is a linear smoother by substituting Eqs. (10) into (28),
$$\begin{aligned} \widehat{m}_{2,r}^{(1)}(\textbf{x}_0)&= \frac{2}{\sigma _2^2} \sum _{i=1}^n {\text {e}}^{-\frac{1}{\sigma _2^2} || \textbf{x}_i - \textbf{x}_0||^2} (\textbf{x}_i^{(r)} -\textbf{x}_0^{(r)}) \widehat{{c}}_{2,i}\\&= K_{\sigma _2,\textbf{x}_0}^{*\top } \Delta _r \widehat{\textbf{c}}_2\\&=K_{\sigma _2,\textbf{x}_0}^{*\top } \Delta _r(\Omega ^{-1}\textbf{K}_{\sigma _2}+\lambda _2\textbf{I})^{-1}\Omega ^{-1}\textbf{y}\\&=S_r(\textbf{x}_0)^\top \textbf{y}, \end{aligned}$$
where \(\Delta _r \equiv \frac{2}{\sigma _2^2}{\text {diag}} (\textbf{x}_1^{(r)} -\textbf{x}_0^{(r)},\ldots ,\textbf{x}_n^{(r)}-\textbf{x}_0^{(r)})\) is an \(n\times n\) diagonal matrix and
$$\begin{aligned} S_r(\textbf{x}_0)=\left[ K_{\sigma _2,\textbf{x}_0}^{*\top } \Delta _r(\Omega ^{-1}\textbf{K}_{\sigma _2} +\lambda _2\textbf{I})^{-1}\Omega ^{-1}\right] ^\top \end{aligned}$$
(61)
is the smoother vector for the first partial derivative with respect to the rth variable. Then, the conditional mean and variance of the GKRLS derivative can be derived as follows
$$\begin{aligned} \mathbb {E}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0]&=S_r(\textbf{x}_0)^\top \mathbb {E}[\textbf{y}|\textbf{X}]\\&=S_r(\textbf{x}_0)^\top \textbf{m} \end{aligned}$$
and
$$\begin{aligned} {\text {Var}}[\widehat{m}^{(1)}_{2,r}(\textbf{x}_0)|X=\textbf{x}_0]&=S_r(\textbf{x}_0)^\top {\text {Var}}[\textbf{y}|\textbf{X}] S_r(\textbf{x}_0)\\&= S_r(\textbf{x}_0)^\top \Omega S_r(\textbf{x}_0). \end{aligned}$$
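For illustration, a minimal sketch of the derivative smoother in Eq. (61), reusing the hypothetical data and helper functions from the earlier sketches:
```python
def gkrls_derivative_smoother(X, x0, Omega, sigma_sq, lam2, r):
    # S_r(x0) = [K*_{x0}' Delta_r (Omega^{-1} K + lam2 I)^{-1} Omega^{-1}]'
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma_sq)
    k0 = gaussian_kernel(X, x0[None, :], sigma_sq).ravel()
    Delta_r = np.diag(2.0 / sigma_sq * (X[:, r] - x0[r]))   # n x n diagonal matrix
    Omega_inv = np.linalg.inv(Omega)
    M = np.linalg.solve(Omega_inv @ K + lam2 * np.eye(n), Omega_inv)
    return M.T @ Delta_r @ k0

S0 = gkrls_derivative_smoother(X, x0, Omega, sigma_sq=2.0, lam2=0.1, r=0)
deriv_hat_x0 = S0 @ y            # estimated partial derivative at x0, S_r(x0)' y
deriv_var_x0 = S0 @ Omega @ S0   # its conditional variance, S_r(x0)' Omega S_r(x0)
```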

F Proof of Theorem 6

The conditional bias of the GKRLS derivative estimator in Eq. (28) is
$$\begin{aligned} {\text {Bias}}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|X=\textbf{x}_0]&= S_r(\textbf{x}_0)^{\top } \mathbb {E}[\textbf{y}|\textbf{X}]-{m}_r^{(1)}(\textbf{x}_0)\\&=S_r(\textbf{x}_0)^{\top }\textbf{m}- {m}_r^{(1)}(\textbf{x}_0), \end{aligned}$$
where \({m}_r^{(1)}(\textbf{x}_0)\) is the true first partial derivative of m with respect to the rth variable. Since both this quantity and \(\textbf{m}\) are unknown, we estimate both in order to calculate the conditional bias.
$$\begin{aligned} \widehat{{\text {Bias}}}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|X=\textbf{x}_0]&= S_r(\textbf{x}_0)^{\top }\widehat{\textbf{m}}_2- \widehat{m}_{2,r}^{(1)}(\textbf{x}_0), \end{aligned}$$
where \(\widehat{\textbf{m}}_2\) is the \(n\times 1\) vector of in sample GKRLS predictions of \(\textbf{m}\) and \(\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)\) is the estimated GKRLS derivative prediction evaluated at point \(\textbf{x}_0\).
For the conditional variance, we assume that the error covariance matrix \(\Omega =\Omega (\theta )\) can be consistently estimated by \(\widehat{\Omega }=\widehat{\Omega }(\widehat{\theta })\). Then, using a consistent estimator of the error covariance matrix, the conditional variance of the GKRLS derivative estimator can be estimated by
$$\begin{aligned} \widehat{{\text {Var}}}[\widehat{m}_{2,r}^{(1)}(\textbf{x}_0)|X=\textbf{x}_0]&= S_r(\textbf{x}_0)^\top \widehat{\Omega } S_r(\textbf{x}_0) \end{aligned}$$
(62)

G Proof of Theorem 7

The conditional bias of the GKRLS first difference estimator in Eq. (38) is
$$\begin{aligned} {\text {Bias}}&[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0] \\ \negthickspace&= L(x^{(b)}=1,\textbf{x}_0)^\top \textbf{m} - m(x^{(b)}=1,\textbf{x}_0) - \left[ L(x^{(b)}=0,\textbf{x}_0)^\top \textbf{m} - m(x^{(b)}=0,\textbf{x}_0) \right] \\&= \left[ L(x^{(b)}=1,\textbf{x}_0) - L(x^{(b)}=0,\textbf{x}_0)\right] ^\top \textbf{m} - \left[ m(x^{(b)}=1,\textbf{x}_0) - m(x^{(b)}=0, \textbf{x}_0) \right] \\&= L_{FD_b}(\textbf{x}_0)^\top \textbf{m} - m_{FD_b}(\textbf{x}_0), \end{aligned}$$
where \(m_{FD_b}(\textbf{x}_0) = m(x^{(b)}=1,\textbf{x}_0) - m(x^{(b)}=0, \textbf{x}_0)\) is the true first difference of m with respect to the bth variable and \(L_{FD_b}(\textbf{x}_0) = L(x^{(b)}=1,\textbf{x}_0) - L(x^{(b)}=0,\textbf{x}_0)\) is the first difference smoother vector.
The conditional variance of the GKRLS first difference estimator in Eq. (38) is
$$\begin{aligned} {{\text {Var}}}&[\widehat{m}_{FD_b}(\textbf{x}_0)|X=\textbf{x}_0] \\&= L(x^{(b)}=1,\textbf{x}_0)^\top \Omega L(x^{(b)}=1,\textbf{x}_0) + L(x^{(b)}=0, \textbf{x}_0)^\top \Omega L(x^{(b)}=0,\textbf{x}_0) \\&\qquad - L(x^{(b)}=1,\textbf{x}_0)^\top \Omega L(x^{(b)}=0,\textbf{x}_0) - L(x^{(b)}=0, \textbf{x}_0)^\top \Omega L(x^{(b)}=1,\textbf{x}_0)\\&= \left[ L(x^{(b)}=1,\textbf{x}_0) - L(x^{(b)}=0,\textbf{x}_0) \right] ^\top \Omega \left[ L(x^{(b)}=1,\textbf{x}_0) - L(x^{(b)}=0,\textbf{x}_0) \right] \\&= L_{FD_b}(\textbf{x}_0)^\top \Omega L_{FD_b}(\mathbf {x_0}). \end{aligned}$$
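For a binary regressor, the first difference estimate and its conditional variance therefore require only two evaluations of the smoother vector. The sketch below reuses the hypothetical gkrls_smoother function and data from the earlier sketches; the evaluation point and the index b are purely illustrative.
```python
def gkrls_first_difference(X, y, x0, Omega, sigma_sq, lam2, b):
    # L_{FD_b}(x0) = L(x^(b)=1, x0) - L(x^(b)=0, x0)
    x1 = x0.copy(); x1[b] = 1.0          # counterfactual with x^(b) = 1
    x0b = x0.copy(); x0b[b] = 0.0        # counterfactual with x^(b) = 0
    L_fd = (gkrls_smoother(X, x1, Omega, sigma_sq, lam2)
            - gkrls_smoother(X, x0b, Omega, sigma_sq, lam2))
    return L_fd @ y, L_fd @ Omega @ L_fd  # point estimate and conditional variance

fd_hat, fd_var = gkrls_first_difference(X, y, x0, Omega, sigma_sq=2.0, lam2=0.1, b=1)
```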

H A random effects model for airline data used in Sect. 7

Consider the following random effects model for an airline cost function:
$$\begin{aligned} Y_{it}=m(X_{it})+\alpha _i +\varepsilon _{it}, \end{aligned}$$
where \(Y_{it} = \log C_{it}\), \(X_{it} = (\log Q_{it}, \log P_{it})^{\top }\), \(\alpha _i\) is the firm specific effect, and \(\varepsilon _{it}\) is the idiosyncratic error term. In this empirical setting, we assume
$$\begin{aligned}{} & {} \mathbb {E}[\varepsilon _{it}|\textbf{X}]=0\\{} & {} \mathbb {E}[\varepsilon _{it}^2|\textbf{X}]=\sigma ^2_ {\varepsilon _{i}}\\{} & {} \mathbb {E}[\alpha _i|\textbf{X}]=0\\{} & {} \mathbb {E}[\alpha _i^2|\textbf{X}]=\sigma ^2_{\alpha _i}\\{} & {} \mathbb {E}[\varepsilon _{it}\alpha _j|\textbf{X}]=0\, \text { for all }\, i, t, j\\{} & {} \mathbb {E}[\varepsilon _{it}\varepsilon _{js}|\textbf{X}]=0\, \text { if}\, t\ne s\, \text {or}\, i\ne j\\{} & {} \mathbb {E}[\alpha _i\alpha _j|\textbf{X}]=0\, \text { if}\, i\ne j \end{aligned}$$
Consider the composite error term \(U_{it}\equiv \alpha _i+\varepsilon _{it}\). Then, the model with the composite error term is
$$\begin{aligned} Y_{it}=m(X_{it})+U_{it} \end{aligned}$$
Note that the independent variables are strictly exogenous; the regressors are mean independent of each error term and therefore of the composite error term:
$$\begin{aligned} \mathbb {E}[U_{it}|\textbf{X}]&= \mathbb {E}[\alpha _i|\textbf{X}] + \mathbb {E}[\varepsilon _{it}|\textbf{X}]\\&=0. \end{aligned}$$
In this framework, we allow for the errors to be heteroskedastic and correlated across time. The variance of the composite error term is
$$\begin{aligned} \mathbb {E}[U_{it}^2|\textbf{X}]&=\mathbb {E}[\alpha _i^2|\textbf{X}] + \mathbb {E}[\varepsilon _{it}^2|\textbf{X}] + 2\mathbb {E}[\alpha _i\varepsilon _{it}|\textbf{X}]\\&=\sigma ^2_{\alpha _i}+\sigma ^2_{\varepsilon _{i}}, \end{aligned}$$
where \(\mathbb {E}[\alpha _i\varepsilon _{it}|\textbf{X}]=0\) by assumption. The covariance of the composite errors is
$$\begin{aligned} \mathbb {E}[U_{it}U_{is}|\textbf{X}]&= \mathbb {E}[(\alpha _i + \varepsilon _{it}) (\alpha _i + \varepsilon _{is})|\textbf{X}]\\&= \mathbb {E}[\alpha _i^2|\textbf{X}]\\&=\sigma ^2_{\alpha _i} \;\text {for } t\ne s \end{aligned}$$
and
$$\begin{aligned} \mathbb {E}[U_{it}U_{js}|\textbf{X}]&= \mathbb {E}[(\alpha _i+\varepsilon _{it}) (\alpha _j+\varepsilon _{js})|\textbf{X}]\\&=0 \;\text {for all}\, t \,\text {and}\, s\, \text {if}\, i\ne j. \end{aligned}$$
Therefore, this framework allows for heteroskedasticity across firms and for correlation across time, and this correlation can be firm specific.
Define the \(T\times 1\) vector of errors for firm i as \(\textbf{u}_i=(u_{i1},\ldots ,u_{iT})^\top \), \(i=1,\ldots ,n\), where we stack the errors over time for each firm. Then define the \(T\times T\) error covariance matrix for each firm, \(\Sigma _i\), as
$$\begin{aligned} \Sigma _i&=\mathbb {E}[\textbf{u}_i\textbf{u}_i^\top |\textbf{X}]\\&=\sigma ^2_{\alpha _i}\varvec{\iota }_T\varvec{\iota }_T^\top + \sigma ^2_{\varepsilon _{i}}\textbf{I}_T\\&= \begin{pmatrix} \sigma ^2_{\alpha _i} + \sigma ^2_{\varepsilon _{i}} &{} \sigma ^2_{\alpha _i} &{} \ldots &{} \sigma ^2_{\alpha _i}\\ \sigma ^2_{\alpha _i} &{} \sigma ^2_{\alpha _i} + \sigma ^2_{\varepsilon _{i}} &{} \ddots &{} \vdots \\ \vdots &{} \ddots &{} \ddots &{} \sigma ^2_{\alpha _i}\\ \sigma ^2_{\alpha _i} &{} \ldots &{}\sigma ^2_{\alpha _i} &{} \sigma ^2_{\alpha _i} + \sigma ^2_{\varepsilon _{i}} \end{pmatrix}. \end{aligned}$$
Therefore, the \(nT\times nT\) error covariance matrix \(\Omega \) is block diagonal as
$$\begin{aligned} \Omega&= {\text {diag}}(\Sigma _1,\ldots , \Sigma _n)\\&= \begin{pmatrix} \Sigma _1 &{} \varvec{0} &{} \ldots &{} \varvec{0}\\ \varvec{0} &{} \Sigma _2 &{} \ddots &{} \vdots \\ \vdots &{} \ddots &{} \ddots &{} \varvec{0}\\ \varvec{0} &{} \ldots &{} \varvec{0} &{} \Sigma _n \end{pmatrix} \end{aligned}$$
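For illustration, a minimal sketch (with hypothetical variance components) of assembling this block-diagonal covariance:
```python
import numpy as np
from scipy.linalg import block_diag

def re_covariance(sigma_alpha2, sigma_eps2, T):
    # Sigma_i = sigma_alpha_i^2 * 1 1' + sigma_eps_i^2 * I_T for each firm i,
    # stacked into the nT x nT block-diagonal matrix Omega.
    blocks = [sa * np.ones((T, T)) + se * np.eye(T)
              for sa, se in zip(sigma_alpha2, sigma_eps2)]
    return block_diag(*blocks)

# Hypothetical components for n = 3 firms observed over T = 4 periods (12 x 12 Omega)
Omega_re = re_covariance(sigma_alpha2=[0.4, 0.1, 0.3],
                         sigma_eps2=[0.2, 0.5, 0.25], T=4)
```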
To estimate the random effects model of airline cost by GKRLS, we first follow step 1 of the two step procedure outlined in Sect. 2. To obtain a consistent estimate of the error covariance matrix \(\Omega \), we estimate the error variances using the residuals from the first step as
$$\begin{aligned} \widehat{\sigma }^2_{U_i} = \frac{1}{T} \widehat{\textbf{u}}_i^\top \widehat{\textbf{u}}_i\\ \widehat{\sigma }^2_{\alpha _i} = \frac{1}{T(T-1)/2} \sum _{t=1}^{T-1} \sum _{s=t+1}^{T} \widehat{u}_{it}\widehat{u}_{is}\\ \widehat{\sigma }^2_{\varepsilon _{i}} = \widehat{\sigma }^2_{U_i} - \widehat{\sigma }^2_{\alpha _i}. \end{aligned}$$
Since time averages are used to estimate the variances, by the law of large numbers \(\widehat{\sigma }^2_{\alpha _i}\) and \(\widehat{\sigma }^2_{\varepsilon _{i}}\) are consistent estimators of \({\sigma }^2_{\alpha _i}\) and \({\sigma }^2_{\varepsilon _{i}}\). Then, using these estimates of the error covariance, we follow step 2 of the two step procedure to obtain GKRLS estimates of the cost function.
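A minimal sketch of this first step, assuming the first-step KRLS residuals are arranged in an \(n\times T\) array with rows indexing firms and columns indexing time; the resulting components could then be passed to a block-diagonal construction such as the one sketched above to form \(\widehat{\Omega }\) for the second step.
```python
import numpy as np

def re_variance_components(u_hat):
    # u_hat: n x T array of first-step residuals, one row per firm
    n, T = u_hat.shape
    sigma_U2 = (u_hat ** 2).mean(axis=1)                     # (1/T) u_i' u_i
    # sum over t < s of u_it * u_is equals ((sum_t u_it)^2 - sum_t u_it^2) / 2
    cross_sum = (u_hat.sum(axis=1) ** 2 - (u_hat ** 2).sum(axis=1)) / 2.0
    sigma_alpha2 = cross_sum / (T * (T - 1) / 2.0)
    sigma_eps2 = sigma_U2 - sigma_alpha2
    return sigma_U2, sigma_alpha2, sigma_eps2

# Hypothetical residuals: firm effect plus idiosyncratic noise
rng = np.random.default_rng(1)
u_hat = rng.normal(size=(6, 15)) + rng.normal(scale=0.7, size=(6, 1))
sU2, sA2, sE2 = re_variance_components(u_hat)
```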
In order to apply the asymptotic results established in Sect. 4, we must have \(nT\rightarrow \infty \); consistency and asymptotic normality of the GKRLS estimator then hold under the random effects model discussed in Sect. 7. In addition, since time averages are used to estimate the variances, we must also have \(T\rightarrow \infty \) so that the law of large numbers delivers consistent estimates of \({\sigma }^2_{\alpha _i}\) and \({\sigma }^2_{\varepsilon _{i}}\). Since \(T\rightarrow \infty \) implies \(nT\rightarrow \infty \), Theorems 3 and 4 apply, and the GKRLS estimator is consistent and asymptotically normal.
Footnotes
1
\(\widehat{\Omega }\) can be thought of as a working covariance matrix since the parametric functional form may be subject to misspecification. One method to avoid misspecification is to estimate \(\Omega \) nonparametrically. For example, under heteroskedasticity, one can estimate \(\Omega \) by a semiparametric KRLS estimator of the conditional variance (Dang and Ullah 2022). Other solutions may be explored as future work.
 
2
The hyperparameters of the LP, RF, and SVM estimators are chosen by their default methods in their respective R packages.
 
3
The following R packages were used for conducting simulations: Borchers (2021), Hyndman and Khandakar (2008), McLeod et al. (2007), Boos and Nychka (2022), Hayfield and Racine (2008), Liaw and Wiener (2002), and Meyer et al. (2022).
 
4
The derivatives are not reported for LP, RF, and SVM since derivative estimation for the RF and SVM methods is uncommon. Derivative estimates for LP can be obtained, but in this simulation the GKRLS estimator is superior with respect to MSE.
 
5
The data for the application is from Greene (2018) and can be downloaded at https://pages.stern.nyu.edu/~wgreene/Text/Edition7/tablelist8new.htm.
 
6
The R package by Callaway (2022) was used to obtain the bootstrap samples.
 
7
The R package by Croissant and Millo (2008) was used to obtain the Random Effects GLS estimator.
 
8
For the LP estimator, cross validation is used to select the hyperparameters. The local constant estimator is used, although one can use the local linear estimator, which gives results similar to those of the local constant.
 
9
We follow a proof similar to that for the case of dependent, identically distributed observations provided by White (2001).
 
Literature
Aigner D, Lovell C, Schmidt P (1977) Formulation and estimation of stochastic frontier production function models. J Econ 6(1):21–37
Boos DD, Nychka D (2022) Rlab: functions and datasets required for ST370 class. R package version 4.0
Borchers HW (2021) pracma: practical numerical math functions. R package version 2.3.3
Callaway B (2022) BMisc: miscellaneous functions for panel data, quantiles, and printing results. R package version 1.4.5
Greene W (2018) Econometric analysis. Pearson. ISBN 9780134461366
Guilkey DK, Schmidt P (1973) Estimation of seemingly unrelated regressions with vector autoregressive errors. J Am Stat Assoc 68(343):642–647
Hainmueller J, Hazlett C (2014) Kernel regularized least squares: reducing misspecification bias with a flexible and interpretable machine learning approach. Polit Anal 22(2):143–168
Hyndman RJ, Khandakar Y (2008) Automatic time series forecasting: the forecast package for R. J Stat Softw 26(3):1–22
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
McLeod AI, Yu H, Krougly Z (2007) Algorithms for linear time series analysis: with R package. J Stat Softw 23(5):1
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2022) e1071: misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.7-12
Schmidt P (1976a) Econometrics. Marcel Dekker Inc, New York
Schmidt P (1976b) On the statistical estimation of parametric frontier production functions. Rev Econ Stat 58(2):238–239
Schmidt P, Witte AD (1984) An economic analysis of crime and justice. Academic Press, New York
Schmidt P, Witte AD (1988) Predicting recidivism using survival models. Springer-Verlag, New York
White H (2001) Asymptotic theory for econometricians. Economic theory, econometrics, and mathematical economics. Emerald Group Publishing Limited. ISBN 9780127466521