Stochastic frontier models and methods as pioneered by Peter Schmidt in Aigner et al. (J Econom 6:21–37, 1977), Horrace and Schmidt (J Product Anal 7:257–282, 1996), and Amsler et al. (J Econom 190:280–288, 2016) constitute a rare departure from the usual econometric obsession with models for conditional means. They also provided an early stimulus for the development of quantile regression methods. After a brief tutorial on Hotelling tube methods for constructing confidence bands for nonparametric quantile regression, strengthened performance guarantees for such bands are described based on recent developments in conformal inference. These methods may be considered a rather idiosyncratic new approach to nonparametric inference for stochastic frontier models.
1 Introduction
One of my indelible memories of Peter Schmidt was a conversation we had in my kitchen at a party for Midwest Econometrics Group participants in 1993 about the uneasy relationship between statistics and econometrics. “If a statistical tree falls in the forest, but no econometrician sees it,” Peter said matter-of-factly, “then it never happened.” In 1939, Harold Hotelling, arguably one of the most eminent statisticians and econometricians of the twentieth century, witnessed such an event and wrote about it in Hotelling (1939). The paper inspired Hermann Weyl to write a highly influential paper, Weyl (1939), generalizing it. Hotelling’s idea has attracted a small coterie of admirers in statistics, but it is fair to say that it remains almost unknown in econometrics.
My quixotic aim in this paper is to rescue Hotelling’s idea from econometric obscurity. I will begin by describing a simple setting in which the idea can be employed to construct a confidence interval for a scalar parameter that enters awkwardly in a standard regression problem. Then, I will describe how it can be used to construct uniform confidence bands for nonparametric regression using penalty methods, and finally I will compare performance with confidence bands constructed with recently developed methods of conformal inference.
Consider the model
$$\begin{aligned} y_i = \alpha + \beta \lambda _i (\tau ) + \varepsilon _i , \quad i = 1, \dots , n, \end{aligned}$$
where \(\alpha , \beta , \tau \) are unknown parameters, \(\lambda _i(\cdot )\) are known functions and \(\varepsilon _i\sim {{\mathcal {N}}}(0, \sigma ^2)\). For the sake of concreteness, we might interpret \(\lambda _i (\tau )\) as a Box-Cox transformation of another covariate, say \((z_i^\tau -1)/\tau \). We would like to test \(H_0: \beta =0\). Under the null, the Box-Cox parameter \(\tau \) is not identified, so we need to consider strategies that properly account for this.1
By the familiar Frisch and Waugh (1933) trickery, we can eliminate the \(\alpha \) effect.2 Redefining the notation and assuming for convenience that \(\sigma ^2 = 1\), we are left with the likelihood ratio statistic
Now \( U =Y/ \Vert Y \Vert \) is uniformly distributed on the sphere \(S^{n-1}\) and \(\gamma (\tau ) = \lambda (\tau )/ \Vert \lambda (\tau ) \Vert \) is a curve in \(S^{n-1}\). Thus, the test rejects when \(W=\sup _\tau \gamma (\tau )^\top U \) exceeds some value \(w=\cos \theta \) which is equivalent to
Note that the original definition of L is such that we reject for small values, so \(L<c\) implies we reject when \(\sup _\tau \gamma (\tau )^\top U > w = \cos \theta \) for some critical value of \(\theta \). This is illustrated in Fig. 1 of Johansen and Johnstone (1990), reproduced here as Fig. 1. They call this the “angular or geodesic radius \(\theta \) about \(\gamma \)”:
So when the distance \(d(u,\gamma )\) is small, U falls inside the tube, and we reject. This may seem a bit counter-intuitive, but is nonetheless correct. There are probably many ways to make it sound more intuitive. Here is one possibility. Since it all boils down to a cosine, that is, the simple correlation between \(\lambda (\tau )\) and Y, we want to reject \(H_0: \beta =0\) if this correlation/cosine is too large, but the Y’s that make it too large are the Y’s that fall inside the tube.
So how do we compute the critical \(w\), or equivalently the critical \(\theta \)? Since \(W>w \equiv \cos \theta \) is equivalent to U being in the tube, we need the volume of the tube. Let \(| \gamma |\) denote the length of the arc \(\gamma (\tau )\) on the sphere. This can be approximated by the finite difference formula,
$$\begin{aligned} | \gamma | = \int \Vert {\dot{\gamma }}(\tau ) \Vert \text {d}\tau \approx \sum _{i=2}^m \Vert \gamma (\tau _i) - \gamma (\tau _{i-1}) \Vert , \end{aligned}$$
and the \(\tau \)’s are chosen on some relatively fine grid of m points. Note that in the finite difference approximation the \(\tau _i - \tau _{i-1}\) that would normally appear in the denominator of the difference quotient inside the norm expression cancels with the contribution of the \(\text {d}\tau \).
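As a concrete sketch, the finite difference approximation can be computed in a few lines; the covariate, the \(\tau \) grid, and the centering step below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Finite difference approximation to the arc length of the Box-Cox curve
#   gamma(tau) = lambda(tau) / ||lambda(tau)||  on the unit sphere:
#   |gamma| ~ sum_i || gamma(tau_i) - gamma(tau_{i-1}) ||.
# The covariate z, the tau grid, and the centering step are illustrative.
rng = np.random.default_rng(0)
z = rng.lognormal(size=50)

def gamma(tau):
    lam = np.log(z) if abs(tau) < 1e-12 else (z**tau - 1) / tau
    lam = lam - lam.mean()            # project out the intercept (Frisch-Waugh)
    return lam / np.linalg.norm(lam)  # a point on the unit sphere

taus = np.linspace(-0.5, 0.5, 201)    # a relatively fine grid of m points
G = np.array([gamma(t) for t in taus])
arc_length = np.sum(np.linalg.norm(np.diff(G, axis=0), axis=1))
```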
Theorem 1
If \(\gamma \) is a non-closed regular curve in \(S^{d-1}\), then for w near 1,
where \(B(1/2, (d-1)/2) \) is a beta random variable. If \(\gamma \) is closed, i.e., forms a closed loop without end points, then the second “cap” term is omitted.
We ignore pathological complications involving self-intersections of the curve \(\gamma \). This follows from a result of Hotelling (1939), as does the next theorem.
Theorem 2
Let \(\gamma \) be a regular closed curve in \(S^{d-1}\) with length \(|\gamma |\). Then
where \(\Omega _{d-2} = \pi ^{(d-2)/2}/\Gamma (d/2)\) is the volume of the unit ball in \(R^{d-2}.\)
Heuristically, the formula is,
$$\begin{aligned} V(\gamma ^\theta ) = (\text {length of tube}) \cdot (\text {volume of unit ball}) \cdot \text {radius}^{d-2} \end{aligned}$$
Recall that the volume of the unit ball in dimension d is \(V=\pi ^{d/2}/\Gamma ((d+2)/2)\). When \(\theta \) is larger, or \(\gamma \) is twisty, then the tube may intersect itself and the formula would need some refinement. Figure 2 is a crude attempt to depict a tube on the 2-sphere; those with enhanced geometric imagination may try to visualize a three-dimensional tube on the 3-sphere embedded in 4-space.
When the curve is not closed, it needs “caps” on each end. These caps are given by
where \(w_{d-2} = 2\pi ^{(d-1)/2} /\Gamma ((d-1)/2)\) is the \((d-2)\)-volume of \(S^{d-2}\). Note that the volume of the sphere, \(V(S^{d-1}) = 2 \pi ^{d/2} /\Gamma (d/2)\), is not the same as the volume of the ball. Note also that \((1-z^2)^{1/2}\) is again the radius and integrating out the \(r^{d-3}\) yields a \(d-2\) dimensional volume. A useful reference for this sort of geometry is Kendall (1961).
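Both constants are easy to check numerically against the familiar low-dimensional cases (the unit disk, the unit ball in \(R^3\), the circle, and the ordinary sphere); a quick sanity check:

```python
import math

# Sanity check of the constants used above:
#   volume of the unit ball in R^d:       pi^(d/2) / Gamma((d+2)/2)
#   (d-1)-volume of the sphere S^(d-1):   2 pi^(d/2) / Gamma(d/2)
def ball_volume(d):
    return math.pi ** (d / 2) / math.gamma((d + 2) / 2)

def sphere_volume(d):
    return 2 * math.pi ** (d / 2) / math.gamma(d / 2)

# d = 2: the unit disk has area pi and the circle has circumference 2*pi;
# d = 3: the unit ball has volume 4*pi/3 and S^2 has area 4*pi.
```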
How do we get from (2) to (1)? Recall that U is uniform on the \((d-1)\) sphere so we need to divide by the volume of that sphere to evaluate the probability of being in the tube, so for closed curves,
It remains to show that \(B^{-1} = w_{d-2}/V(S^{d-1})\), which follows after a little simplification and recalling that \(\Gamma (1/2) = \sqrt{\pi }\).
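Spelling out that simplification, using only \(\Gamma (1/2) = \sqrt{\pi }\) and \(B(a,b) = \Gamma (a)\Gamma (b)/\Gamma (a+b)\):

```latex
\begin{aligned}
\frac{w_{d-2}}{V(S^{d-1})}
  &= \frac{2\pi^{(d-1)/2}/\Gamma((d-1)/2)}{2\pi^{d/2}/\Gamma(d/2)}
   = \frac{\Gamma(d/2)}{\pi^{1/2}\,\Gamma((d-1)/2)} \\
  &= \frac{\Gamma(d/2)}{\Gamma(1/2)\,\Gamma((d-1)/2)}
   = B(1/2,(d-1)/2)^{-1}.
\end{aligned}
```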
To check how the Hotelling tube procedure performs in moderate sample sizes, Table 1 reports results of a small simulation experiment. Data are generated with iid \(x_i\) standard log-normal and
$$\begin{aligned} y_i = \beta _n \lambda _i (\tau ) + \varepsilon _i , \quad \lambda _i (\tau ) = (x_i^\tau - 1)/\tau , \quad \varepsilon _i \sim {{\mathcal {N}}}(0,1). \end{aligned}$$
Three values of \(\tau \) are considered, \(\tau \in \{ -0.5, 0, 0.5 \}\). Local alternatives, \(\beta _n = \beta _0/\sqrt{n}\), are considered with \(\beta _0 \in \{ 0, 1, 2\}\). The nominal level of the Hotelling test is taken to be 0.05, and 1000 replications of the experiment are made for each parametric setting. When \(\beta _0 = 0\), so the null is true, the test delivers quite accurate size for all of the sample sizes considered, and power is respectable when \(\beta _0\) deviates from zero.
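A minimal reimplementation of the size check under the null can be sketched as follows. It assumes the tube tail approximation takes the standard two-term form \(P(W \ge w) \approx (|\gamma |/2\pi )(1-w^2)^{(d-2)/2} + \frac{1}{2} P(B \ge w^2)\) with \(B \sim \text {Beta}(1/2,(d-1)/2)\); the grid, sample size, and replication count are illustrative choices, not those of Table 1.

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

# Monte Carlo size check of the Hotelling tube test under H0: beta = 0.
# Assumed tube tail approximation (an assumption, see the text above):
#   P(W >= w) ~ (|gamma|/(2 pi)) (1 - w^2)^((d-2)/2) + 0.5 P(B >= w^2),
# with B ~ Beta(1/2, (d-1)/2).  Design constants are illustrative.
rng = np.random.default_rng(1)
n, reps, alpha = 50, 500, 0.05
taus = np.linspace(-0.5, 0.5, 101)

def curve(z):
    """Columns gamma(tau_j): centered, normalized Box-Cox transforms of z."""
    L = np.column_stack([np.log(z) if abs(t) < 1e-12 else (z**t - 1) / t
                         for t in taus])
    L = L - L.mean(axis=0)
    return L / np.linalg.norm(L, axis=0)

rejections = 0
for _ in range(reps):
    z = rng.lognormal(size=n)
    G = curve(z)
    arclen = np.sum(np.linalg.norm(np.diff(G, axis=1), axis=0))
    d = n - 1                          # one dimension absorbed by the intercept

    def tail(w):
        main = arclen / (2 * np.pi) * (1 - w**2) ** ((d - 2) / 2)
        cap = 0.5 * beta.sf(w**2, 0.5, (d - 1) / 2)
        return main + cap - alpha

    w_crit = brentq(tail, 1e-6, 1 - 1e-9)
    y = rng.standard_normal(n)         # under H0 the response is pure noise
    u = y - y.mean()
    u = u / np.linalg.norm(u)
    rejections += np.max(G.T @ u) > w_crit

rate = rejections / reps               # should be near the nominal 0.05
```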
3 Uniform confidence bands for nonparametric regression
Consider the nonparametric regression model
$$\begin{aligned} y_i = g(t_i) + \varepsilon _i , \quad i = 1, \dots , n, \end{aligned}$$
with \(\varepsilon _i\sim {{\mathcal {N}}}(0, \sigma ^2)\) as before and \(t\in I \subset {\mathbb {R}}\). Our objective is to find a positive c such that
$$\begin{aligned} {{\mathcal {P}}}\{ {\hat{g}}(t) - c \, {\hat{\sigma }}(t) \le g(t) \le {\hat{g}}(t) + c \, {\hat{\sigma }}(t) \text{ for } \text{ all } t \in I \} \ge 1 - \alpha . \end{aligned}$$
Table 1 Rejection frequencies for the Hotelling likelihood ratio test for a simple Box-Cox example

              β₀ = 0                    β₀ = 1                    β₀ = 2
          τ=−0.5   τ=0   τ=0.5     τ=−0.5   τ=0   τ=0.5     τ=−0.5   τ=0   τ=0.5
n = 20     0.056  0.058  0.049      0.313  0.193  0.182      0.781  0.459  0.380
n = 50     0.049  0.051  0.057      0.275  0.225  0.342      0.639  0.577  0.782
n = 100    0.063  0.048  0.056      0.350  0.261  0.281      0.840  0.637  0.704
n = 500    0.048  0.052  0.055      0.298  0.243  0.288      0.747  0.612  0.735
n = 1000   0.063  0.046  0.047      0.299  0.218  0.250      0.724  0.549  0.667

Tests are nominal level \(\alpha = 0.05\). Local alternatives are employed of the form: \(\beta _n = \beta _0 / \sqrt{n}\)
Now consider \(X\sim {{\mathcal {N}}}(\xi , \Sigma )\), so X plays the role of \({\hat{\beta }}\) and \(\xi \) of \(\beta .\) We’d like to make a confidence statement about \(\{a^\top \xi \, | \, a\in C\}\), where C is some sort of “curve.” So now we write,
So as before, \(\gamma =\gamma (C) \subset S^{d-1}\), and U is uniform on \(S^{d-1}\). R and W depend on \(\xi \) and \(\Sigma \) only through \(\gamma \). \(R^2\) is independent of W and \(R^2\sim \chi _d^2\), so,
This integration may appear somewhat miraculous, but does actually work out provided that one carefully observes the \({{\mathcal {P}}}(R\in \text {d}r)\) term. Since \(R^2\sim \chi _d^2\), letting F denote the distribution function of \(\chi _d^2\), we have,
The components \(g = (g_1, \dots , g_J)\) can be univariate or bivariate. Their smoothness can be controlled by penalizing the total variation of the functions themselves or of their gradients. Estimation is carried out by solving the linear program,
$$\begin{aligned} \min _{(\theta _0, g_1, \dots , g_J)} \sum _{i=1}^n \rho _\tau \Big ( y_i - x_i^\top \theta _0 - \sum _{j=1}^J g_j (z_{ij}) \Big ) + \lambda _0 \Vert \theta _0 \Vert _1 + \sum _{j=1}^J \lambda _j \bigvee (\nabla g_j ), \end{aligned}$$
where \(\rho _\tau (u) = u (\tau - \mathbb {1}(u < 0))\) is the usual quantile objective function, \(\Vert \theta _0 \Vert _1 = \sum _{k=1}^{\scriptscriptstyle K} |\theta _{0k}|\) and \(\bigvee (\nabla g_j)\) denotes the total variation of the derivative or gradient of the function \(g_j\). Recall that for g with absolutely continuous derivative \(g'\) we can express the total variation of \(g':{\mathbb {R}} \rightarrow {\mathbb {R}}\) as
$$\begin{aligned} \bigvee (g') = \int |g''(z)| \, \text {d}z, \end{aligned}$$
while for bivariate components the penalty becomes
$$\begin{aligned} \bigvee (\nabla g) = \int \Vert \nabla ^2 g(z) \Vert \, \text {d}z, \end{aligned}$$
where \(\nabla ^2g(z)\) denotes the Hessian of g and \(\Vert \cdot \Vert \) denotes the Hilbert–Schmidt norm for matrices. In contrast, total variation penalization of the component functions themselves yields piecewise constant solutions.
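For piecewise linear g, in particular, the total variation of \(g'\) reduces to a sum of absolute slope changes at the interior knots. A small numerical illustration, with invented knots and function values:

```python
import numpy as np

# For piecewise linear g interpolating (x_k, g_k), the total variation of g'
# is the sum of absolute changes in slope at the interior knots.
# Knots and values are invented for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 1.0, 3.0])

slopes = np.diff(g) / np.diff(x)              # slopes 1, 0, 2 on the three cells
tv_gprime = np.sum(np.abs(np.diff(slopes)))   # |0 - 1| + |2 - 0| = 3

def rho(u, tau):
    """Quantile (pinball) objective rho_tau(u) = u * (tau - 1(u < 0))."""
    return u * (tau - (u < 0))
```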
Adapting the Hotelling tube idea to construct uniform confidence bands for these components is also described in Koenker (2011), as is selection of the smoothing parameters \(\lambda _j, j = 0, 1, \dots , J\). It should be stressed that all of this machinery relies on the validity of Gaussian approximations for the fitted parameters and estimated functions and is conditional on the selected tuning parameters. This is in accord with a large strand of earlier literature including Wahba (1983), Nychka (1983), and Krivobokova et al. (2010); however, there are inevitable questions that can be raised about both aspects. To explore this, we consider some recent proposals for strengthening coverage guarantees based on conformal inference in the next section.
5 Conformal quantile regression
Conformal prediction, and conformal inference more generally, has grown out of work by Vladimir Vovk and colleagues; see, e.g., Shafer and Vovk (2008) for an overview. It has emerged as an essential tool in uncertainty quantification throughout statistics and machine learning. An essential feature of the conformal inference approach in regression is a sample splitting device that allows one to adjust a confidence band constructed with training data based on its performance on a validation sample. Strong finite sample performance guarantees can be proven based on seemingly rather weak exchangeability assumptions. In regression settings, early work presumed conventional iid error structure when constructing the initial bands from the training data; however, Romano et al. (2019) noted that in more heterogeneous settings narrower bands could be constructed using quantile regression methods. This approach has been further developed in Lei and Candès (2022). In high-dimensional regression, this typically would involve some form of random forest or neural network model for the initial bands, but the same methods can be used in simpler models like the additive models described above.
Construction of conformal prediction bands for additive quantile regression models can be described briefly as follows:
1. Split the sample into a training half and a validation half.
2. Estimate the conditional quantile functions \({\hat{q}}_{lo}\) and \({\hat{q}}_{hi}\) at levels \(\alpha /2\) and \(1 - \alpha /2\) from the training sample.
3. Compute the conformity scores \(E_i = \max \{ {\hat{q}}_{lo}(x_i) - y_i , \; y_i - {\hat{q}}_{hi}(x_i) \}\) on the validation sample.
4. Let Q be the \(\lceil (1-\alpha )(n+1) \rceil \)-th order statistic of the \(E_i\), and report the adjusted band \([{\hat{q}}_{lo}(x) - Q, \; {\hat{q}}_{hi}(x) + Q]\).
Note that the conformal adjustment of the initial band can make it wider or narrower. When \(Q<0\), the validation sample fell well inside the initial band, indicating that it is safe to shrink its width.
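A minimal sketch of this adjustment step, assuming the initial quantile estimates \({\hat{q}}_{lo}\) and \({\hat{q}}_{hi}\) have already been produced from the training half; all of the numbers here are invented for illustration.

```python
import math

# Split-conformal adjustment of an initial quantile regression band.
# Conformity scores on the validation sample:
#   E_i = max(qlo_i - y_i, y_i - qhi_i),
# negative when y_i lies strictly inside the band.  All numbers invented.
def conformal_margin(y, qlo, qhi, alpha=0.1):
    scores = [max(lo - yi, yi - hi) for yi, lo, hi in zip(y, qlo, qhi)]
    k = math.ceil((1 - alpha) * (len(scores) + 1))   # order statistic to use
    return sorted(scores)[k - 1]                     # the margin Q

y   = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
qlo = [0.5, 1.4, 2.3, 3.2, 4.1, 5.0, 5.9, 6.8, 7.7]
qhi = [1.5, 2.6, 3.7, 4.8, 5.9, 7.0, 8.1, 9.2, 10.3]
Q = conformal_margin(y, qlo, qhi, alpha=0.1)
# Adjusted band: [qlo_i - Q, qhi_i + Q]; here Q = -0.5 < 0, so the band shrinks.
```

With these invented numbers every validation response lies well inside the initial band, so the margin Q is negative and the conformal band is narrower than the initial one.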
There are several potential difficulties with the foregoing recipe.
Predictions based on the training sample typically are not equipped to extrapolate beyond the empirical support of the training data, so if the validation data, or new data requiring a conformal interval, lie outside that support some accommodation must be made.
Performance guarantees are based on marginal coverage of the band, so it may happen that in certain regions of design space there may be failures of coverage that are compensated by satisfactory coverage elsewhere. As shown by Foygel Barber et al. (2020), conditional coverage is not achievable in any generality.
All of the familiar challenges of penalty methods for regression smoothing persist, so choice of smoothing parameters, in particular, can cause headaches, even though poor \(\lambda \) selection can in principle be ameliorated by the conformal adjustment.
We conclude this section by illustrating the use of the conformal method in an artificial data example taken from Romano et al. (2019). Simulated data are generated as,
There are 7000 observations plotted in grey. The Poisson contribution to the response produces a banded structure in the scatterplot with pronounced heteroscedasticity. There are a small number of extreme outliers, many of which lie outside the frame of the figure; such outliers are harmless since we are estimating conditional quantile functions. Penalizing the total variation of \(g'\) yields a piecewise linear fit that does not fit the scatter as well as the piecewise constant estimate obtained by penalizing the total variation of g itself. It is striking here that the conformal adjustment in both figures is almost imperceptible. Thus, if interest focuses on prediction intervals for the response, the initial estimates provided by the penalized quantile regression estimates are fine, even though they are based on only half of the original sample.
Prediction bands for Y are fine as far as they go, but what if we wanted confidence bands for the conditional quantile functions? Some might argue, e.g., Geisser (1993) and Clarke and Clarke (2018), that it is pointless to predict quantities that can never be observed, but I subscribe to the principle: every decent estimate deserves a standard error. Figure 5 illustrates confidence bands for the lower, \(\tau = 0.05\), and upper, \(\tau = 0.95\), conditional quantile functions as estimated using penalization of \(g'\). The dark grey bands are the pointwise bands, while the lighter grey bands are those based on the Hotelling tube approach. Note that the bands for the \(\tau = 0.05\) estimate are extremely narrow since the data are very concentrated in this region, so that conditional quantile is very precisely estimated.
6 Discussion
The large literature in econometrics about stochastic frontier models is mostly concerned with parametric models of the tail behavior of the response “near the production frontier.” Nonparametric quantile regression offers yet another perspective on estimating such models. It would be extremely foolish to make any claims for the alternative methodology described here on the basis of the flimsy evidence offered, so let me conclude simply by saying that it might be worthy of further consideration.
Declarations
Conflict of interest
The author has had no funding support or other potential conflict of interest.
Human and animal rights
Nor does the study involve any human or animal participants; nor have any vegetables been harmed in the preparation of this work.
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
There is of course a large literature on such problems, notably: Davies (1977, 1987); Andrews and Ploberger (1994); Hansen (1996), none of whom mention Hotelling. An exception that justifies the qualified “almost unknown” above is Kim et al. (1998). I do not claim that the Hotelling approach is “best” in any sense, only that it is worthy of further consideration. To this end, software to compute the confidence bands described below is available in the R package quantreg, Koenker (1999) for a general class of total variation penalized, additive, nonparametric quantile regression models.