Consistency and asymptotic distribution of the Theil–Sen estimator
Introduction
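We consider a simple linear regression model

$$Y_i = \beta x_i + \epsilon_i, \qquad i = 1, \ldots, n, \tag{1.1}$$

where $x_1, \ldots, x_n$ are non-identical constants and $\epsilon_1, \ldots, \epsilon_n$ are independent and identically distributed (iid) random errors with an unknown cumulative distribution function (cdf) $F$. A well-known robust estimator of the slope $\beta$ is the Theil–Sen estimator, first proposed by Theil (1950) and later extended by Sen (1968). More precisely, we define the set of pairwise slopes

$$S_n = \left\{ \frac{Y_j - Y_i}{x_j - x_i} : x_i \neq x_j,\ 1 \le i < j \le n \right\}. \tag{1.2}$$

Then the Theil–Sen estimator is defined as the median of all slopes in $S_n$, $\hat\beta_n = \operatorname{med} S_n$, where “med” stands for “median”.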
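The definition above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the paper: the function names `pairwise_slopes` and `theil_sen_slope` are hypothetical, and the sketch assumes the estimator is the median of $(Y_j - Y_i)/(x_j - x_i)$ over pairs $i < j$ with distinct covariate values.

```python
from itertools import combinations
from statistics import median


def pairwise_slopes(x, y):
    """Slopes (y_j - y_i) / (x_j - x_i) over pairs i < j with x_i != x_j."""
    return [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]  # pairs with tied covariates have no defined slope
    ]


def theil_sen_slope(x, y):
    """Median of the pairwise slopes (hypothetical helper name)."""
    slopes = pairwise_slopes(x, y)
    if not slopes:
        raise ValueError("need at least two distinct x values")
    return median(slopes)


# Toy data: roughly y = 2x, with a gross outlier in the last observation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 30.0]
slope = theil_sen_slope(x, y)
```

The brute-force enumeration costs on the order of $n^2$ slope evaluations, which is fine for illustration but not for large samples.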
We deliberately leave out the intercept in model (1.1). Nonetheless, our model covers the linear regression model with an unknown intercept by simply letting $\epsilon_i = \alpha + e_i$, where $\epsilon_i$ is the error in model (1.1), $\alpha$ is the intercept, and $e_i$ is an error satisfying certain identifiability conditions. Our formulation in model (1.1) does not impose any assumptions on the error. The intercept can be estimated, for example, by the median of the residuals $Y_i - \hat\beta_n x_i$, where $\hat\beta_n$ denotes the Theil–Sen slope estimate, under the identifiability condition that the error has a unique median. For the regression model with a zero intercept, a more robust estimator of the slope is the median of the slopes of the lines joining the origin with all observations $(x_i, Y_i)$. This is the least absolute deviation estimator of the slope, which has a bounded influence function and a large-sample high breakdown point of 50%. Since our principal focus in this paper is the asymptotic behavior of the slope estimator for the general regression model (1.1), we do not pursue intercept estimation or regression through the origin any further.
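The two auxiliary estimators mentioned in this paragraph can be sketched as follows. The helper names (`median_intercept`, `origin_slope`) and the toy data are assumptions made for illustration only; the sketch takes the intercept estimate to be the median of the residuals after removing the fitted slope, and the through-origin slope to be the median of the ratios of response to covariate.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of pairwise slopes over pairs with distinct x values."""
    return median(
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    )


def median_intercept(x, y, slope):
    """Median of the residuals y_i - slope * x_i (hypothetical helper)."""
    return median(yi - slope * xi for xi, yi in zip(x, y))


def origin_slope(x, y):
    """Median of y_i / x_i, i.e. slopes of lines from the origin to each point."""
    return median(yi / xi for xi, yi in zip(x, y) if xi != 0)


# Data on the line y = 3 + 2x, except for one outlying response.
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 40]
slope = theil_sen_slope(x, y)
intercept = median_intercept(x, y, slope)

# Data on the line y = 2x through the origin, again with one outlier.
y0 = [2, 4, 6, 8, 50]
slope0 = origin_slope(x, y0)
```

In both toy data sets a single contaminated response leaves the median-based estimates on the underlying line.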
The Theil–Sen estimator is robust, with a high breakdown point of about 29.3%, and also has a bounded influence function. It compares favorably with the ordinary least squares estimator in small-sample efficiency (Wilcox, 1998) and is competitive in terms of mean squared error with alternative slope estimators (Dietz, 1987).
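To make the robustness claim concrete, the following sketch contaminates 20% of the responses and compares the Theil–Sen slope with the ordinary least squares slope. The sample size, noise level, contamination pattern, and seed are all assumptions chosen for illustration; they are not taken from the paper or from the cited comparisons.

```python
import random
from itertools import combinations
from statistics import mean, median


def theil_sen_slope(x, y):
    """Median of pairwise slopes over pairs with distinct x values."""
    return median(
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    )


def ols_slope(x, y):
    """Least squares slope for a line with intercept."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible
beta = 1.5              # assumed true slope
x = [float(i) for i in range(50)]
y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]

# Replace the 10 responses at the largest x values by a gross outlier value.
for i in range(40, 50):
    y[i] = -100.0

ts_error = abs(theil_sen_slope(x, y) - beta)
ols_error = abs(ols_slope(x, y) - beta)
```

With 10 of 50 points contaminated, about 36% of the pairwise slopes involve an outlier, which is still below one half, so the median of slopes stays among the uncontaminated pairs while the least squares slope is pulled far from the assumed true value.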
The univariate Theil–Sen estimator has numerous multivariate extensions. Oja and Niinimaa (1984) generalized the Theil–Sen estimator to multiple regression models using pseudo-observations and Oja's (1983) median. Oja's median is a special spatial median. For the asymptotic properties of spatial medians, see Arcones et al. (1994) and Bose (1998). Zhou and Serfling (2007) gave another natural extension of the Theil–Sen estimator based on multivariate spatial U-quantiles. It would be of interest to establish some of the properties of the univariate Theil–Sen estimator for its multivariate spatial extensions in multiple regression models, including semi-parametric generalized linear models, partially linear models and single index models. We pursue this matter in a separate study.
In reviewing the asymptotic results on the Theil–Sen estimator in the literature, we found that further study of this classical estimator was worthwhile. For instance, the consistency of the estimator has, to our knowledge, not been studied thus far. Sen (1968) investigated the asymptotic normality of the estimator only for absolutely continuous $F$. However, as we point out in this paper, there is a gap in Sen's proof. Sen used a theorem from Hoeffding (1948), but his set-up does not satisfy the assumptions of the theorem.
In this paper we establish the strong consistency and asymptotic distribution of the Theil–Sen estimator for a general error distribution $F$ (i.e., the cdf of the error is arbitrary, thus including both discontinuous and continuous ones). To our surprise, the Theil–Sen estimator turns out to be super-efficient for discontinuous error distributions (see Section 2). We also obtain a general theorem on the asymptotic distribution (Theorem 3) when the error distribution is continuous (not necessarily absolutely continuous). The asymptotic normality claimed by Sen (1968) follows as a special case. We find that the Theil–Sen estimator is not asymptotically normal in general, though it does converge in distribution. We also give the conditions under which it is asymptotically normal (Remark 2 in Section 3). We provide two sets of conditions under which the general theorem holds. These conditions are easy to verify and are satisfied by most common distributions; furthermore, they enable us to obtain an explicit formula for the scaling constant. Under these conditions, we show that the asymptotic distribution is normal when the cdf $F$ is absolutely continuous and may fail to be normal when $F$ is not absolutely continuous. An example is given in which the Theil–Sen estimator has a non-normal asymptotic distribution. We conduct a small simulation study that confirms the super-efficiency and the asymptotic non-normality.
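The super-efficiency phenomenon for discontinuous errors can be illustrated with a small simulation. This is only a sketch under assumptions made here, not the simulation design of the paper: errors are drawn uniformly from {-1, 0, 1}, the design points are the integers 1 to n, and the slope is an integer, so that every pair with equal errors yields a slope exactly equal to the true value in floating-point arithmetic. The function name `exact_hit_rate` is hypothetical.

```python
import random
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of pairwise slopes over pairs with distinct x values."""
    return median(
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    )


def exact_hit_rate(n=50, reps=200, beta=2, seed=0):
    """Fraction of replicates in which the estimate equals beta exactly."""
    rng = random.Random(seed)
    x = list(range(1, n + 1))
    hits = 0
    for _ in range(reps):
        # Discrete errors: ties e_i == e_j occur with positive probability,
        # and every tied pair produces a slope exactly equal to beta.
        y = [beta * xi + rng.choice((-1, 0, 1)) for xi in x]
        if theil_sen_slope(x, y) == beta:
            hits += 1
    return hits / reps


rate = exact_hit_rate()
```

Because roughly a third of the pairwise slopes coincide with the true slope in this setting, the median lands on that atom in nearly every replicate, so the estimation error is exactly zero rather than merely small.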
The Theil–Sen estimator has been widely acknowledged in several popular textbooks on non-parametric statistics and robust regression. See, e.g., Sprent (1993), Hollander and Wolfe (1973, 1999), and Rousseeuw and Leroy (2003). It has also been extensively studied in the literature. Sen (1968) and Wilcox (1998) investigated its asymptotic relative efficiency with respect to the least squares estimator. Akritas et al. (1995) applied it to astronomy and Fernandes and Leblanc (2005) to remote sensing. Wang (2005) studied its asymptotic properties for model (1.1) with a random covariate. Many of its extensions can be found in the literature, for example, to censored data; for details, see, e.g., Akritas et al. (1995), Jones (1997), and Mount and Netanyahu (2001).
The rest of this paper is organized as follows: In Section 2, we investigate consistency. In Section 3, we address asymptotic normality, present an example and conduct a small simulation study. In Section 4, we prove Theorem 3. Some technical details are given in the appendix.
Strong consistency
In this section, we establish the strong consistency of the Theil–Sen estimator for both discontinuous and continuous error distributions. We start by introducing a general lemma.
First, for each , we divide the slope set defined in (1.2) into two subsets and : and define and as the cardinalities of and , respectively.
Under model (1.1), we can write , .
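As a worked step, the pairwise slopes can be decomposed explicitly. The notation here is an assumption: model (1.1) is taken in its no-intercept form $Y_i = \beta x_i + \epsilon_i$, and slopes are formed over pairs with $x_i \neq x_j$.

```latex
\frac{Y_j - Y_i}{x_j - x_i}
  = \frac{\beta (x_j - x_i) + (\epsilon_j - \epsilon_i)}{x_j - x_i}
  = \beta + \frac{\epsilon_j - \epsilon_i}{x_j - x_i},
  \qquad x_i \neq x_j .
```

Under this form, the estimation error of the median of slopes is the median of the ratios $(\epsilon_j - \epsilon_i)/(x_j - x_i)$. In particular, any pair with $\epsilon_j = \epsilon_i$ contributes a slope exactly equal to $\beta$, which happens with positive probability when $F$ has atoms.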
Asymptotic distribution
In this section, we study the asymptotic distribution of the Theil–Sen estimator for both discontinuous and continuous error cdfs $F$. For discontinuous $F$, we show that the Theil–Sen estimator is super-efficient; for continuous $F$, we establish a general theorem on the asymptotic distribution. We provide two sets of sufficient conditions under which the general theorem holds and which also permit explicit formulas for the scaling constants. We also present a sufficient condition for non-normal
Proof of Theorem 3
In the proof of Theorems 1 and 2, we constructed several U-statistics to establish the strong consistency of the Theil–Sen estimator. One significant feature of our U-statistics is that the kernels vary with the sample size n, which presents some technical challenges. These U-statistics are different from Kendall's tau, which was used by Sen (1968). Furthermore, when the covariates are non-identical non-random constants, neither our U-statistics nor Kendall's tau
Acknowledgments
This research is supported in part by Grant T32MH-014235 from the National Institutes of Health. The authors would like to express their sincere thanks to an Associate Editor and the referees for helpful suggestions and comments that improved the presentation.
References (24)
- Fernandes and Leblanc. Parametric (modified least squares) and non-parametric (Theil–Sen) linear regressions for predicting biophysical parameters in the presence of measurement errors. Remote Sensing of Environment (2005)
- Mount and Netanyahu. Efficient randomized algorithms for robust estimation of circular arcs and aligned ellipses. Comput. Geometry: Theory Appl. (2001)
- Oja. Descriptive statistics for multivariate distributions. Statist. Probab. Lett. (1983)
- Wilcox. Simulations on the Theil–Sen regression estimator with right-censored data. Statist. Probab. Lett. (1998)
- Arcones et al. Estimators related to U-processes with applications to multivariate medians: asymptotic normality. Ann. Statist. (1994)
- Akritas et al. The Theil–Sen estimator with doubly censored data and applications to astronomy. J. Amer. Statist. Assoc. (1995)
- Bose. Bahadur representation of $M_m$ estimates. Ann. Statist. (1998)
- Dietz. A comparison of robust estimators in simple linear regression. Comm. Statist., Part B Simulation Comput. (1987)
- Ferguson. A Course in Large Sample Theory (1996)
- Hoeffding. A class of statistics with asymptotically normal distribution. Ann. Math. Statist. (1948)
- Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc.