Consistency and asymptotic distribution of the Theil–Sen estimator

https://doi.org/10.1016/j.jspi.2007.06.036

Abstract

In this paper, we obtain the strong consistency and asymptotic distribution of the Theil–Sen estimator in simple linear regression models with arbitrary error distributions. We show that the Theil–Sen estimator is super-efficient when the error distribution is discontinuous and that its asymptotic distribution may or may not be normal when the error distribution is continuous. We give an example in which the Theil–Sen estimator is not asymptotically normal. A small simulation study is conducted to confirm the super-efficiency and the non-normality of the asymptotic distribution.

Introduction

We consider a simple linear regression model
$$Y_i = \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1.1}$$
where the $x_i$ are non-identical constants and the $\varepsilon_i$ are independent and identically distributed (iid) random errors with an unknown cumulative distribution function (cdf) $F$. A well-known robust estimator of the slope $\beta$ is the Theil–Sen estimator that was first proposed by Theil (1950) and then extended by Sen (1968). More precisely, we define
$$B_n = \left\{ b_{ij} : b_{ij} = \frac{Y_j - Y_i}{x_j - x_i}, \ \text{if } x_i \neq x_j, \ 1 \le i < j \le n \right\}. \tag{1.2}$$
Then the Theil–Sen estimator $\tilde{\beta}_n$ is defined as the median of all slopes in $B_n$, $\tilde{\beta}_n = \operatorname{med}(B_n)$, where "med" stands for "median".
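The definition of the estimator as the median of the pairwise slopes in (1.2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `theil_sen_slope` and the toy data are assumptions, and `statistics.median` averages the two middle values when the number of slopes is even, which is one possible convention for "med".

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of the pairwise slopes (Y_j - Y_i) / (x_j - x_i) with x_i != x_j."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]  # pairs with tied covariates are left out of B_n
    ]
    if not slopes:
        raise ValueError("need at least two distinct x values")
    return median(slopes)


# Toy data (assumption): true slope 2, last response is a gross outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 100.0]
print(theil_sen_slope(x, y))
```

The brute-force enumeration visits all n(n-1)/2 pairs, which is fine for small samples and keeps the sketch close to the definition.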

We deliberately leave out the intercept in model (1.1). Nonetheless, our model covers the linear regression model with an unknown intercept,
$$Y_i = \alpha + \beta x_i + \epsilon_i, \quad i = 1, \ldots, n,$$
by simply letting $\varepsilon_i = \alpha + \epsilon_i$, where $\epsilon_i$ is an error satisfying certain identifiability conditions. Our formulation in model (1.1) does not impose any assumptions on the error. The intercept $\alpha$ can be estimated, for example, by the median of $\{Y_i - \tilde{\beta}_n x_i : i = 1, \ldots, n\}$ under the identifiability condition that the error has a unique median. For the regression model with a zero intercept, a more robust estimator of the slope $\beta$ is the median of the slopes of the lines joining the origin with all observations $(x_i, Y_i)$. This is the least absolute deviation estimator of the slope, which has a bounded influence function and a large-sample high breakdown point of 0.5. Since our principal focus in this paper is the asymptotic behavior of the slope estimator $\tilde{\beta}_n$ for the general regression model (1.1), we do not pursue intercept estimation or regression through the origin any further.
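The two auxiliary estimators mentioned in this paragraph are both plain medians and can be sketched as follows. The function names `intercept_from_slope` and `origin_slope` are hypothetical, the slope estimate is passed in as an argument rather than recomputed, and skipping observations with $x_i = 0$ in the through-origin estimator is an assumption made only to avoid division by zero.

```python
from statistics import median


def intercept_from_slope(x, y, slope):
    """Median of the residuals Y_i - slope * x_i, for a given slope estimate."""
    return median(yi - slope * xi for xi, yi in zip(x, y))


def origin_slope(x, y):
    """Median of the slopes Y_i / x_i of the lines joining the origin to (x_i, Y_i)."""
    return median(yi / xi for xi, yi in zip(x, y) if xi != 0)


# Toy data (assumption): Y = 1 + 2x without noise.
print(intercept_from_slope([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 2.0))
# Toy data (assumption): Y = 2x with one corrupted response.
print(origin_slope([1.0, 2.0, 4.0], [2.0, 4.0, 100.0]))
```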

The Theil–Sen estimator is robust with a high breakdown point of about 0.293 and also has a bounded influence function. It compares favorably with the ordinary least squares estimator in small-sample efficiency (Wilcox, 1998) and is competitive in terms of mean squared error with alternative slope estimators (Dietz, 1987).
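To make the robustness claim concrete, the sketch below compares the Theil–Sen slope with the ordinary least squares slope on a small contaminated sample. Everything here is an illustrative assumption: the function names, the true slope 3, and the choice to corrupt 2 of 10 responses (20%, which is below the breakdown point of about 0.293 quoted above).

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of pairwise slopes over pairs with distinct x values."""
    return median(
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    )


def ols_slope(x, y):
    """Least squares slope for a line with intercept: S_xy / S_xx."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


x = [float(i) for i in range(1, 11)]
y = [3.0 * xi for xi in x]   # clean responses on the line Y = 3x
y[8] = y[9] = 1000.0         # corrupt the last two responses

print("Theil-Sen:", theil_sen_slope(x, y))
print("OLS:      ", ols_slope(x, y))
```

In this toy sample most pairwise slopes still come from uncorrupted pairs, so the median is unaffected, while the least squares slope is pulled far from 3.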

The univariate Theil–Sen estimator has numerous multivariate extensions. Oja and Niinimaa (1984) generalized the Theil–Sen estimator to multiple regression models using pseudo-observations and Oja's (1983) median. Oja's median is a special spatial median. For the asymptotic properties of spatial medians, see Arcones et al. (1994) and Bose (1998). Zhou and Serfling (2007) gave another natural extension of the Theil–Sen estimator based on multivariate spatial U-quantiles. It would be interesting to establish some of the properties of the univariate Theil–Sen estimator for its multivariate spatial extensions in multiple regression models, including semi-parametric generalized linear models, partially linear models and single index models. We pursue this matter in a separate study.

In reviewing the asymptotic results on the Theil–Sen estimator in the literature, we found that further study of this classical estimator was worthwhile. For instance, the consistency of the estimator has, to our knowledge, not been studied thus far. Sen (1968) investigated the asymptotic normality of the estimator only for an absolutely continuous cdf F. However, as we point out in this paper, there is a gap in Sen's proof: Sen used a theorem from Hoeffding (1948), but his set-up does not satisfy the assumptions of the theorem.

In this paper we establish the strong consistency and asymptotic distribution of the Theil–Sen estimator for a general error distribution F (i.e., the cdf F of the error ε is arbitrary, and may be either discontinuous or continuous). To our surprise, the Theil–Sen estimator turns out to be super-efficient for discontinuous error distributions (see Section 2). We also obtain a general theorem on the asymptotic distribution (Theorem 3) when the error distribution is continuous (not necessarily absolutely continuous). The asymptotic normality claimed by Sen (1968) follows as a special case. We find that the Theil–Sen estimator is not asymptotically normal in general, though it does converge in distribution. We also give the conditions under which it is asymptotically normal (Remark 2 in Section 3). We provide two sets of conditions under which the general theorem holds. These conditions are easy to verify and are satisfied by most common distributions; furthermore, they enable us to obtain an explicit formula for the scaling constant. Under these conditions, we show that the asymptotic distribution is normal when the cdf F is absolutely continuous and may not be normal when the cdf F is not absolutely continuous. An example is given in which the Theil–Sen estimator has a non-normal asymptotic distribution. We conduct a small simulation study that confirms the super-efficiency and the asymptotic non-normality.
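A toy simulation can give some intuition for the super-efficiency under a discontinuous error distribution. It is not the simulation design of the paper: the error laws, the sample size, the covariates $x_i = i$, the true slope 2, and the helper names are all assumptions chosen so that the arithmetic is exact in floating point. With errors on {-1, 0, 1}, a positive fraction of pairs has equal errors, and for those pairs the pairwise slope equals the true slope exactly; the sketch records how often the median of all slopes hits the true slope exactly.

```python
import random
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of pairwise slopes over pairs with distinct x values."""
    return median(
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    )


def exact_hit_rate(draw_error, beta=2.0, n=50, reps=100, seed=0):
    """Fraction of replications in which the estimate equals beta exactly."""
    rng = random.Random(seed)
    x = [float(i) for i in range(1, n + 1)]  # distinct, equally spaced covariates
    hits = 0
    for _ in range(reps):
        y = [beta * xi + draw_error(rng) for xi in x]
        if theil_sen_slope(x, y) == beta:
            hits += 1
    return hits / reps


# Discontinuous errors: uniform on {-1, 0, 1} (assumption).
discrete = exact_hit_rate(lambda rng: float(rng.choice((-1, 0, 1))))
# Continuous errors: standard normal (assumption).
continuous = exact_hit_rate(lambda rng: rng.gauss(0.0, 1.0))

print("exact hits, discrete errors:  ", discrete)
print("exact hits, continuous errors:", continuous)
```

An estimator that equals the true slope exactly in most replications has an error that vanishes faster than any fixed rate, which is the kind of behavior the term super-efficiency refers to; with continuous errors an exact hit is essentially impossible.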

The Theil–Sen estimator has been widely acknowledged in several popular textbooks on non-parametric statistics and robust regression; see, e.g., Sprent (1993), Hollander and Wolfe (1973, 1999), and Rousseeuw and Leroy (2003). It has also been extensively studied in the literature. Sen (1968) and Wilcox (1998) investigated its asymptotic efficiency relative to the least squares estimator. Akritas et al. (1995) applied it to astronomy and Fernandes and Leblanc (2005) to remote sensing. Wang (2005) studied its asymptotic properties for model (1.1) with a random covariate. Many of its extensions can be found in the literature, for example, to censored data; for details, see, e.g., Akritas et al. (1995), Jones (1997), and Mount and Netanyahu (2001).

The rest of this paper is organized as follows: In Section 2, we investigate consistency. In Section 3, we address asymptotic normality, present an example and conduct a small simulation study. In Section 4, we prove Theorem 3. Some technical details are given in the appendix.

Strong consistency

In this section, we establish the strong consistency of the Theil–Sen estimator for both discontinuous and continuous error distributions. We start by introducing a general lemma.

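First, for each $r > 0$, we divide the slope set $B_n$ defined in (1.2) into two subsets $B_{n,r}^{+}$ and $B_{n,r}^{-}$:
$$B_{n,r}^{+} = \{ b_{ij} \in B_n : b_{ij} > \beta + 1/r \}, \qquad B_{n,r}^{-} = \{ b_{ij} \in B_n : b_{ij} < \beta - 1/r \},$$
and define $N_{n,r}^{+} = \#(B_{n,r}^{+})$ and $N_{n,r}^{-} = \#(B_{n,r}^{-})$ as the cardinalities of $B_{n,r}^{+}$ and $B_{n,r}^{-}$, respectively.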

Under model (1.1), we can write $b_{ij} = \beta + e_{ij}$, where $e_{ij} = (\varepsilon_i - \varepsilon_j)/(x_i - x_j)$ for $x_i \neq x_j$.
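The counts of slopes lying more than 1/r above or below the true slope, and the decomposition of each pairwise slope into the true slope plus an error ratio, can be checked numerically on a tiny example. This is a sketch under stated assumptions: the helper names `pairwise_slopes` and `tail_counts`, the true slope 2, and the error vector are made up for illustration. An elementary observation connects the counts to the estimator: if both counts are below half the number of slopes, the median of the slopes lies within 1/r of the true slope.

```python
from itertools import combinations
from statistics import median


def pairwise_slopes(x, y):
    """All slopes b_ij = (Y_j - Y_i) / (x_j - x_i) with i < j and x_i != x_j."""
    return [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]


def tail_counts(slopes, beta, r):
    """Numbers of slopes above beta + 1/r and below beta - 1/r."""
    n_plus = sum(1 for b in slopes if b > beta + 1.0 / r)
    n_minus = sum(1 for b in slopes if b < beta - 1.0 / r)
    return n_plus, n_minus


beta = 2.0                      # true slope (assumption)
x = [1.0, 2.0, 3.0]
eps = [0.0, 1.0, 0.0]           # made-up errors
y = [beta * xi + ei for xi, ei in zip(x, eps)]

slopes = pairwise_slopes(x, y)
# Error ratios e_ij = (eps_i - eps_j) / (x_i - x_j), in the same pair order.
ratios = [(eps[i] - eps[j]) / (x[i] - x[j]) for i, j in combinations(range(3), 2)]
shifted = [beta + e for e in ratios]

n_plus, n_minus = tail_counts(slopes, beta, r=2.0)
print(slopes, shifted)
print(n_plus, n_minus, median(slopes))
```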

Asymptotic distribution

In this section, we study the asymptotic distribution of the Theil–Sen estimator for both discontinuous and continuous error cdf F. For discontinuous cdf F, we show that the Theil–Sen estimator is super-efficient; for continuous cdf F, we establish a general theorem on the asymptotic distribution. We provide two sets of sufficient conditions under which the general theorem holds and which also permit explicit formulas for the scaling constants. We also present a sufficient condition for non-normal

Proof of Theorem 3

In the proofs of Theorems 1 and 2, we constructed several U-statistics to establish the strong consistency of the Theil–Sen estimator $\tilde{\beta}$. One significant feature of our U-statistics is that the kernels vary with the sample size n, which presents some technical challenges. These U-statistics are different from Kendall's tau, which was used by Sen (1968). Furthermore, when the covariates $x_1, \ldots, x_n$ are non-identical non-random constants, neither our U-statistics nor Kendall's tau

Acknowledgments

This research is supported in part by Grant T32MH-014235 from the National Institute of Health. The authors would like to express their sincere thanks to an Associate Editor and the referees for helpful suggestions and comments that improved the presentation.

References (24)

  • Hoeffding, W., 1963. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc.
  • Hollander, M., Wolfe, D.A., 1973. Nonparametric Statistical Methods, first ed., Wiley, New York, pp....