
2003 | Book

Developments in Robust Statistics

International Conference on Robust Statistics 2001

Edited by: Professor Dr. Rudolf Dutter, Professor Dr. Peter Filzmoser, Professor Dr. Ursula Gather, Professor Dr. Peter J. Rousseeuw

Publisher: Physica-Verlag HD


About this Book

Aspects of Robust Statistics are important in many areas. Based on the International Conference on Robust Statistics 2001 (ICORS 2001) in Vorau, Austria, this volume discusses future directions of the discipline, bringing together leading scientists, experienced researchers, practitioners, and younger researchers. The papers cover many different aspects of Robust Statistics. For instance, the fundamental problem of data summary (weights of evidence) is considered and its robustness properties are studied. Further theoretical subjects include robust methods for skewness, time series, longitudinal data, multivariate methods, and tests. Some papers deal with computational aspects and algorithms, and contributions on applications and programming tools complete the volume.

Table of Contents

Frontmatter
Robust Time Series Estimation via Weighted Likelihood

In this paper we introduce a method for efficient and robust estimation of the unknown parameters of an autoregressive-moving average model based on weighted likelihood. Two types of outliers, i.e. additive and innovation, are taken into account without knowing their number, position or intensity. A new procedure is used to classify the outliers and to bound the impact of additive outliers in order to improve the breakdown point of the method. Two examples and a Monte Carlo simulation are presented.

C. Agostinelli
An Exchange Algorithm for Computing the Least Quartile Difference Estimator

We propose an exchange algorithm (EA) for computing the least quartile difference estimate in a multiple linear regression model. Empirical results suggest that the EA is faster and more accurate than the usual p-subset algorithm.

J. Agulló
Selected Algorithms for Robust M- and L-Regression Estimators

The main objective of this survey paper is to discuss the numerical aspects of robust estimation in the linear model. Due to the space available we concentrate on M- and L-estimators, both nonrecursive and recursive. The emphasis is on numerical algorithms and computational efficiency, not on their statistical properties. While the main interest is in convex ϱ-functions generating M-estimators, it is pointed out that for non-convex ϱ-functions one can run into serious trouble and that recursion offers little help in finding the optimal solution.

J. Antoch, H. Ekblom
A Simple Test to Identify Good Solutions of Redescending M Estimating Equations for Regression

Since recent interest has been in considering not just one solution but all possible solutions of the redescending M estimating equations, in order to identify possible multiple structure in a data set (Morgenthaler, 1990; Meer and Tyler, 1998), the focus of this paper is redescending M estimators for regression. We use multiple local minima for finding particular structures, such as lines, in the data set by associating them with the local minima of a redescending M estimation problem for regression. A simple qualitative measure is constructed to assess the importance of each local minimum point, and hence the importance of each fit to the data. This testing procedure allows us to distinguish between good and bad fits to the data.

O. Arslan
Algorithms to Compute CM- and S-Estimates for Regression

Constrained M-estimators for regression were introduced by Mendes and Tyler (1995) as an alternative class of robust regression estimators with high breakdown point and high asymptotic efficiency. To compute the CM-estimate, the global minimum of an objective function with an inequality constraint has to be localized. To find the S-estimate for the same problem, we instead restrict ourselves to the boundary of the feasible region. The algorithm presented for computing CM-estimates can easily be modified to compute S-estimates as well. Testing is carried out with a comparison to the algorithm SURREAL by Ruppert (1992).

O. Arslan, O. Edlund, H. Ekblom
Quantile Models and Estimators for Data Analysis

Quantile regression is used to estimate the cross-sectional relationship between high school characteristics and student achievement as measured by ACT scores. The importance of school characteristics for student achievement has traditionally been framed in terms of the effect on the expected value. With quantile regression the impact of school characteristics is allowed to differ at the mean and at quantiles of the conditional distribution. Like robust estimation, the quantile approach detects relationships missed by traditional data analysis. Robust estimates detect the influence of the bulk of the data, whereas quantile estimates detect the influence of covariates on alternate parts of the conditional distribution. Since our design consists of multiple responses (individual student ACT scores) at fixed explanatory variables (school characteristics), the quantile model can be estimated by the usual regression quantiles, but also by a regression on the empirical quantile at each school. This is similar to least squares, where the estimate based on the entire data is identical to weighted least squares on the school averages. Unlike least squares, however, the regression through the quantiles produces a different estimate than the regression quantiles.

G. W. Bassett Jr., M. -Y. S. Tam, K. Knight
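As background for the abstract above: the τ-th sample quantile is the minimizer of the Koenker-Bassett "check" loss, the building block of regression quantiles. The sketch below only illustrates this characterization on a univariate sample; the function names are ours, not from the paper.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check (pinball) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (u < 0))

def quantile_by_check_loss(x, tau):
    """Minimize the empirical check loss over candidate values taken from the data;
    some minimizer is always attained at a data point."""
    x = np.asarray(x, dtype=float)
    losses = [check_loss(x - c, tau).sum() for c in x]
    return x[int(np.argmin(losses))]
```

For regression quantiles, the same loss is minimized over residuals y - Xβ instead of x - c.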
Estimation in the Generalized Poisson Model via Robust Testing

An estimation method is presented which compromises robust efficiency with computational feasibility in the case of the generalized Poisson model. The formal setup is built on flexible nonparametric extensions of the underlying model. The estimation efficiency is expressed via minimax properties of tests resulting from expansions of estimators. The non-parametric neighborhoods related to the proposed score function are exemplified and a real data case is analysed. The resulting method balances several qualitative features of statistical inference: strong differentiability (asymptotic derivations are more accurate), efficiency and natural model extension (quality of formal basic assumptions).

T. Bednarski
A Comparison of Some New Measures of Skewness

Asymmetry of a univariate continuous distribution is commonly described as skewness. The well-known classical skewness coefficient is based on the first three moments of the data set, and hence it is strongly affected by the presence of one or more outliers. In this paper we propose several new measures of skewness which are more robust against outlying values. Their properties are compared using both real and simulated data.

G. Brys, M. Hubert, A. Struyf
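The contrast drawn in the abstract above can be seen in a few lines: the classical coefficient uses third powers of deviations, so a single outlier dominates it, while a quantile-based measure such as Bowley's quartile skewness (one robust alternative of the kind being compared; the paper's own proposals differ) is bounded. A minimal sketch, with our own function names:

```python
import numpy as np

def moment_skewness(x):
    """Classical third-moment skewness coefficient."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

def quartile_skewness(x):
    """Bowley's quartile skewness: built from quartiles only, hence bounded
    and resistant to a small fraction of outliers."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)
```

On the symmetric sample 1,…,19 both measures are zero; appending the outlier 1000 drives the moment coefficient above 1 while the quartile measure is unchanged.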
Robust Inference Based on Quasi-likelihoods for Generalized Linear Models and Longitudinal Data

In this paper we introduce and develop robust versions of quasi-likelihood functions for model selection via an analysis-of-deviance type of procedure in generalized linear models and longitudinal data analysis. These robust functions are built upon natural classes of robust estimators and can be seen as weighted versions of their classical counterparts. The asymptotic theory of these test statistics is studied and their robustness properties are assessed for both generalized linear models and longitudinal data analysis. The proposed class of test statistics yields reliable inference even under model contamination. The analysis of a real data set completes the article.

E. Cantoni
Robust Tools in SAS

In this article, I introduce robust routines and a procedure in SAS. The routines are in SAS/IML, acting as function calls. They are LTS, LMS, MCD, MVE, LAV, and MAD. These routines have been released in SAS/IML V8.2 or in previous versions. The SAS/STAT procedure, ROBUSTREG, is experimental.

C. Chen
Robustness Issues Regarding Content-corrected Tolerance Limits

This article reviews the content-corrected method for tolerance limits proposed by Fernholz and Gillespie (2001) and addresses some robustness issues that affect the length of the corresponding interval as well as the corrected content value. The content-corrected method for k-factor tolerance limits consists of obtaining a bootstrap corrected value p* that is robust in the sense of preserving the confidence coefficient for a variety of distributions. We propose several location/scale robust alternatives to obtain robust corrected-content k-factor tolerance limits that produce shorter intervals when outlying observations are present. We analyze the Hadamard differentiability to ensure bootstrap consistency for large samples. We define the breakdown point for the particular statistic to be bootstrapped, and we obtain the influence function and the value of the breakdown point for the traditional and the robust statistics. Two examples showing the advantage of using these robust alternatives are also included.

L. T. Fernholz
Breakdown-Point for Spatially and Temporally Correlated Observations

In this paper, we implement a new definition of breakdown in both finite and asymptotic samples with correlated observations arising from spatial statistics and time series. In such situations, existing definitions typically fail because parameters can sometimes break down to zero, i.e. the center of the parameter space. The reason is that these definitions center around defining an explicit critical region for either the parameter or the objective function: if, for a particular outlier constellation, the critical region is entered, breakdown is said to occur. In contrast to the traditional approach, we use a definition that leaves the critical region implicit but still encompasses all previous definitions of breakdown in linear and nonlinear regression settings. We provide examples involving simultaneously specified spatial autoregressive models, as well as autoregressions from time series, for illustration. In particular, we show that in this context the least median of squares estimator has a breakdown point much lower than the familiar 50%.

M. G. Genton
On Marginal Estimation in a Semiparametric Model for Longitudinal Data with Time-independent Covariates

We consider M-estimators for a class of semiparametric mixed-effect models without time-dependent covariates and show that the simple marginal estimation method is generally better than the same M-estimator applied to the de-correlated response based on a known or estimated covariance matrix for each subject.

X. He, M.-O Kim
Robust PCA for High-dimensional Data

Principal component analysis (PCA) is a well-known technique for dimension reduction. Classical PCA is based on the empirical mean and covariance matrix of the data, and hence is strongly affected by outlying observations. Therefore, there is a huge need for robust PCA. When the original number of variables is small enough, and in particular smaller than the number of observations, it is known that one can apply a robust estimator of multivariate location and scatter and compute the eigenvectors of the scatter matrix. The other situation, where there are many variables (often even more variables than observations), has received less attention in the robustness literature. We will compare two robust methods for this situation. The first one is based on projection pursuit (Li and Chen, 1985; Rousseeuw and Croux, 1993; Croux and Ruiz-Gazen, 1996, 2000; Hubert et al., 2002). The second method is a new proposal, which combines the notion of outlyingness (Stahel, 1981; Donoho, 1982) with the FAST-MCD algorithm (Rousseeuw and Van Driessen, 1999). The performance and the robustness of these two methods are compared through a simulation study. We also illustrate the new method on a chemometrical data set.

M. Hubert, P. J. Rousseeuw, S. Verboven
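The projection-pursuit idea behind the first method above can be sketched as follows: search for the direction that maximizes a robust scale of the projected data, restricting candidates to directions through the center and each observation (in the spirit of Croux and Ruiz-Gazen, 1996). This is a simplified illustration with our own names, not the authors' implementation:

```python
import numpy as np

def pp_first_component(X):
    """First robust principal direction by projection pursuit (sketch):
    among directions through the coordinatewise median and each data point,
    return the one maximizing the MAD of the projected data."""
    X = np.asarray(X, dtype=float)
    center = np.median(X, axis=0)
    best, best_scale = None, -np.inf
    for xi in X:
        d = xi - center
        norm = np.linalg.norm(d)
        if norm == 0:
            continue  # the center itself defines no direction
        d = d / norm
        proj = X @ d
        scale = np.median(np.abs(proj - np.median(proj)))  # MAD of projections
        if scale > best_scale:
            best, best_scale = d, scale
    return best
```

On data spread along the x-axis with one large y-outlier, the MAD-based search recovers the x-direction, whereas classical PCA would be pulled toward the outlier.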
Robustness Analysis in Forecasting of Time Series

The problems of statistical forecasting of time series under distortions of traditional hypothetical models are considered. The following distorted models of time series are used: trend models under “outliers” and functional distortions, regression models under “outliers” and “errors-in-regressors”, autoregressive time series with parameter specification errors and non-homogeneous innovations. Robustness characteristics based on the mean square risk of forecasting are introduced and evaluated for these cases. In addition, new robust forecasting procedures are presented.

Y. Kharin
Lift-zonoid and Multivariate Depths

Tukey (1975) proposed the halfspace depth concept as a geometrical tool to handle measures. However, only recently (Koshevoy, 1999b; Struyf and Rousseeuw, 1999) was it shown that, for the class of atomic measures, this depth determines the measure. Here we extend this characterization result to the class of absolutely continuous measures for which the function exp(<p, x>) is integrable for every p in R^d. Three facts play a key role in proving this characterization. First, the Tukey median has depth at least 1/(k + 1) for any k-variate distribution. Second, if two measures μ and ν have the same Tukey depth, then the restrictions of these measures to any trimmed region are measures with identical Tukey depths. Third, there is a relation between the Tukey depth of a measure with compact support and certain projections of its lift-zonoid; this relation allows us to use the support theorem for the Radon transform. We also show that, for the class of measures with full-dimensional convex hull of the support, the Oja depth determines the measure. The proof of this result is based on a relation between the Oja depth and projections of the lift-zonoid, which allows us to use another result of integral geometry: the uniqueness theorem of Alexandrov (1937).

G. A. Koshevoy
Asymptotic Distributions of Some Scale Estimators in Nonlinear Models

Often in the robust analysis of regression and time series models there is a need for a robust estimator of the scale parameter of the errors. One frequently used scale estimator is the median of the absolute residuals, s1. It is of interest to know its limiting distribution and consistency rate. Its limiting distribution generally depends on the estimator of the regression and/or autoregressive parameter vector unless the errors are symmetrically distributed around zero. To overcome this difficulty it is natural to use the median of the absolute differences of pairwise residuals, s2, as a scale estimator. This paper derives the asymptotic distributions of these two estimators for a large class of nonlinear regression and autoregressive models when the errors are independent and identically distributed. It is found that the asymptotic distribution of a suitably standardized s2 is free of the initial estimator of the regression/autoregressive parameters. A similar conclusion also holds for s1 in linear regression models through the origin and with centered designs, and in linear autoregressive models with zero mean errors. This paper also investigates the limiting distributions of these estimators in nonlinear regression models with long memory moving average errors. An interesting finding is that if the errors are symmetric around zero, then not only is the limiting distribution of a suitably standardized s1 free of the regression estimator, but it is degenerate at zero. On the other hand, a similarly standardized s2 converges in distribution to a normal distribution, regardless of whether the errors are symmetric. One clear conclusion is that under symmetry of the long memory moving average errors, the rate of consistency of s1 is faster than that of s2.

H. L. Koul
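The two estimators studied above are simple to state; here is a direct sketch (O(n²) for s2), with our own function names:

```python
import numpy as np
from itertools import combinations

def s1(residuals):
    """Median of the absolute residuals."""
    return float(np.median(np.abs(residuals)))

def s2(residuals):
    """Median of the absolute pairwise residual differences."""
    diffs = [abs(a - b) for a, b in combinations(residuals, 2)]
    return float(np.median(diffs))
```

For residuals [-3, -1, 0, 1, 3], s1 is 1.0 and s2 is 2.5; note that s2, unlike s1, is unchanged if a constant is added to all residuals.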
Robust Nonparametric Regression and Modality

The paper considers the problem of nonparametric regression with emphasis on controlling the number of local extremes and on resistance against patches of outliers. The robust taut string method is introduced and robustness properties are discussed. An automatic procedure is described.

A. Kovac
Computing a High Depth Point in the Plane

Given a set S = {P1,…,Pn} of n points in R^d, the depth δ(Q) of a point Q in R^d is the minimum number of points of S that must be contained in a closed halfspace containing Q. A high depth point is a point whose depth is at least max_i δ(Pi). For dimension d = 2 we give a simple, easily implementable O(n(log n)^2) deterministic algorithm to compute a high depth point, and we give an Ω(n log n) lower bound for this task.

S. Langerman, W. Steiger
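For intuition, the depth of a single point in the plane can be computed by brute force: the minimizing closed halfplane can always be rotated until a data point lies on its boundary, so it suffices to check normals perpendicular to the segments from Q to the data points, nudged slightly to either side. This naive O(n²) sketch (our own code, not the paper's O(n(log n)^2) algorithm) illustrates the definition:

```python
import numpy as np

def halfspace_depth(q, points, eps=1e-6):
    """Tukey halfspace depth of q: the minimum, over closed halfplanes
    containing q, of the number of data points in the halfplane.
    Candidate normals are perpendiculars to q->p_i, nudged by +/-eps so a
    boundary point can fall on either side of the halfplane."""
    q = np.asarray(q, dtype=float)
    P = np.asarray(points, dtype=float) - q
    best = len(P)
    for p in P:
        if np.allclose(p, 0.0):
            continue  # q itself (or a coincident point) defines no direction
        base = np.arctan2(p[1], p[0])
        for ang in (base + np.pi / 2, base - np.pi / 2):
            for a in (ang, ang + eps, ang - eps):
                u = np.array([np.cos(a), np.sin(a)])
                # count points in the closed halfplane with inward normal u
                best = min(best, int(np.sum(P @ u > -1e-12)))
    return best
```

On the corners of a square plus its center, a corner has depth 1 while the center has depth 3.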
Robust Portfolio Optimization

We address the problem of estimating risk-minimizing portfolios from a sample of historical returns, when the underlying distribution that generates returns exhibits departures from the standard Gaussian assumption. Specifically, we examine how the underlying estimation problem is influenced by marginal heavy tails, as modeled by the univariate Student-t distribution, and multivariate tail-dependence, as modeled by the copula of a multivariate Student-t distribution. We show that when such departures from normality are present, robust alternatives to the classical variance portfolio estimator have lower risk.

G. J. Lauprete, A. M. Samarov, R. E. Welsch
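The classical variance portfolio estimator referred to above reduces to plugging a covariance estimate into the global minimum-variance weights; robust alternatives swap in a robust scatter estimate. A minimal sketch of the classical closed form (not the authors' robust estimators):

```python
import numpy as np

def min_variance_weights(cov):
    """Global minimum-variance portfolio weights:
    w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    cov = np.asarray(cov, dtype=float)
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)  # Sigma^{-1} 1 without forming the inverse
    return w / w.sum()
```

With independent assets of variances 1 and 4, the weights are proportional to the inverse variances: (0.8, 0.2).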
BootQC: Bootstrap for Robust Analysis of Aviation Safety Data
Statistical Quality Control by Bootstrap

Effective regulation of air traffic and safety requires a comprehensive aviation safety analysis plan. Most of the statistical analysis methods prescribed in current aviation safety databases are parametric in nature, and thus their applicability is rather restricted. A nonparametric alternative based on bootstrap methods is pursued in this paper. By assigning proper false alarm rates to the bootstrap control charts proposed in Liu and Tang (1996), a meaningful threshold system is developed for the purpose of regulating and monitoring aviation safety. The threshold system can serve as a set of standards for evaluating the performance of aviation entities, and provide guidelines for identifying unexpected performance and assigning appropriate corrective measures. Both the bootstrap control charts and threshold systems are demonstrated in an analysis of surveillance results collected by the Federal Aviation Administration (FAA) from several air carriers. The demonstration uses the software BootQC developed in Liu and Teng (2000).

R. Y. Liu
Optimal Weights of Evidence with Bounded Influence

A fundamental statistical problem is to indicate which of two hypotheses is better supported by the data. Statistics designed for this purpose are called weights of evidence. In this paper we study the problem of robust weights of evidence, optimal in their performance while robust in the infinitesimal sense of the influence function.

S. Morgenthaler, R. Staudte
Robust Estimators for Estimating Discontinuous Functions

We study the asymptotic behavior of a wide class of kernel estimators for estimating an unknown regression function. In particular we derive the asymptotic behavior at discontinuity points of the regression function. It turns out that some kernel estimators based on outlier robust estimators are consistent at jumps.

C. H. Müller
Breakdown Point and Computation of Trimmed Likelihood Estimators in Generalized Linear Models

We review studies concerning the finite sample breakdown point (BP) of the trimmed likelihood (TL) and related estimators based on the d-fullness technique of Vandev (1993) and Vandev and Neykov (1998). In particular, the BP of these estimators in the framework of generalized linear models (GLMs) depends on the trimming proportion and the quantity N(X) introduced by Müller (1995). A faster iterative algorithm based on resampling techniques for computing the TLE is developed. Examples of real and artificial data in the context of grouped logistic and log-linear regression models are used to illustrate the properties of the TLE.

N. M. Neykov, C. H. Müller
Comparison of Three Methods for Robust Redundancy Analysis

Robust methods are very useful in multivariate statistical analysis since outliers and departures from the usual model assumptions are very frequent in multivariate data. Three robust estimation methods for the parameters of the redundancy analysis model (based on a robust correlation matrix, partial least squares, and projection pursuit) are presented. Artificial data from a simulation study designed to compare the methods are used to assess their performance. The methods based on a robust correlation matrix and on the projection pursuit procedure show better results.

M. R. Oliveira, J. A. Branco
A Test for Normality Based on Robust Regression Residuals

We investigate the effects of replacing OLS residuals with residuals from robust regression in test statistics for the normality of the errors. We have found that this can lead to substantially improved ability to detect lack of normality in suitable situations. We derive the asymptotic distribution of the robustified normality test as chi-squared with 2 degrees of freedom under the null hypothesis of normality of the error terms. The high breakdown property of the test statistic is discussed. Using simulations, we have found that situations where a small subpopulation exhibits characteristics different from those of the main population are the ones ideally suited to the use of robustified normality tests. We employ several real data sets from the literature to show that such situations arise frequently in practice.

A. Ö. Önder, A. Zaman
Tests on Fractional Cointegration
Comparison of a Finite M- and ML-test on Fractional Cointegration

Cointegration describes the pattern that pairs of time series move together in the long run, although they may diverge in the short run. A generalisation of this behaviour is fractional cointegration. Two statistical tests, the M-test and the ML-test, are formulated for fractional cointegration in different situations. It turns out that the robust M-test attains almost the same power as the maximum likelihood test under certain assumptions. In contrast, the power of the M-test is much higher than that of the ML-test if the examined time series is contaminated according to the general replacement model.

A. Peters, P. Sibbertsen
Robust Linear Discriminant Analysis and the Projection Pursuit Approach
Practical Aspects

This paper starts with a short review of previous work on robust discriminant analysis with emphasis on the projection pursuit approach. Some theoretical aspects are briefly discussed. The core of the paper deals with practical issues related to the projection pursuit approach which generalizes Fisher’s linear discriminant analysis. The choices of univariate estimators, starting points and maximization procedure are discussed and exemplified. The results of a simulation study are presented.

A. M. Pires
Small Sample Corrections for LTS and MCD

The least trimmed squares estimator and the minimum covariance determinant estimator of Rousseeuw (1984) are frequently used robust estimators of regression and of location and scatter. Consistency factors can be computed for both methods to make the estimators consistent at the normal model. However, for small data sets these factors do not make the estimators unbiased. Based on simulation studies we therefore construct formulas which allow us to compute small sample correction factors for all sample sizes and dimensions without having to carry out any new simulations. We give some examples to illustrate the effect of the correction factor.

G. Pison, S. Van Aelst, G. Willems
Computation of the Multivariate Oja Median

The multivariate Oja median (Oja, 1983) is an affine equivariant multivariate location estimate with high efficiency. This estimate has a bounded influence function but zero breakdown point. The computation of the estimate appears to be highly intensive. We consider different algorithms, exact and stochastic, for calculating the value of the estimate. In the stochastic algorithms, the gradient of the objective function, the rank function, is estimated by sampling observation hyperplanes. The estimated rank function with its estimated accuracy then yields a confidence region for the true sample Oja median, and the confidence region shrinks to the sample median as the number of sampled hyperplanes increases. Regular grids and the grid given by the data points are used in the construction. Computation times of the different algorithms are discussed and compared. For a k-variate data set with n observations our exact and stochastic algorithms have rough time complexity estimates of O(k^2 n^k log n) and O(5^k (1/ε)^2), respectively, where ε is the radius of the confidence L-ball.

T. Ronkainen, H. Oja, P. Orponen
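The objective function whose gradient the stochastic algorithms above estimate is, in the plane, a sum of triangle areas over all pairs of data points. A brute-force sketch for small samples (our own naming; the paper's algorithms avoid this cost per evaluation):

```python
import numpy as np
from itertools import combinations

def oja_objective(theta, points):
    """Oja's objective in the plane: the sum, over all pairs of data points,
    of the area of the triangle formed by the pair and theta."""
    theta = np.asarray(theta, dtype=float)
    total = 0.0
    for a, b in combinations(np.asarray(points, dtype=float), 2):
        u, v = a - theta, b - theta
        total += 0.5 * abs(u[0] * v[1] - u[1] * v[0])  # area via cross product
    return total
```

For the corners of a square, the objective at the center (4.0) is smaller than at a corner (6.0), consistent with the center being the Oja median by symmetry.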
Robust Estimation in the Linear Structural Relation Model: A Study on Tuning Constants

The parameters of a linear structural relation are unidentifiable when the errors of the model are normally distributed. For this reason the estimation of the parameters usually requires additional information, and for each type of information different families of estimators have been proposed; in general, however, they are not robust. Bounded influence estimators have been derived for the cases where the error variance ratio is known and where instrumental variables are used. Both types of estimators depend on tuning constants, which must be chosen. Fixing the cut-off value a priori, without knowing the population distribution, has the advantage of simplifying the process. However, it can introduce variability in the efficiency of the estimator, since efficiency depends on the true underlying distribution and on the form of the estimator. The need for objective criteria for the choice of the constant has therefore motivated several suggestions intended for specific models. For estimation with instrumental variables, Branco and Souto de Miranda (2000) suggest a method based on the influence function of the classical estimator of the relation parameters. In this study the case of known error variance ratio is considered, and the criterion based on the influence function is adapted to the corresponding estimators.

M. M. Souto de Miranda
Control Charts for the Median and Interquartile Range

This paper advocates the use of a new type of nonparametric control charts for the median and interquartile range based on jackknifed histograms.

A. J. Stromberg, W. Griffith, M. Smith
Unbiasedness in Least Quantile Regression

We develop an abstract notion of regression which allows for a nonparametric formulation of unbiasedness. We then prove that least quantile regression is unbiased in this sense, even in the heteroscedastic case, if the error distribution has a continuous, symmetric, and unimodal density. An example shows that unbiasedness may break down even for smooth and symmetric but not unimodal error distributions. We compare these results to those for least MAD and least squares regression.

D. Tasche
Tests of Independence Based on Sign and Rank Covariances

In this paper three different concepts of bivariate sign and rank are considered: marginal sign and rank, spatial sign and rank, and affine equivariant sign and rank. The aim is to see whether these different sign and rank covariances can be used to construct tests for the hypothesis of independence. In some cases (spatial sign, affine equivariant sign and rank) an additional assumption on the symmetry of the marginal distributions is needed. Limiting distributions of the test statistics under the null hypothesis as well as under interesting sequences of contiguous alternatives are derived. Asymptotic relative efficiencies with respect to the regular correlation test are calculated and compared. Finally the theory is illustrated by a simple example.

S. Taskinen, A. Kankainen, H. Oja
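The simplest of the three concepts above, the marginal sign covariance, is the quadrant statistic: the average product of the signs of the componentwise deviations from the marginal medians. A minimal sketch (our own naming):

```python
import numpy as np

def sign_covariance(x, y):
    """Marginal sign (quadrant) covariance: mean product of the signs of
    deviations from the marginal medians."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.sign(x - np.median(x)) * np.sign(y - np.median(y))))
```

For a perfectly monotone sample of 9 points the statistic is 8/9 (the median observation contributes zero), and -8/9 under a perfectly decreasing relationship.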
Java and Computing for Robust Statistics

Recent advances in Java technology have provoked increasing interest in using Java as a programming language for the development of numerical software. Given its attractive language features, it is worth investigating the possibilities of writing software for robust computing in Java. In this paper the design and implementation of an object-oriented library, called JMCD, for computing high breakdown point robust multivariate location and covariance matrix estimates is presented, and its performance is compared to the corresponding FORTRAN implementations. The FAST-MCD algorithm of Rousseeuw and Van Driessen (1999) written in Java can be competitive with FORTRAN (on PC/Windows platforms). The Java implementation is on average 30% slower, with its efficiency depending on the size of the problem, but this can be accepted given the other advantages of the language.

V. Todorov
A Robust Hotelling Test

Hotelling’s T^2 statistic is an important tool for inference about the center of a multivariate normal population. However, hypothesis tests and confidence intervals based on this statistic can be adversely affected by outliers. Therefore, we construct an alternative inference technique based on a statistic which uses the highly robust MCD estimator (Rousseeuw, 1984) instead of the classical mean and covariance matrix. Recently, a fast algorithm was constructed to compute the MCD (Rousseeuw and Van Driessen, 1999). In our test statistic we use the reweighted MCD, which has a higher efficiency. The distribution of this new statistic differs from the classical one. Therefore, the key problem is to find a good approximation for this distribution. Similarly to the classical T^2 distribution, we obtain a multiple of a certain F-distribution. A Monte Carlo study shows that this distribution is an accurate approximation of the true distribution. Finally, the power and the robustness of the one-sample test based on our robust T^2 are investigated through simulation.

G. Willems, G. Pison, P. J. Rousseeuw, S. Van Aelst
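For reference, the classical one-sample statistic that the paper above robustifies is T^2 = n (x̄ - μ0)' S^{-1} (x̄ - μ0); the robust version replaces x̄ and S by the reweighted MCD location and scatter (not shown here). A sketch of the classical statistic only:

```python
import numpy as np

def hotelling_t2(X, mu0):
    """Classical one-sample Hotelling T^2: n (xbar - mu0)' S^{-1} (xbar - mu0),
    with S the sample covariance matrix (denominator n - 1)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    d = X.mean(axis=0) - np.asarray(mu0, dtype=float)
    S = np.cov(X, rowvar=False)
    return float(n * d @ np.linalg.solve(S, d))
```

For four points at (±2, 0) and (0, ±2) tested against μ0 = (1, 1), the sample mean is the origin, S is (8/3)I, and T^2 = 4 · 2 · 3/8 = 3.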
Metadata
Title
Developments in Robust Statistics
Edited by
Professor Dr. Rudolf Dutter
Professor Dr. Peter Filzmoser
Professor Dr. Ursula Gather
Professor Dr. Peter J. Rousseeuw
Copyright Year
2003
Publisher
Physica-Verlag HD
Electronic ISBN
978-3-642-57338-5
Print ISBN
978-3-642-63241-9
DOI
https://doi.org/10.1007/978-3-642-57338-5