Classical and robust/resistant procedures for the estimation of population parameters and the identification of multiple outliers in univariate and multivariate populations are reviewed. The successful identification of anomalous observations depends on the statistical procedures employed. Commercial industries, local communities, and government agencies such as the United States Environmental Protection Agency (U.S. EPA) often need to assess the extent of contamination at polluted sites. Identifying contaminants that have potentially adverse effects on human health is especially important in ecological and environmental applications. An environmental scientist typically generates and analyzes large amounts of multidimensional data. Such practitioners often need to identify experimental conditions and results that look suspicious and differ significantly from the rest of the data. The classical Mahalanobis distance (MD) and its variants (e.g., multivariate kurtosis) are routinely used to identify these anomalies. These test statistics depend upon the estimates of population location and scale. The presence of anomalous observations usually results in distorted and unreliable maximum likelihood estimates (MLEs) and ordinary least-squares (OLS) estimates of the population parameters. These in turn result in deflated and distorted classical MDs and lead to masking effects. Consequently, statistical tests and inferences based upon these classical estimates may be misleading. For example, in an environmental monitoring application, a classification procedure based upon the distorted estimates may classify a contaminated sample as coming from the clean population and a clean sample as coming from the contaminated part of the site.
This in turn can lead to incorrect remediation decisions. It is well established among practitioners that, for the identification of multiple outliers, one should use robust procedures with a high breakdown point. The estimates obtained using the robust procedures should be in close agreement with the corresponding classical OLS estimates and MLEs when no discordant observations (from different population(s)) are present. Robust procedures for the identification of outliers and the estimation of population parameters of location and scale typically use an influence function. The robust procedure based upon a recently developed “proposed” influence function, called the PROP function, works quite effectively in estimating population parameters accurately and in correctly identifying multiple outliers in univariate and multivariate populations. The control-chart-type quantile-quantile (Q-Q) graphical display of multivariate data combines the effect of a formal test procedure and an informal graphical display into one powerful multiple outlier identification procedure. The scatter plot of the robustified square root leverage distances versus the residuals identifies all regression outliers and distinguishes between significant and insignificant leverage points. The procedures discussed here unmask multiple anomalies and provide reliable estimates of the population parameters in several areas of interest, including linear regression models, discriminant and principal component analyses, and variogram modeling in geostatistical applications. The U.S. EPA, through the Office of Research and Development (ORD), has research interests in optimizing its quality assurance program by developing statistical procedures that are insensitive to outliers (resistant) and to the underlying assumptions (robust).
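The masking effect described above can be illustrated with a minimal sketch in Python. It is not the PROP-based procedure of the chapter: as a stand-in it uses a simple hard-rejection reweighting that starts from the coordinate-wise median and MAD. The synthetic data, the function names `mahalanobis_sq` and `reweighted_estimates`, the 20% contamination level, and the 0.975 chi-square cutoff are all assumptions made for illustration only.

```python
import numpy as np

def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of each row of X from `center`."""
    diff = X - center
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

def reweighted_estimates(X, cutoff, n_iter=20):
    """Illustrative robust fit (hypothetical helper, not the PROP function).

    Starts from the coordinate-wise median and MAD, then repeatedly
    re-estimates the mean and covariance from the points whose squared
    distance is within `cutoff` (hard-rejection weights of 0 or 1).
    """
    center = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - center), axis=0)
    cov = np.diag(mad ** 2)
    for _ in range(n_iter):
        keep = mahalanobis_sq(X, center, cov) <= cutoff
        center = X[keep].mean(axis=0)
        cov = np.cov(X[keep], rowvar=False)
    return center, cov

# Synthetic bivariate data: 40 "clean" points plus a tight cluster of
# 10 discordant points (assumed values, chosen only to show masking).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(40, 2))
outliers = rng.normal(8.0, 0.3, size=(10, 2))
X = np.vstack([clean, outliers])
is_outlier = np.arange(len(X)) >= len(clean)

# For 2 degrees of freedom the chi-square quantile has the closed form
# -2 * ln(1 - q); here q = 0.975.
cutoff = -2.0 * np.log(1.0 - 0.975)

# Classical MDs use the sample mean and covariance of all observations.
classical_d2 = mahalanobis_sq(X, X.mean(axis=0), np.cov(X, rowvar=False))

# Robustified MDs use the reweighted estimates instead.
robust_center, robust_cov = reweighted_estimates(X, cutoff)
robust_d2 = mahalanobis_sq(X, robust_center, robust_cov)

classical_flags = classical_d2 > cutoff
robust_flags = robust_d2 > cutoff
print("planted outliers flagged (classical):", int(classical_flags[is_outlier].sum()))
print("planted outliers flagged (robust):   ", int(robust_flags[is_outlier].sum()))
```

In this sketch the outlying cluster inflates the classical covariance estimate enough that its own distances fall below the cutoff, while the distances computed from the reweighted estimates separate the cluster clearly. A smooth redescending influence function, such as the one the chapter proposes, would replace the 0/1 weights used here.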
- Robust Procedures for the Identification of Multiple Outliers
John M. Nocerino
- Springer Berlin Heidelberg