
2004 | Book

Exploring Multivariate Data with the Forward Search

Authors: Anthony C. Atkinson, Marco Riani, Andrea Cerioli

Publisher: Springer New York

Book series: Springer Series in Statistics

About this book

Why We Wrote This Book

This book is about using graphs to explore and model continuous multivariate data. Such data are often modelled using the multivariate normal distribution and, indeed, there is a literature of weighty statistical tomes presenting the mathematical theory of this activity. Our book is very different. Although we use the methods described in these books, we focus on ways of exploring whether the data do indeed have a normal distribution. We emphasize outlier detection, transformations to normality and the detection of clusters and unsuspected influential subsets. We then quantify the effect of these departures from normality on procedures such as discrimination and cluster analysis.

The normal distribution is central to our book because, subject to our exploration of departures, it provides useful models for many sets of data. However, the standard estimates of the parameters, especially the covariance matrix of the observations, are highly sensitive to the presence of outliers. This is both a blessing and a curse. It is a blessing because, if we estimate the parameters with the outliers excluded, their effect is appreciable and apparent if we then include them for estimation. It is however a curse because it can be hard to detect which observations are outliers. We use the forward search for this purpose.

Table of Contents

Frontmatter
1. Examples of Multivariate Data
Abstract
In our first example the data form a 200 × 6 matrix: six readings on the dimensions of the heads of 200 young men. This rectangular array is the form of all our data sets, an n × v matrix representing v observations on each of n units, here people. In most examples we first look at a scatterplot matrix of the data and then fit a multivariate normal distribution. This model can be fitted to any such rectangular array of numbers, so we need to explore the data to see whether this is an appropriate model for these data. Some departures from the multivariate normal model include:
  • The presence of a single outlier;
  • The presence of a group of outliers;
  • Two or more distinct groups in the data;
  • The need for a transformation to obtain approximate normality of the data.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
2. Multivariate Data and the Forward Search
Abstract
Unlike the other chapters in the book, this chapter contains little data analysis. The emphasis is on theory and on the description of the search. In the first half of the chapter we provide distributional results on estimation, testing and on the distribution of quantities such as squared Mahalanobis distances from samples of size n.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
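The squared Mahalanobis distances mentioned in the Chapter 2 abstract, and the general idea of a search that grows a fitted subset one unit at a time, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the default initial subset size and the rule for choosing the starting subset (units closest to the coordinate-wise median) are assumptions made here for illustration, and the data are simulated.

```python
import numpy as np


def squared_mahalanobis(X, subset):
    """Squared Mahalanobis distances of every row of X, using the mean and
    covariance matrix estimated from the rows indexed by `subset`."""
    mean = X[subset].mean(axis=0)
    cov = np.cov(X[subset], rowvar=False)
    diff = X - mean
    # Solve cov @ Z = diff.T rather than inverting cov explicitly.
    return np.einsum("ij,ji->i", diff, np.linalg.solve(cov, diff.T))


def forward_search(X, initial_size=None):
    """Return the list of fitted subsets, from the initial one up to all n units."""
    n, v = X.shape
    m = initial_size or 2 * v  # hypothetical default, not taken from the book
    # Assumed starting rule: the m units closest to the coordinate-wise median.
    scaled = (X - np.median(X, axis=0)) / X.std(axis=0, ddof=1)
    subset = np.argsort((scaled ** 2).sum(axis=1))[:m]
    subsets = [subset]
    while len(subset) < n:
        d2 = squared_mahalanobis(X, subset)
        # Next subset: the units with the (current size + 1) smallest distances.
        subset = np.argsort(d2)[: len(subset) + 1]
        subsets.append(subset)
    return subsets


# Toy data (simulated, not from the book): 60 clean units plus one gross outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 3)), [[20.0, 20.0, 20.0]]])
subsets = forward_search(X, initial_size=20)
```

In this toy setting the gross outlier (row 60) should be the last unit to join, so plotting the distances against subset size would show its effect only at the final step.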
3. Data from One Multivariate Distribution
Abstract
In this chapter we extend our analyses of the examples in Chapter 1 in order to display further features of the forward search. We use our analysis of the Swiss heads data to exemplify the properties of bivariate boxplots for data analysis. As a preparation for material on transformations of data in Chapter 4 we compare analyses of the data on national track records for women when the response is the time for the race and also its reciprocal, speed. This transformation leads to an appreciably simpler analysis. Our further analysis of the data on municipalities in Emilia-Romagna focuses on the last sixteen units to enter the forward search. For part of our analysis we reduce the data to five selected variables that explain much of the structure of the outliers. The last example is the data on Swiss bank notes. We analyse all 200 observations together and also look at the two groups separately. Forward plots of individual Mahalanobis distances, calibrated by plots of a large number of units of known origin, are shown to be a powerful tool for determining group membership.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
4. Multivariate Transformations to Normality
Abstract
The analysis of data is often improved by using a transformation of the response, rather than the original response itself. There are physical reasons why a transformation might be expected to be helpful in some examples. If the data arise from a counting process, they often have a Poisson distribution and the square root transformation will provide observations with an approximately constant variance, independent of the mean. Similarly, concentrations are nonnegative variables and so cannot strictly be subject to additive errors of constant variance. Such effects are most noticeable if there are observations both close to, and far from, zero as they are for the viscosity measurements of the babyfood data introduced in §2.13.2. In this chapter we analyze such data using the multivariate version of the parametric family of power transformations introduced by Box and Cox (1964).
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
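For a single positive variable y, the Box and Cox (1964) power family is y(λ) = (y^λ − 1)/λ for λ ≠ 0 and log y for λ = 0. A multivariate version can be sketched by giving each column its own λ. The Python function below is an illustrative sketch only: the name `box_cox` and its interface are assumptions, it uses the unnormalized form of the transformation, and it does not attempt to estimate λ as the chapter does.

```python
import numpy as np


def box_cox(Y, lambdas):
    """Apply the Box-Cox power transformation column by column.

    Y       : (n, v) array of strictly positive observations
    lambdas : length-v sequence, one transformation parameter per column
    """
    Y = np.asarray(Y, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    if Y.shape[1] != lambdas.shape[0]:
        raise ValueError("need exactly one lambda per column of Y")
    if np.any(Y <= 0):
        raise ValueError("the Box-Cox transformation needs strictly positive data")
    out = np.empty_like(Y)
    for j, lam in enumerate(lambdas):
        if lam == 0:
            out[:, j] = np.log(Y[:, j])  # limiting case as lambda -> 0
        else:
            out[:, j] = (Y[:, j] ** lam - 1.0) / lam
    return out


# Toy input: log-transform the first column, square-root-type transform the second.
Y = np.array([[1.0, 4.0], [np.e, 9.0]])
Z = box_cox(Y, [0.0, 0.5])
```

With λ = 0.5 the result is a linear function of the square root, the transformation the abstract mentions for Poisson counts; λ = −1 corresponds in the same way to the reciprocal.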
5. Principal Components Analysis
Abstract
Principal components analysis is a way of reducing the number of variables in the model. It may be that some of the variables are highly correlated with each other, so that not all are needed for a description of the subject of study; perhaps a few linear combinations of the variables would suffice. Other variables may be unrelated to any features of interest. The data on communities in Emilia-Romagna offer many such possibilities. In Chapter 4 we arbitrarily divided the variables into three groups. But do we need all the nine demographic variables in order to describe the variation in the communities or would a few variables suffice, or a few combinations of variables? Then the other variables would be contributing nothing but noise to the measurements.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
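The idea that "a few linear combinations of the variables would suffice" can be made concrete with a short Python sketch of principal components computed from the eigendecomposition of the sample covariance matrix. The function name and return values are assumptions made for illustration; working from the covariance rather than the correlation matrix is also a choice of this sketch, not something stated in the abstract.

```python
import numpy as np


def principal_components(X):
    """Principal components from the sample covariance matrix of X (n x v)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs        # coordinates of the units on the components
    explained = eigvals / eigvals.sum()
    return eigvals, eigvecs, scores, explained


# Toy data: two perfectly collinear variables, so one component carries everything.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
eigvals, eigvecs, scores, explained = principal_components(X)
```

The vector `explained` gives the proportion of total variance attributable to each component, which is one common basis for deciding how many components to keep.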
6. Discriminant Analysis
Abstract
In discriminant analysis the multivariate observations are divided into g groups the membership of which is assumed known without error. The purpose of the analysis is to develop a rule for the allocation of a new observation of unknown origin to the most likely group. For example, in the case of the Swiss bank notes there are two groups, genuine notes and forgeries. The purpose of the analysis would be to develop a rule for determining whether or not a new note was genuine.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
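One simple allocation rule of the kind described in the Chapter 6 abstract assigns a new observation to the group whose mean is nearest in Mahalanobis distance, using a covariance matrix pooled over the groups. The Python sketch below assumes equal prior probabilities and a common covariance matrix; the function names and the toy data are hypothetical, and this is not claimed to be the rule developed in the chapter.

```python
import numpy as np


def fit_groups(X, labels):
    """Estimate group means and the pooled within-group covariance matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, v = X.shape
    means = {g: X[labels == g].mean(axis=0) for g in groups}
    pooled = np.zeros((v, v))
    for g in groups:
        diff = X[labels == g] - means[g]
        pooled += diff.T @ diff
    pooled /= n - len(groups)
    return means, pooled


def allocate(x, means, pooled):
    """Allocate x to the group with the smallest squared Mahalanobis distance."""
    x = np.asarray(x, dtype=float)

    def d2(g):
        diff = x - means[g]
        return float(diff @ np.linalg.solve(pooled, diff))

    return min(means, key=d2)


# Toy training data with known membership (made-up numbers, two groups).
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [5, 5], [6, 5], [5, 6], [6, 6]], dtype=float)
labels = np.array(["group_1"] * 4 + ["group_2"] * 4)
means, pooled = fit_groups(X, labels)
new_unit_group = allocate([0.2, 0.4], means, pooled)
```

Because the rule depends on estimated means and covariances, outliers among the units of known origin can shift the allocation boundary, which is the sensitivity the book sets out to explore.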
7. Cluster Analysis
Abstract
In cluster analysis the multivariate observations are to be divided into g groups. The membership of the groups is not known, nor is the number of groups. The situation is seemingly different from that of discriminant analysis considered in Chapter 6 where both the number of groups and group membership are known. However, there is much in common between our procedure for clustering and the methods we used in the earlier chapters.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
8. Spatial Linear Models
Abstract
The main goal of spatial modelling is to provide a description of continuous or categorical phenomena observed at locations (i.e. points or surfaces) in space. By far the most common applications have been on the earth’s surface, where each location can be described by a two-dimensional vector of geographical coordinates. One example is the analysis of yield from spatially contiguous plots in an experimental design, when the plots are subject to different treatments the effects of which are to be estimated. Another major example in environmental sciences is the study of pollution data recorded at a number of monitoring stations within the same area.
Anthony C. Atkinson, Marco Riani, Andrea Cerioli
Backmatter
Metadata
Title
Exploring Multivariate Data with the Forward Search
Authors
Anthony C. Atkinson
Marco Riani
Andrea Cerioli
Copyright year
2004
Publisher
Springer New York
Electronic ISBN
978-0-387-21840-3
Print ISBN
978-1-4419-2353-0
DOI
https://doi.org/10.1007/978-0-387-21840-3