
2011 | Book

An Introduction to Applied Multivariate Analysis with R

Authors: Brian Everitt, Torsten Hothorn

Publisher: Springer New York

Book series: Use R!


About this book

The majority of data sets collected by researchers in all disciplines are multivariate, meaning that several measurements, observations, or recordings are taken on each of the units in the data set. These units might be human subjects, archaeological artifacts, countries, or a vast variety of other things. In a few cases, it may be sensible to isolate each variable and study it separately, but in most instances all the variables need to be examined simultaneously in order to fully grasp the structure and key features of the data. For this purpose, one or another method of multivariate analysis might be helpful, and it is with such methods that this book is largely concerned. Multivariate analysis includes methods both for describing and exploring such data and for making formal inferences about them. The aim of all the techniques is, in a general sense, to display or extract the signal in the data in the presence of noise and to find out what the data show us in the midst of their apparent chaos.

An Introduction to Applied Multivariate Analysis with R explores the correct application of these methods so as to extract as much information as possible from the data at hand, particularly as some type of graphical representation, via the R software. Throughout the book, the authors give many examples of R code used to apply the multivariate techniques to multivariate data.

Table of Contents

Frontmatter
1. Multivariate Data and Multivariate Analysis
Abstract
Multivariate data arise when researchers record the values of several random variables on a number of subjects or objects or perhaps one of a variety of other things (we will use the general term “units”) in which they are interested, leading to a vector-valued or multidimensional observation for each. Such data are collected in a wide range of disciplines, and indeed it is probably reasonable to claim that the majority of data sets met in practice are multivariate. In some studies, the variables are chosen by design because they are known to be essential descriptors of the system under investigation. In other studies, particularly those that have been difficult or expensive to organise, many variables may be measured simply to collect as much information as possible as a matter of expediency or economy.
Brian Everitt, Torsten Hothorn
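A bare-bones sketch of the kind of object the chapter has in mind: a data frame in which each row is a unit and each column a measured variable. The numbers below are invented purely for illustration and are not taken from the book.

## A multivariate data set in R: rows are units, columns are variables,
## so each unit contributes one vector-valued (multidimensional) observation.
X <- data.frame(
  age    = c(21, 43, 22, 86),
  weight = c(60.0, 71.5, 55.2, 88.0),
  height = c(178, 165, 181, 175)
)
X[1, ]   # the multidimensional observation recorded for the first unit
dim(X)   # number of units by number of variables
cov(X)   # sample covariance matrix summarising the joint variation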
2. Looking at Multivariate Data: Visualisation
Abstract
According to Chambers, Cleveland, Kleiner, and Tukey (1983), “there is no statistical tool that is as powerful as a well-chosen graph”. Certainly graphical presentation has a number of advantages over tabular displays of numerical results, not least in creating interest and attracting the attention of the viewer. But just what is a graphical display? A concise description is given by Tufte (1983): Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading and color.
Brian Everitt, Torsten Hothorn
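A minimal sketch of one such graphical display in base R, using the built-in USArrests data rather than any data set from the book:

## Scatterplot matrix: points and coordinate systems combined to show every
## pairwise relationship among the variables at once.
data(USArrests)             # built-in example data: four variables on 50 US states
pairs(USArrests, pch = 20)  # base-R scatterplot matrix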
3. Principal Components Analysis
Abstract
One of the problems with a lot of sets of multivariate data is that there are simply too many variables to make the application of the graphical techniques described in the previous chapters successful in providing an informative initial assessment of the data. And having too many variables can also cause problems for other multivariate techniques that the researcher may want to apply to the data. The possible problem of too many variables is sometimes known as the curse of dimensionality (Bellman 1961). Clearly the scatterplots, scatterplot matrices, and other graphics included in Chapter 2 are likely to be more useful when the number of variables in the data, the dimensionality of the data, is relatively small rather than large. This brings us to principal components analysis, a multivariate technique with the central aim of reducing the dimensionality of a multivariate data set while accounting for as much as possible of the variation present in the original data set. This aim is achieved by transforming to a new set of variables, the principal components, that are linear combinations of the original variables, which are uncorrelated and are ordered so that the first few of them account for most of the variation in all the original variables. In the best of all possible worlds, the result of a principal components analysis would be the creation of a small number of new variables that can be used as surrogates for the originally large number of variables and consequently provide a simpler basis for, say, graphing or summarising the data, and also perhaps when undertaking further multivariate analyses of the data.
Brian Everitt, Torsten Hothorn
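A minimal sketch of a principal components analysis in base R, again using the built-in USArrests data rather than an example from the book:

## Standardise the variables (scale. = TRUE) so the components are extracted
## from the correlation matrix, then see how much variation the first few
## components account for.
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)    # proportion of variance accounted for by each component
pca$rotation    # the loadings defining each linear combination of the original variables
plot(pca$x[, 1:2], xlab = "PC1", ylab = "PC2")   # the units in the space of the first two components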
4. Multidimensional Scaling
Abstract
In Chapter 3, we noted in passing that one of the most useful ways of using principal components analysis was to obtain a low-dimensional “map” of the data that preserved as far as possible the Euclidean distances between the observations in the space of the original q variables. In this chapter, we will make this aspect of principal component analysis more explicit and also introduce a class of other methods, labelled multidimensional scaling, that aim to produce similar maps of data but do not operate directly on the usual multivariate data matrix, X. Instead they are applied to distance matrices (see Chapter 1), which are derived from the matrix X (an example of a distance matrix derived from a small set of multivariate data is shown in Subsection 4.4.2), and also to so-called dissimilarity or similarity matrices that arise directly in a number of ways, in particular from judgements made by human raters about how alike pairs of objects, stimuli, etc., of interest are. An example of a directly observed dissimilarity matrix is shown in Table 4.5, with judgements about political and war leaders who had major roles in World War II being given by a subject after receiving the simple instructions to rate each pair of politicians on a nine-point scale, with 1 indicating two politicians they regard as very similar and 9 indicating two they regard as very dissimilar. (If the nine-point scale had been 1 for very dissimilar and 9 for very similar, then the result would have been a rating of similarity, although similarities are often scaled to lie in a [0, 1] interval. The term proximity is often used to encompass both dissimilarity and similarity ratings.)
Brian Everitt, Torsten Hothorn
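A minimal sketch of classical multidimensional scaling in base R; the starting point is a distance matrix derived from the data rather than the raw data matrix X, and the built-in USArrests data stand in for the book's examples:

## Classical (metric) scaling of a Euclidean distance matrix into two dimensions.
d   <- dist(scale(USArrests))          # inter-unit distances derived from X
mds <- cmdscale(d, k = 2, eig = TRUE)  # a two-dimensional "map" of the units
plot(mds$points, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds$points, labels = rownames(USArrests), cex = 0.6)
## mds$eig indicates how faithfully two dimensions reproduce the original distances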
5. Exploratory Factor Analysis
Abstract
In many areas of psychology, and other disciplines in the behavioural sciences, it is often not possible to measure directly the concepts of primary interest. Two obvious examples are intelligence and social class. In such cases, the researcher is forced to examine the concepts indirectly by collecting information on variables that can be measured or observed directly and can also realistically be assumed to be indicators, in some sense, of the concepts of real interest. The psychologist who is interested in an individual’s “intelligence”, for example, may record examination scores in a variety of different subjects in the expectation that these scores are dependent in some way on what is widely regarded as “intelligence” but are also subject to random errors. And a sociologist, say, concerned with people’s “social class” might pose questions about a person’s occupation, educational background, home ownership, etc., on the assumption that these do reflect the concept he or she is really interested in.
Brian Everitt, Torsten Hothorn
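A minimal sketch of an exploratory factor analysis with base R's factanal(); the built-in ability.cov covariance matrix (six ability test scores) stands in for the book's own examples:

## Maximum-likelihood factor analysis: two latent factors are assumed to underlie
## the six observed test scores; a varimax rotation aids interpretation.
efa <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")
print(efa, cutoff = 0.3)   # loadings of the observed variables on the two factors
## factanal() also reports a likelihood-ratio test of whether two factors are sufficient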
6. Cluster Analysis
Abstract
One of the most basic abilities of living creatures involves the grouping of similar objects to produce a classification. The idea of sorting similar things into categories is clearly a primitive one because early humans, for example, must have been able to realise that many individual objects shared certain properties such as being edible, or poisonous, or ferocious, and so on. And classification in its widest sense is needed for the development of language, which consists of words that help us to recognise and discuss the different types of events, objects, and people we encounter. Each noun in a language, for example, is essentially a label used to describe a class of things that have striking features in common; thus animals are called cats, dogs, horses, etc., and each name collects individuals into groups. Naming and classifying are essentially synonymous.
Brian Everitt, Torsten Hothorn
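The chapter then turns to formal clustering methods; as a minimal illustrative sketch (not code from the book), base R's hierarchical clustering groups units by their pairwise distances:

## Agglomerative hierarchical clustering with complete linkage.
d  <- dist(scale(USArrests))          # pairwise distances between the units
hc <- hclust(d, method = "complete")  # successively merge the closest groups
plot(hc, cex = 0.6)                   # dendrogram of the nested classification
cutree(hc, k = 4)                     # one possible partition into four groups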
7. Confirmatory Factor Analysis and Structural Equation Models
Abstract
An exploratory factor analysis as described in Chapter 5 is used in the early investigation of a set of multivariate data to determine whether the factor analysis model is useful in providing a parsimonious way of describing and accounting for the relationships between the observed variables. The analysis will determine which observed variables are most highly correlated with the common factors and how many common factors are needed to give an adequate description of the data. In an exploratory factor analysis, no constraints are placed on which manifest variables load on which factors. In this chapter, we will consider confirmatory factor analysis models in which particular manifest variables are allowed to relate to particular factors whilst other manifest variables are constrained to have zero loadings on some of the factors. A confirmatory factor analysis model may arise from theoretical considerations or be based on the results of an exploratory factor analysis where the investigator might wish to postulate a specific model for a new set of similar data, one in which the loadings of some variables on some factors are fixed at zero because they were “small” in the exploratory analysis and perhaps to allow some pairs of factors but not others to be correlated. It is important to emphasise that whilst it is perfectly appropriate to arrive at a factor model to submit to a confirmatory analysis from an exploratory factor analysis, the model must be tested on a fresh set of data. Models must not be generated and tested on the same data.
Brian Everitt, Torsten Hothorn
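A minimal sketch of such a constrained model; the chapter may well use a different R package, so the lavaan syntax below, the variable names x1 to x6, and the data frame my_data are assumptions made purely for illustration:

## Confirmatory factor analysis: each manifest variable loads on one named factor
## only, all other loadings are fixed at zero, and the two factors are allowed to
## correlate (lavaan's default behaviour). All names here are hypothetical.
library(lavaan)
model <- '
  verbal  =~ x1 + x2 + x3   # x4-x6 have zero loadings on verbal
  spatial =~ x4 + x5 + x6   # x1-x3 have zero loadings on spatial
'
fit <- cfa(model, data = my_data)   # my_data: a hypothetical data frame of manifest variables
summary(fit, fit.measures = TRUE, standardized = TRUE)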
8. The Analysis of Repeated Measures Data
Abstract
The multivariate data sets considered in previous chapters have involved measurements or observations on a number of different variables for each object or individual in the study. In this chapter, however, we will consider multivariate data of a different nature, namely data resulting from the repeated measurements of the same variable on each unit in the data set. Examples of such data are common in many disciplines. But before we introduce some actual repeated measures data sets, we need to make a small digression in order to introduce the two different “formats”, the wide and the long forms, in which such data are commonly stored and dealt with.
Brian Everitt, Torsten Hothorn
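A minimal sketch of the two formats using an invented data set; base R's reshape() converts between them (the book's own examples may differ):

## Wide form: one row per unit, one column per measurement occasion.
wide <- data.frame(id      = 1:3,
                   score.1 = c(10, 12, 9),
                   score.2 = c(14, 13, 11),
                   score.3 = c(17, 15, 12))

## Long form: one row per measurement, with an explicit occasion variable.
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2", "score.3"),
                v.names = "score", timevar = "occasion", idvar = "id")
head(long)   # columns: id, occasion, score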
Backmatter
Metadata
Title
An Introduction to Applied Multivariate Analysis with R
Authors
Brian Everitt
Torsten Hothorn
Copyright year
2011
Publisher
Springer New York
Electronic ISBN
978-1-4419-9650-3
Print ISBN
978-1-4419-9649-7
DOI
https://doi.org/10.1007/978-1-4419-9650-3