About this book

S is a powerful environment for the statistical and graphical analysis of data. It provides the tools to implement many statistical ideas that have been made possible by the widespread availability of workstations having good graphics and computational capabilities. This book is a guide to using S environments to perform statistical analyses and provides both an introduction to the use of S and a course in modern statistical methods. Implementations of S are available commercially in S-PLUS® workstations and as the Open Source R for a wide range of computer systems. The aim of this book is to show how to use S as a powerful and graphical data analysis system. Readers are assumed to have a basic grounding in statistics, and so the book is intended for would-be users of S-PLUS or R and both students and researchers using statistics. Throughout, the emphasis is on presenting practical problems and full analyses of real data sets. Many of the methods discussed are state-of-the-art approaches to topics such as linear, nonlinear and smooth regression models, tree-based methods, multivariate analysis, pattern recognition, survival analysis, time series and spatial statistics. Throughout, modern techniques such as robust methods, non-parametric smoothing and bootstrapping are used where appropriate. This fourth edition is intended for users of S-PLUS 6.0 or R 1.5.0 or later. A substantial change from the third edition is updating for the current versions of S-PLUS and adding coverage of R. The introductory material has been rewritten to emphasize the import, export and manipulation of data. Increased computational power allows even more computer-intensive methods to be used, and methods such as GLMMs,

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract
Statistics is fundamentally concerned with the understanding of structure in data. One of the effects of the information-technology era has been to make it much easier to collect extensive datasets with minimal human intervention. Fortunately, the same technological advances allow the users of statistics access to much more powerful ‘calculators’ to manipulate and display data. This book is about the modern developments in applied statistics that have been made possible by the widespread availability of workstations with high-resolution graphics and ample computational power. Workstations need software, and the S system developed at Bell Laboratories (Lucent Technologies, formerly AT&T) provides a very flexible and powerful environment in which to implement new statistical ideas. Lucent’s current implementation of S is exclusively licensed to the Insightful Corporation, which distributes an enhanced system called S-PLUS.
W. N. Venables, B. D. Ripley

Chapter 2. Data Manipulation

Abstract
Statistics is fundamentally about understanding data. We start by looking at how data are represented in S, then move on to importing, exporting and manipulating data.
W. N. Venables, B. D. Ripley

Chapter 3. The S Language

Abstract
S is a language for the manipulation of objects. It aims to be both an interactive language (like, for example, a UNIX shell language) and a complete programming language with some convenient object-oriented features. This chapter is intended for reference use by interactive users; Venables and Ripley (2000) covers more aspects of the language for programmers.
W. N. Venables, B. D. Ripley

Chapter 4. Graphics

Abstract
Both S-PLUS and R provide comprehensive graphics facilities for static two-dimensional plots, from simple facilities for producing common diagnostic plots by plot(object) to fine control over publication-quality graphs. In consequence, the number of graphics parameters is huge. In this chapter, we build up the complexity gradually. Most readers will not need the material in Section 4.4, and indeed the material there is not used elsewhere in this book. However, we have needed to make use of it, especially in matching existing graphical styles.
W. N. Venables, B. D. Ripley

Chapter 5. Univariate Statistics

Abstract
In this chapter we cover a number of topics from classical univariate statistics plus some modern versions.
W. N. Venables, B. D. Ripley

Chapter 6. Linear Statistical Models

Abstract
Linear models form the core of classical statistics and are still the basis of much of statistical practice; many modern modelling and analytical techniques build on the methodology developed for linear models.
W. N. Venables, B. D. Ripley

Chapter 7. Generalized Linear Models

Abstract
Generalized linear models (GLMs) extend linear models to accommodate both non-normal response distributions and transformations to linearity. (We assume that Chapter 6 has been read before this chapter.) The essay by Firth (1991) gives a good introduction to GLMs; the comprehensive reference is McCullagh and Nelder (1989).
W. N. Venables, B. D. Ripley

Chapter 8. Non-Linear and Smooth Regression

Abstract
In linear regression the mean surface is a plane in sample space; in non-linear regression it may be an arbitrary curved surface but in all other respects the models are the same. Fortunately the mean surface in most non-linear regression models met in practice will be approximately planar in the region of highest likelihood, allowing some good approximations based on linear regression to be used, but non-linear regression models can still present tricky computational and inferential problems.
W. N. Venables, B. D. Ripley

Chapter 9. Tree-Based Methods

Abstract
The use of tree-based models may be unfamiliar to statisticians, although researchers in other fields have found trees to be an attractive way to express knowledge and aid decision-making. Keys such as Figure 9.1 are common in botany and in medical decision-making, and provide a way to encapsulate and structure the knowledge of experts to be used by less-experienced users. Notice how this tree uses both categorical variables and splits on continuous variables. (It is a tree, and readers are encouraged to draw it.)
W. N. Venables, B. D. Ripley

Chapter 10. Random and Mixed Effects

Abstract
Models with mixed effects contain both fixed and random effects. Fixed effects are what we have been considering up to now; the only source of randomness in our models arises from regarding the cases as independent random samples. Thus in regression we have an additive measurement error that we assume is independent between cases, and in a GLM we observe independent binomial, Poisson, gamma ... random variates whose mean is a deterministic function of the explanatory variables.
W. N. Venables, B. D. Ripley

Chapter 11. Exploratory Multivariate Analysis

Abstract
Multivariate analysis is concerned with datasets that have more than one response variable for each observational or experimental unit. The datasets can be summarized by data matrices X with n rows and p columns, the rows representing the observations or cases, and the columns the variables. The matrix can be viewed either way, depending on whether the main interest is in the relationships between the cases or between the variables. Note that for consistency we represent the variables of a case by the row vector x.
W. N. Venables, B. D. Ripley

Chapter 12. Classification

Abstract
Classification is an increasingly important application of modern methods in statistics. In the statistical literature the word is used in two distinct senses. The entry (Hartigan, 1982) in the original Encyclopedia of Statistical Sciences uses the sense of cluster analysis discussed in Section 11.2. Modern usage is leaning to the other meaning (Ripley, 1997) of allocating future cases to one of g prespecified classes. Medical diagnosis is an archetypal classification problem in the modern sense. (The older statistical literature sometimes refers to this as allocation.)
W. N. Venables, B. D. Ripley

Chapter 13. Survival Analysis

Abstract
Extensive survival analysis facilities written by Terry Therneau (Mayo Foundation) are available in S-PLUS and in the R package survival.
W. N. Venables, B. D. Ripley

Chapter 14. Time Series Analysis

Abstract
There are now many books on time series. Our philosophy and notation are close to those of the applied book by Diggle (1990) (from which some of our examples are taken). Brockwell and Davis (1991) and Priestley (1981) provide more theoretical treatments, and Bloomfield (2000) and Priestley are particularly thorough on spectral analysis. Brockwell and Davis (1996) and Shumway and Stoffer (2000) provide readable introductions to time series theory and practice.
W. N. Venables, B. D. Ripley

Chapter 15. Spatial Statistics

Abstract
Spatial statistics is a recent and graphical subject that is ideally suited to implementation in S; S-PLUS itself includes one spatial interpolation method, akima, and loess which can be used for two-dimensional smoothing, but the specialist methods of spatial statistics have been added and are given in our library section spatial. The main references for spatial statistics are Ripley (1981, 1988), Diggle (1983), Upton and Fingleton (1985) and Cressie (1991). Not surprisingly, our notation is closest to that of Ripley (1981).
W. N. Venables, B. D. Ripley

Chapter 16. Optimization

Abstract
Statisticians often under-estimate the usefulness of general optimization methods in maximizing likelihoods and in other model-fitting problems. Not only are the general-purpose methods available in the S environments quick to use, they also often outperform the specialized methods that are available. A lot of the software we have illustrated in earlier chapters is based on the functions described in this chapter. Code that seemed slow when the first edition was being prepared in 1993 now seems almost instant.
W. N. Venables, B. D. Ripley

Backmatter
