About this book

S-PLUS is a powerful environment for the statistical and graphical analysis of data. It provides the tools to implement many statistical ideas which have been made possible by the widespread availability of workstations having good graphics and computational capabilities. This book is a guide to using S-PLUS to perform statistical analyses and provides both an introduction to the use of S-PLUS and a course in modern statistical methods. S-PLUS is available for both Windows and UNIX workstations, and both versions are covered in depth. The aim of the book is to show how to use S-PLUS as a powerful and graphical system. Readers are assumed to have a basic grounding in statistics, so the book is intended for would-be users of S-PLUS and for both students and researchers using statistics. Throughout, the emphasis is on presenting practical problems and full analyses of real data sets. Many of the methods discussed are state-of-the-art approaches to topics such as linear and non-linear regression models, robust and smooth regression methods, survival analysis, multivariate analysis, tree-based methods, time series, spatial statistics, and classification.

This second edition is intended for users of S-PLUS 3.3, 4.0, or later. It covers the recent developments in graphics and new statistical functionality, including bootstrapping, mixed-effects linear and non-linear models, factor analysis, and regression with autocorrelated errors. The material on S-PLUS programming has been rewritten to explain the full story behind the object-oriented programming features. The authors have written several software libraries which enhance S-PLUS; these and all the datasets used are available on the Internet in versions for Windows and UNIX. There are also on-line complements covering advanced material, further exercises and new features of S-PLUS as they are introduced.

Dr. Venables is Head of Department and Senior Lecturer at the Department of

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract
Statistics is fundamentally concerned with the understanding of structures in data. One of the effects of the information-technology era has been to make it much easier to collect extensive datasets with minimal human intervention. Fortunately the same technological advances allow the users of statistics access to much more powerful ‘calculators’ to manipulate and display data. This book is about the modern developments in applied statistics which have been made possible by the widespread availability of workstations with high-resolution graphics and computational power equal to a mainframe of a few years ago. Workstations need software, and the S system developed at what was then AT&T’s Bell Laboratories (and now at Lucent Technologies) provides a very flexible and powerful environment in which to implement new statistical ideas. Thus this book provides both an introduction to the use of S and a course in modern statistical methods.
W. N. Venables, B. D. Ripley

Chapter 2. The S Language

Abstract
S is a language for the manipulation of objects. It aims to be both an interactive language (like, for example, a Unix shell language) and a complete programming language with some convenient object-oriented features. In this chapter we shall be concerned with the interactive language, and hence certain language constructs used mainly in programming will be postponed to Chapter 4.
W. N. Venables, B. D. Ripley

Chapter 3. Graphical Output

Abstract
S-PLUS provides comprehensive graphics facilities, from simple facilities for producing common diagnostic plots by plot(object) to fine control over publication-quality graphs. In consequence, the number of graphics parameters is huge. In this chapter, we build up the complexity gradually. Most readers will not need the material in Section 3.4, and indeed the material there is not used elsewhere in this book. However, we have needed to make use of it, especially in matching existing graphical styles.
W. N. Venables, B. D. Ripley

Chapter 4. Programming in S

Abstract
The S language is both an interactive language and a language for adding new functions to the S-PLUS system. It is a complete programming language with control structures, recursion and a useful variety of data types. The S-PLUS environment provides many functions to handle standard operations, but most users need occasionally to write new functions. This chapter is concerned with designing, writing, testing and correcting your own S functions.
W. N. Venables, B. D. Ripley

Chapter 5. Distributions and Data Summaries

Abstract
In this chapter we cover a number of topics from classical univariate statistics. Many of the functions used are S-PLUS extensions to S.
W. N. Venables, B. D. Ripley

Chapter 6. Linear Statistical Models

Abstract
Linear models form the core of classical statistics, and S provides extensive facilities to fit and investigate them. These work with a version of the Wilkinson-Rogers notation (Wilkinson & Rogers, 1973) for specifying models which we discuss in Section 6.2.
W. N. Venables, B. D. Ripley

Chapter 7. Generalized Linear Models

Abstract
Generalized linear models (GLMs) extend linear models to accommodate both non-normal response distributions and transformations to linearity. (We will assume that Chapter 6 has been read before this chapter.) The essay by Firth (1991) gives a good introduction to GLMs; the comprehensive reference is McCullagh & Nelder (1989).
W. N. Venables, B. D. Ripley
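
As a brief illustration of the two extensions named in the Chapter 7 abstract, the standard GLM formulation can be written as below. The notation is the conventional one and is an assumption here, since the abstract itself introduces no symbols: $Y_i$ is the response, $\mu_i$ its mean, $x_i$ the vector of explanatory variables, $\beta$ the coefficients and $g$ the link function.

```latex
% Response distribution: a member of the exponential family, not necessarily normal
Y_i \sim \text{exponential family}, \qquad \mathrm{E}(Y_i) = \mu_i
% Transformation to linearity: the link g maps the mean onto the linear predictor
g(\mu_i) = \eta_i = x_i^{\mathsf{T}} \beta
```

Taking the normal distribution with the identity link $g(\mu) = \mu$ gives back the linear model of Chapter 6.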

Chapter 8. Robust Statistics

Abstract
Outliers are sample values which cause surprise in relation to the majority of the sample. This is not a pejorative term; outliers may be correct, but they should always be checked for transcription errors. They can play havoc with standard statistical methods, and many robust and resistant methods have been developed since 1960 to be less sensitive to outliers.
W. N. Venables, B. D. Ripley
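
To make the Chapter 8 abstract concrete, here is a minimal sketch of how one outlier affects a standard summary compared with a resistant one. It is written in Python rather than S, and the data values, the simulated transcription error and the helper name `mad` are all hypothetical, chosen only for illustration; none of them comes from the book.

```python
from statistics import mean, median


def mad(values):
    """Unscaled median absolute deviation from the median (hypothetical helper)."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


# Hypothetical measurements; the last value of `contaminated` mimics a
# transcription error (10.2 entered as 102.0).
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean[:-1] + [102.0]

for label, sample in (("clean", clean), ("contaminated", contaminated)):
    print(label, "mean:", mean(sample), "median:", median(sample), "MAD:", mad(sample))
```

In this sketch the mean is pulled far from the bulk of the data by the single bad value, while the median and the MAD barely move. That is the sense in which such summaries are resistant.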

Chapter 9. Non-linear Models

Abstract
In linear regression the mean surface in sample space is a plane; in non-linear regression it may be an arbitrary curved surface, but in all other respects the models are the same. Fortunately, in practice the mean surface in most non-linear regression models will be approximately planar in the region of highest likelihood, allowing some good approximations based on linear regression techniques to be used, but non-linear regression models can still present tricky computational and inferential problems.
W. N. Venables, B. D. Ripley

Chapter 10. Random and Mixed Effects

Abstract
We collect together several ways to handle linear and non-linear models with random effects, possibly as well as fixed effects.
W. N. Venables, B. D. Ripley

Chapter 11. Modern Regression

Abstract
S-PLUS has a ‘Modern Regression Module’ which contains functions for a number of regression methods. These are not necessarily non-linear in the sense of Chapter 9, which refers to a non-linear parametrization, but they do allow non-linear functions of the independent variables to be chosen by the procedures. The methods are all fairly computer-intensive, and so are only feasible in the era of plentiful computing power (and hence are ‘modern’). Some of these methods are part of the S modelling language, and others have been added by S-PLUS. As the latter predate the modelling language and have not been updated, the functions of this chapter do not have a consistent style and user interface.
W. N. Venables, B. D. Ripley

Chapter 12. Survival Analysis

Abstract
S-PLUS contains extensive survival analysis facilities written by Terry Therneau (Mayo Foundation). The functions in S-PLUS 3.3 and 3.4 are modified versions of code available from statlib as survival4 (see Appendix C for further information); S-PLUS 3.2 and earlier versions used the rather different survival2. As the code for survival4 is available, we strongly recommend its use.
W. N. Venables, B. D. Ripley

Chapter 13. Multivariate Analysis

Abstract
Multivariate analysis is concerned with datasets which have more than one response variable for each observational or experimental unit. The datasets can be summarized by data matrices X with n rows and p columns, the rows representing the observations or cases, and the columns the variables. The matrix can be viewed either way, depending on whether the main interest is in the relationships between the cases or between the variables. Note that for consistency we represent the variables of a case by the row vector x.
W. N. Venables, B. D. Ripley

Chapter 14. Tree-based Methods

Abstract
The use of tree-based models will be relatively unfamiliar to statisticians, although researchers in other fields have found trees to be an attractive way to express knowledge and aid decision-making. Keys such as Figure 14.1 are common in botany and in medical decision-making, and provide a way to encapsulate and structure the knowledge of experts to be used by less-experienced users. Notice how this tree uses both categorical variables and splits on continuous variables.
W. N. Venables, B. D. Ripley

Chapter 15. Time Series

Abstract
There are now a large number of books on time series. Our philosophy and notation are close to those of the applied book by Diggle (1990) (from which some of our examples are taken). Brockwell & Davis (1991) and Priestley (1981) provide more theoretical treatments, and Bloomfield (1976) and Priestley are particularly thorough on spectral analysis. Brockwell & Davis (1996) is an excellent low-level introduction to the theory.
W. N. Venables, B. D. Ripley

Chapter 16. Spatial Statistics

Abstract
Spatial statistics is a recent and graphical subject which is ideally suited to implementation in S; S itself includes one spatial interpolation method, akima, and loess, which can be used for two-dimensional smoothing, but the specialist methods of spatial statistics have been added and are given in our library spatial. The main references for spatial statistics are Ripley (1981, 1988), Diggle (1983), Upton & Fingleton (1985) and Cressie (1991). Not surprisingly, our notation is closest to that of Ripley (1981).
W. N. Venables, B. D. Ripley

Chapter 17. Classification

Abstract
Classification is an increasingly important application of modern methods in statistics. In the statistical literature the word is used in two distinct senses. The entry (Hartigan, 1982) in the original Encyclopedia of Statistical Sciences uses the sense of cluster analysis discussed in Chapter 13, and this is the main business of the International Federation of Classification Societies. Modern usage is leaning to the other meaning (Ripley, 1997) of allocating future cases to one of g classes. This is similar to discriminant analysis, but we are not interested in the differences between the classes per se. Medical diagnosis is an archetypal classification problem in the modern sense. (The older statistical literature sometimes refers to this as allocation.)
W. N. Venables, B. D. Ripley

Backmatter
