1994 | Book

# Modern Applied Statistics with S-Plus

Authors: W. N. Venables, B. D. Ripley

Publisher: Springer New York

Book Series: Statistics and Computing

Included in: Professional Book Archive

S-Plus is a powerful environment for statistical and graphical analysis of data. It provides the tools to implement many statistical ideas which have been made possible by the widespread availability of workstations having good graphics and computational capabilities. This book is a guide to using S-Plus to perform statistical analyses and provides both an introduction to the use of S-Plus and a course in modern statistical methods. The aim of the book is to show how to use S-Plus as a powerful and graphical system. Readers are assumed to have a basic grounding in statistics, and so the book is intended for would-be users of S-Plus, and both students and researchers using statistics. Throughout, the emphasis is on presenting practical problems and full analyses of real data sets.

Abstract

Statistics is fundamentally concerned with the understanding of structures in data. One of the effects of the information-technology era has been to make it much easier to collect extensive datasets with minimal human intervention. Fortunately the same technological advances allow the users of statistics access to much more powerful ‘calculators’ to manipulate and display data. This book is about the modern developments in applied statistics which have been made possible by the widespread availability of workstations with high-resolution graphics and computational power equal to a mainframe of a few years ago. Workstations need software, and the S system developed at AT&T’s Bell Laboratories provides a very flexible and powerful environment in which to implement new statistical ideas. Thus this book provides both an introduction to the use of S and a course in modern statistical methods.

Abstract

S is a language for the manipulation of objects. It aims to be both an interactive language (like, for example, a Unix shell language) and a complete programming language with some convenient object-oriented features. In this chapter we shall be concerned with the interactive language, and hence certain language constructs used mainly in programming will be postponed to Chapter 4.

Abstract

S-PLUS provides comprehensive graphics facilities, from simple facilities for producing common diagnostic plots by plot(object) to fine control over publication-quality graphs. In consequence, the number of graphics parameters is huge. In this chapter, we build up the complexity gradually. Most readers will not need to go beyond the first 3 sections, and indeed the material later in this chapter is not used elsewhere in this book. However, we have needed to make use of it, especially in matching existing graphical styles.

Abstract

The S language is both an interactive language and a language for adding new functions to the S system. It is a complete programming language with control structures, recursion and a useful variety of data types. The S environment provides many functions to handle standard operations, but most users need occasionally to write new functions. This chapter is concerned with designing, writing, testing and correcting your own S functions.

Abstract

In this chapter we cover a number of topics from classical univariate statistics. Many of the functions used are S-PLUS extensions to S.

Abstract

Linear models form the core of classical statistics, and S provides extensive facilities to fit and manipulate them. These work with a version of the Wilkinson-Rogers syntax (Wilkinson & Rogers, 1973) for specifying models which we discuss in Section 6.2, and which is also used for generalized linear models, models for survival analysis and tree-based models in later chapters. The main function for fitting linear models is lm, which provides our first example of a style of S functions we shall see repeatedly in later chapters, producing a fitted model object which is then analysed by generic functions.

Abstract

Generalized linear models (GLMs) extend linear models to accommodate both non-normal response distributions and transformations to linearity. (We will assume that Chapter 6 has been read before this chapter.) The essay by Firth (1991) gives a good introduction to GLMs; the comprehensive reference is McCullagh & Nelder (1989).
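
As a brief sketch of what "transformations to linearity" means, the usual textbook formulation of a GLM can be written as follows. The notation (link function $g$, linear predictor $\eta_i$, covariate vector $x_i$, coefficients $\beta$) is assumed here for illustration and is not taken from the abstract itself.

```latex
% Response Y_i follows an exponential-family distribution
% (e.g. normal, binomial, Poisson) with mean \mu_i.
\[
  \mathrm{E}(Y_i) = \mu_i, \qquad
  g(\mu_i) = \eta_i = x_i^{\top}\beta .
\]
% The ordinary linear model is the special case with a normal
% response and the identity link, g(\mu) = \mu.
```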

Abstract

Outliers are sample values which cause surprise in relation to the majority of the sample. This is not a pejorative term; outliers may be correct, but they should always be checked for transcription errors. They can play havoc with standard statistical methods, and many robust and resistant methods have been developed since 1960 to be less sensitive to outliers.
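
A minimal sketch of why resistant summaries matter, written in Python rather than S and using only the standard library. The sample values are made up for illustration; the last one mimics a transcription error (100.0 entered instead of 10.0). The helper `mad` is a hypothetical name for the median absolute deviation, scaled by the conventional factor 1.4826 so that it is comparable to a standard deviation for normal data.

```python
import statistics

def mad(values):
    """Median absolute deviation, scaled by 1.4826 (normal-consistency factor)."""
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return 1.4826 * statistics.median(deviations)

# Made-up sample: six plausible readings and one gross outlier.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 100.0]

# Classical summaries are dragged towards the outlier ...
classical_location = statistics.mean(sample)
classical_scale = statistics.stdev(sample)

# ... while the resistant summaries barely move.
resistant_location = statistics.median(sample)
resistant_scale = mad(sample)

print("mean / sd     :", classical_location, classical_scale)
print("median / MAD  :", resistant_location, resistant_scale)
```

The median and MAD are only the simplest resistant estimators; they stand in here for the wider family of robust methods the chapter abstract refers to.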

Abstract

In linear regression the mean surface in sample space is a plane. In non-linear regression the mean surface may be an arbitrary curved surface but in other respects the models are similar. In practice the mean surface in most non-linear regression models will be approximately planar in the region(s) of high likelihood allowing good approximations based on linear regression techniques to be used. Non-linear regression models can still present tricky computational and inferential problems. (Indeed, the examples here exceeded the capacity of S-PLUS for Windows 3.1.)

Abstract

S-PLUS has a ‘Modern Regression Module’ which contains functions for a number of regression methods. These are not necessarily non-linear in the sense of Chapter 9, which refers to a non-linear parametrization, but they do allow nonlinear functions of the independent variables to be chosen by the procedures. The methods are all fairly computer-intensive, and so are only feasible in the era of plentiful computing power (and hence are ‘modern’). Some of these methods are part of the S modelling language, and others have been added by S-PLUS. As the latter predate the modelling language and have not been updated, the functions of this chapter do not have a consistent style and user interface.

Abstract

Survival analysis is not part of S, but has been added to S-PLUS based on functions written by Terry Therneau (Mayo Foundation) and available as survival2 code from statlib (see Appendix D for further information). The functions in survival3 were released in mid-1992. They are not part of S-PLUS 3.2, but are scheduled to be included in late 1994. As these functions are much easier to use and provide a higher capability, this chapter is based on their use. (This does mean that the methods are probably not accessible to Windows users at present, as the library uses C code. Section 11.6 sketches how to use survival2, for those who have no other choice.)

Abstract

Multivariate analysis is concerned with datasets which have more than one response variable for each observational or experimental unit. The datasets can be summarized by data matrices X with n rows and p columns, the rows representing the observations or cases, and the columns the variables. The matrix can be viewed either way, depending on whether the main interest is in the relationships between the cases or between the variables. Note that for consistency we represent the variables of a case by the row vector x.

Abstract

The use of tree-based models will be relatively unfamiliar to statisticians, although researchers in other fields have found trees to be an attractive way to express knowledge and aid decision-making. Keys such as Figure 13.1 are common in botany and in medical decision-making, and provide a way to encapsulate and structure the knowledge of experts to be used by less-experienced users. Notice how this tree uses both categorical variables and splits on continuous variables.

Abstract

There are now a large number of books on time series. Our philosophy and notation are close to those of the applied book by Diggle (1990) (from which some of our examples are taken). Brockwell and Davis (1991) and Priestley (1981) provide more theoretical treatments, and Bloomfield (1976) and Priestley are particularly thorough on spectral analysis.

Abstract

Spatial statistics is a recent and graphical subject which is ideally suited to implementation in S. S itself includes one spatial interpolation method, akima, and loess, which can be used for two-dimensional smoothing, but the specialist methods of spatial statistics have been added and are given in our library spatial. The main references for spatial statistics are Ripley (1981, 1988), Diggle (1983), Upton & Fingleton (1985) and Cressie (1991). Not surprisingly, our notation is closest to Ripley (1981).