2008 | Book

Lattice

Multivariate Data Visualization with R

Written by: Deepayan Sarkar

Publisher: Springer New York

Book series: Use R!

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

R is rapidly growing in popularity as the environment of choice for data analysis and graphics both in academia and industry. Lattice brings the proven design of Trellis graphics (originally developed for S by William S. Cleveland and colleagues at Bell Labs) to R, considerably expanding its capabilities in the process. Lattice is a powerful and elegant high-level data visualization system that is sufficient for most everyday graphics needs, yet flexible enough to be easily extended to handle the demands of cutting-edge research. Written by the author of the lattice system, this book describes it in considerable depth, beginning with the essentials and systematically delving into specific low-level details as necessary. No prior experience with lattice is required to read the book, although basic familiarity with R is assumed. The book contains close to 150 figures produced with lattice. Many of the examples emphasize principles of good graphical design; almost all use real data sets that are publicly available in various R packages. All code and figures in the book are also available online, along with supplementary material covering more advanced topics.

Table of contents

Frontmatter

Basics

1. Introduction

The traditional graphics subsystem in R is very flexible when it comes to producing standard statistical graphics. It provides a collection of high-level plotting functions that produce entire coherent displays, several low-level routines to enhance such displays and provide finer control over the various elements that make them up, and a system of parameters that allows global control over defaults and other details. However, this system is not very proficient at combining multiple plots in a page. It is quite straightforward to produce such plots; however, doing so in an effective manner, with properly coordinated scales, aspect ratios, and labels, is a fairly complex task that is difficult even for the experienced R user. Trellis graphics, originally implemented in S, was designed to address this shortcoming. The lattice add-on package provides similar capabilities for R users.

The name “Trellis” comes from the trellis-like rectangular array of panels of which such displays often consist. Although Trellis graphics is typically associated with multiple panels, it is also possible to create single-panel Trellis displays, which look very much like traditional high-level R plots. There are subtle differences, however, mostly stemming from an important design goal of Trellis graphics, namely, to make optimum use of the available display area. Even single-panel Trellis displays are usually as good as, if not better than, their traditional counterparts in terms of default choices. Overall, Trellis graphics is intended to be a more mature substitute for traditional statistical graphics in R. As such, this book assumes no prior knowledge of traditional R graphics; in fact, too much familiarity with it can be a hindrance, as some basic assumptions that are part and parcel of traditional R graphics may have to be unlearned. However, there are many parallels between the two: both provide high-level functions to produce comprehensive statistical graphs, both provide fine control over annotation and tools to augment displays, and both employ a system of user-modifiable global parameters that control the details of the display. This chapter gives a preview of Trellis graphics using a few examples; details follow in later chapters.

2. A Technical Overview of lattice

This chapter gives a broad overview of lattice, briefly describing the most important features shared by all high-level functions. Some of the topics covered are somewhat technical, but they are important motifs in the “big picture” view of lattice, and it would hinder rather than help to introduce them later at arbitrary points in the book. Readers who are new to lattice are advised to skim this chapter and move on to the subsequent chapters. Each of the remaining chapters in Part I can be read, for the most part, directly after Chapter 1, although some advanced examples do require some groundwork laid out in this chapter. This nonlinear flow is inconvenient for those new to lattice, but it is somewhat inevitable; one should not expect to learn all the complexities of Trellis graphics in a first reading.

3. Visualizing Univariate Distributions

Visualizing the distribution of a single continuous variable is a common graphical task for which several specialized methods have evolved. The distribution of a random variable X is defined by the corresponding cumulative distribution function (CDF) F(x) = P(X ≤ x). For continuous random variables, or more precisely, random variables with an absolutely continuous CDF, an equivalent representation is the density f(x) = F'(x). One is often also interested in the inverse of F, the quantile function. R provides these functions for many standard distributions; for example, pnorm(), dnorm(), and qnorm() give the distribution, density, and quantile functions, respectively, for the normal distribution. Most of the visualization methods discussed in this chapter involve estimating these functions from data. In particular, density plots and histograms display estimates of the density f, and quantile plots and box-and-whisker plots are based on (partial) estimates of F or its inverse.

Although the mathematical relationships between the theoretical constructs are well-defined, there are no natural relationships between their standard estimates. Furthermore, the task of visualization comes with its own special rules; two plots with exactly the same information can put visual emphasis on entirely different aspects of that information. Thus, the appropriateness of a particular visualization depends to a large extent on the purpose of the analysis. We discuss the merits of different visualizations as we encounter them, but it is helpful to keep this background in mind when reading about them.
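The relationships between the distribution, density, and quantile functions described above can be checked numerically, and the corresponding estimates can be drawn with lattice. The following is a minimal sketch rather than an example from the book; the choice of the built-in faithful data set and all variable names are assumptions made for illustration.

```r
library(lattice)

# Theoretical functions for the standard normal distribution.
x <- 1.5
p <- pnorm(x)   # CDF: F(x) = P(X <= x)
d <- dnorm(x)   # density: f(x) = F'(x)
q <- qnorm(p)   # quantile function: inverse of F, so q recovers x

# Numerical check that the density is the derivative of the CDF
# (central difference approximation).
h <- 1e-5
d_numeric <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)

# Estimates of these functions from data (Old Faithful eruption lengths).
dens_plot <- densityplot(~ eruptions, data = faithful)  # kernel density estimate of f
hist_plot <- histogram(~ eruptions, data = faithful)    # histogram estimate of f
qq_plot   <- qqmath(~ eruptions, data = faithful)       # sample quantiles vs. normal quantiles

# print(dens_plot) draws a plot on the current graphics device.
```

Each call returns an object, so the three displays can be drawn one after another and compared for what they emphasize in the same data.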

4. Displaying Multiway Tables

An important subset of statistical data comes in the form of tables. Tables usually record the frequency or proportion of observations that fall into a particular category or combination of categories. They could also encode some other summary measure such as a rate (of binary events) or mean (of a continuous variable). In R, tables are usually represented by arrays of one (vectors), two (matrices), or more dimensions. To distinguish them from other vectors and arrays, they often have class “table”. The R functions table() and xtabs() can be used to create tables from raw data.

Graphs of tables do not always convey information more easily than the tables themselves, but they often do. The barchart() and dotplot() functions in lattice are designed to display tabulated data. As with other high-level functions, the primary formula interface requires the data to be available as a data frame. The as.data.frame.table() function can be used for converting tables to suitable data frames. In addition, there are methods in lattice that work directly on tables. We focus on the latter in this chapter; examples using the formula interface can be found in Chapter 2.
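The two routes described above, plotting a table directly or converting it to a data frame first, can be sketched as follows. The data sets (mtcars and Titanic, both shipped with R) and the variable names are assumptions chosen for illustration, not necessarily the examples used in the chapter.

```r
library(lattice)

# xtabs() builds a contingency table from raw data in a data frame.
gear_by_cyl <- xtabs(~ cyl + gear, data = mtcars)

# Titanic is a four-way table (Class x Sex x Age x Survived).
# Converting it gives one row per cell, with the count in a Freq column.
titanic_df <- as.data.frame.table(Titanic)

# Route 1: the barchart() method that works directly on a table.
p_table <- barchart(Titanic, stack = TRUE, auto.key = TRUE)

# Route 2: the formula interface applied to the converted data frame.
p_formula <- barchart(Class ~ Freq | Sex + Age, data = titanic_df,
                      groups = Survived, stack = TRUE, auto.key = TRUE)

# print(p_table) or print(p_formula) draws the corresponding display.
```

The formula route takes more typing but makes explicit which variable goes on which axis and which ones define panels and groups.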

5. Scatter Plots and Extensions

The scatter plot is possibly the single most important statistical graphic. In this chapter we discuss the xyplot() function, which can be used to produce several variants of scatter plots, and splom(), which produces scatter-plot matrices. We also include a brief discussion of parallel coordinates plots, as produced by parallel(), which are related to scatter-plot matrices in terms of the kinds of data they are used to visualize, although not so much in the actual visual encoding.

A scatter plot graphs two variables directly against each other in a Cartesian coordinate system. It is a simple graphic in the sense that the data are directly encoded without being summarized in any way; often the aspects that the user needs to worry about most are graphical ones such as whether to join the points by a line, what colors to use, and so on. Depending on the purpose, scatter plots can also be enhanced in several ways. In this chapter, we go over some of the variants supported by panel.xyplot(), which is the default panel function for both xyplot() and splom() (under the alias panel.splom()).
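A short sketch of the three kinds of display mentioned above follows; the data sets and variable names are assumptions. Note that in current releases of lattice the parallel coordinates function is named parallelplot(); the name parallel() used in the text refers to the version of the package available when the book was written.

```r
library(lattice)

# Conditioned scatter plot; type = c("p", "r") asks panel.xyplot() to draw
# the points and add a least-squares regression line in each panel.
p_xy <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
               type = c("p", "r"), layout = c(3, 1))

# Scatter-plot matrix of the four iris measurements, grouped by species.
p_splom <- splom(~ iris[1:4], groups = Species, data = iris)

# Parallel coordinates plot of the same variables, one panel per species.
p_par <- parallelplot(~ iris[1:4] | Species, data = iris)

# print(p_xy), print(p_splom), or print(p_par) draws the display.
```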

6. Trivariate Displays

Trivariate displays encode three primary variables in a panel. There are four high-level functions in lattice that produce trivariate displays: cloud() creates three-dimensional scatter plots of unstructured trivariate data, whereas levelplot(), contourplot(), and wireframe() render surfaces or two-dimensional tables evaluated on a systematic rectangular grid. Of these, cloud() and wireframe() are similar in that they both create two-dimensional projections of three-dimensional constructs, and they share several common arguments that control the details of the projection.
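The four functions can be tried on data sets shipped with R. This is only a sketch under assumed choices: iris for unstructured data, the volcano elevation matrix for gridded data, and arbitrary viewing angles passed through the shared screen argument.

```r
library(lattice)

# Viewing direction shared by the two projection-based displays.
view <- list(z = 20, x = -70)

# Three-dimensional scatter plot of unstructured trivariate data.
p_cloud <- cloud(Sepal.Length ~ Petal.Length * Petal.Width, data = iris,
                 groups = Species, screen = view)

# Displays of a surface evaluated on a rectangular grid (a matrix).
p_level   <- levelplot(volcano)                              # color-coded levels
p_contour <- contourplot(volcano, cuts = 10)                 # contour lines
p_wire    <- wireframe(volcano, shade = TRUE, screen = view) # projected surface

# print(p_wire), for example, draws the surface.
```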

Finer Control

7. Graphical Parameters and Other Settings

In the second part of this book, we take a detailed look at features that are common to all high-level lattice functions, providing a uniform interface to control their output. We start, in this chapter, by describing the system of user-settable graphical parameters and other global options.

Graphical parameters are often critical in determining the effectiveness of a plot. Such parameters include obvious ones such as colors, symbols, line types, and fonts for the various elements of a graph, as well as more subtle ones such as the length of tick marks or the amount of space separating different components of the graph. The parameters used in lattice displays are highly customizable. Many of them can be controlled directly by specifying suitable arguments in a high-level function call. Most derive their default values from a system of common global settings that can also be modified by the user. The latter approach has two primary benefits: it allows good global defaults to be specified, and it provides a consistent “look and feel” to lattice graphics while letting the user retain ultimate control.

Not all parameters of interest are graphical. For example, a user may dislike the default argument value as.table = FALSE (which orders panels starting from the lower-left corner rather than the upper-left one), and wish to change the default globally rather than specify an additional argument in every call. Several such non-graphical parameters can be customized, through a slightly different system of global options. Both these systems are discussed in this chapter.
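The two systems can be sketched as follows: trellis.par.get() and trellis.par.set() for graphical parameters, and lattice.options() for other defaults such as as.table. The particular values are arbitrary, and the null PDF device is an assumption made only so that the sketch runs without a screen or an output file; in an interactive session that line and the final dev.off() can be dropped.

```r
library(lattice)

# Graphical settings belong to a device; a null PDF device draws nothing.
pdf(file = NULL)

# Graphical parameters: query a setting, then change part of it.
old_symbol <- trellis.par.get("plot.symbol")
trellis.par.set(list(plot.symbol = list(pch = 16, col = "black")))
new_symbol <- trellis.par.get("plot.symbol")

# The same kind of change for a single plot, without touching global settings.
p <- xyplot(mpg ~ wt, data = mtcars,
            par.settings = list(plot.symbol = list(pch = 16, col = "black")))

# Non-graphical options: make as.table = TRUE the default for later calls.
old_options <- lattice.options(default.args = list(as.table = TRUE))
as_table_default <- lattice.getOption("default.args")$as.table

# Restore the previous options and close the device.
lattice.options(old_options)
dev.off()
```

Changing a global default affects every later plot in the session, whereas par.settings keeps the change local to one object.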

8. Plot Coordinates and Axis Annotation

In this chapter, we discuss how the coordinate system for each panel is determined, how axes are annotated, and how one might control these in a lattice display. Control is possible at several levels, with a trade-off between the degree of control desired and the amount of effort required to achieve it.
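As a rough illustration of the simpler levels of control, the scales argument adjusts how axes are computed and annotated, while xlim, ylim, and aspect fix the coordinate system directly. The data set and the specific values below are assumptions.

```r
library(lattice)

# Log-transformed x-axis, and a separate y-axis range for each panel.
p_scales <- xyplot(mpg ~ wt | factor(cyl), data = mtcars, layout = c(3, 1),
                   scales = list(x = list(log = 10),
                                 y = list(relation = "free", tick.number = 4)))

# Explicit axis limits and a square panel.
p_limits <- xyplot(mpg ~ wt, data = mtcars,
                   xlim = c(1, 6), ylim = c(5, 40), aspect = 1)

# print(p_scales) or print(p_limits) draws the display.
```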

9. Labels and Legends

In this chapter, we discuss annotation of lattice displays by adding labels and legends. As usual, there are various levels of control available to the user, with corresponding differences in the amount of work involved. Most common needs for annotation are satisfied by various labels giving descriptive names for the variables and titles for the entire plot. Legends are usually needed to explain the correspondence between varying graphical parameters such as color, plotting character, and so on, and the quantitative information they represent.
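A minimal sketch of the most common case: axis labels and a title supplied as arguments, and a legend generated from the grouping variable with auto.key. The label text and legend placement are assumptions chosen for illustration.

```r
library(lattice)

p_labeled <- xyplot(Sepal.Length ~ Petal.Length, data = iris,
                    groups = Species,
                    main = "Iris measurements",
                    xlab = "Petal length (cm)",
                    ylab = "Sepal length (cm)",
                    # Legend explaining which symbol belongs to which species.
                    auto.key = list(space = "right", title = "Species"))

# print(p_labeled) draws the plot with its labels and legend.
```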

10. Data Manipulation and Related Topics

Now that we have had a chance to look at several types of lattice plots and ways to control their various elements, it is time to take another look at the big picture and introduce some new ideas. This chapter may be viewed as a continuation of Chapter 2; the topics covered are slightly more advanced, but generally apply to all lattice functions.

11. Manipulating the “trellis” Object

The Trellis paradigm is different from traditional R graphics in an important respect: high-level “plotting” functions in lattice produce objects rather than any actual graphics output. As with other objects in R, these objects can be assigned to variables, stored on disk in serialized form to be recovered in a later session, and otherwise manipulated in various ways. They can also be plotted, which is all we want to do in the vast majority of cases. Throughout this book, we have largely focused on this last task. In this chapter, we take a closer look at the implications of the object-based design and how one might take advantage of it.
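The object-based design can be illustrated with a short sketch: a plot is stored, modified with update(), subsetted, and saved to disk without anything being drawn. The data set, title, and temporary file are assumptions.

```r
library(lattice)

# The call returns a "trellis" object; nothing is drawn yet.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars)

# Modify a copy of the object instead of repeating the original call.
p_titled <- update(p, main = "Fuel economy by weight")

# Keep only the first panel.
p_first <- p[1]

# Store the object on disk and recover it, as one could in a later session.
rds_file <- tempfile(fileext = ".rds")
saveRDS(p, rds_file)
p_restored <- readRDS(rds_file)

# Plotting is a separate, explicit step: print(p_titled)
```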

12. Interacting with Trellis Displays

High-level functions in lattice produce “trellis” objects that can be thought of as abstract representations of visualizations. An actual rendering of a visualization is produced by plotting the corresponding object using the appropriate print() or plot() method. In this chapter, we discuss things the user can do after this plotting has been completed.

One possible approach is to treat the result as any other graphic created using the grid package, and make further enhancements to the display using the low-level tools available in grid. In particular, the display consists of a tree of viewports, and various grid graphical objects (grobs) drawn within them. The user can move down to any of these viewports and add further objects, or, less commonly, edit the properties of existing objects. The precise details of these operations are beyond the scope of this book, but are discussed by Murrell (2005). In this chapter, we focus entirely on a higher-level interface in the lattice package for similar tasks, which is less flexible, but usually sufficient. The playwith package (Andrews, 2007) provides a user-friendly GUI wrapper for many of these facilities.
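One part of the higher-level interface is trellis.focus(), which moves to the viewport of a chosen panel after the plot has been drawn so that panel functions can add to it. The sketch below assumes a null PDF device so that it runs non-interactively; on a screen device the same calls modify the visible plot, and the panel can instead be chosen by clicking.

```r
library(lattice)

pdf(file = NULL)  # assumption: null device, used only so the sketch runs anywhere

p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars, layout = c(3, 1))
print(p)          # the display must be drawn before it can be revisited

# Matrix showing which panel occupies which position in the drawn layout.
layout_matrix <- trellis.currentLayout()

# Move into the first panel and add a reference line to that panel only.
trellis.focus("panel", column = 1, row = 1, highlight = FALSE)
panel.abline(h = mean(mtcars$mpg), lty = 2)
trellis.unfocus()

dev.off()
```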

Extending Trellis Displays

13. Advanced Panel Functions

R is a complete programming language that allows, and indeed encourages, its users to go beyond the canned uses built into the system. The transition from user to programmer can be intimidating for the beginner to contemplate, but is almost inevitable after a point. In the context of lattice, this transition is most often necessitated by a desire to customize the display in small ways, perhaps just to add a common reference line to all panels. Such customizations are fairly basic in any serious use of lattice, and we have seen a number of examples throughout this book. In this chapter, which is meant for the more advanced user, we take a more formal look at panel functions, give pointers to the tools that might help in writing new ones, and finally discuss some nontrivial examples.
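The reference-line customization mentioned above can be sketched with a small custom panel function. It draws a common horizontal line first and then delegates to the default panel.xyplot(); the data set and the choice of the overall mean as the reference value are assumptions.

```r
library(lattice)

# Common reference value shown in every panel.
overall_mean <- mean(mtcars$mpg)

p_ref <- xyplot(mpg ~ wt | factor(cyl), data = mtcars, layout = c(3, 1),
                panel = function(x, y, ...) {
                  # x and y hold only the data for the current panel.
                  panel.abline(h = overall_mean, lty = 2, col = "grey40")
                  panel.xyplot(x, y, ...)  # then the usual scatter plot
                })

# print(p_ref) draws the three panels, each with the same dashed line.
```

Passing the remaining arguments on through `...` keeps options such as type or pch working as they do with the default panel function.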

14. New Trellis Displays

Each high-level function in lattice is intended to create a certain type of statistical display by default. Many variations are already built into the default panel functions and can be activated with additional arguments in a high-level function call itself. More extensive modifications can be made by writing custom panel functions, as we have seen throughout this book and particularly in Chapter 13.

Although panel functions can be used to implement entirely novel visualizations, trying to shoehorn such a display into a function intended for another purpose is mostly useful as a one-off, quick-and-dirty solution. For a systematic implementation that could perhaps be used by others, it is often more sensible to create a new function whose name better reflects the nature of the visualization. On the other hand, existing function names are sometimes perfectly appropriate, and it is the data which are in a form that is not directly usable. A typical example of this is a univariate time series; there is really only one choice for the x and y variables in the xyplot() call that produced Figure 10.17, and the need for a new function to hide the use of a formula seems wasteful.

Rather than trying to anticipate all potential use cases, lattice provides the groundwork for further extensions by making use of the object-oriented features of R. Each high-level function in lattice is generic, with method dispatch possible on the first argument x and possibly (using the formal S4 system) the second argument data. New high-level display functions can be written either as new methods for existing generic functions, or, if it seems appropriate, as an entirely new function which should itself be generic to allow further specialized methods. In this chapter, we give examples of both new methods and new high-level functions implemented using the framework provided by lattice. These can, it is hoped, serve as models for further extensions.
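Writing a new method for an existing generic can be sketched as follows. The class toy_series and the helper make_series() are hypothetical names invented for this illustration; they are not part of lattice, which provides its own method for time-series objects. The method hides the formula from the user by building a data frame and calling the formula method of xyplot().

```r
library(lattice)

# Hypothetical S3 class holding a sequence of values.
make_series <- function(values) {
  structure(list(values = values), class = "toy_series")
}

# New xyplot() method: dispatch happens on the class of the first argument.
xyplot.toy_series <- function(x, data = NULL, ...) {
  df <- data.frame(index = seq_along(x$values), value = x$values)
  # Delegate to the formula method; extra arguments are passed along.
  xyplot(value ~ index, data = df, type = "l", ...)
}

s <- make_series(cumsum(c(1, -2, 3, 0.5, -1)))
p_series <- xyplot(s, xlab = "Index")

# print(p_series) draws the line plot.
```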

Color Plates

Backmatter

Title: Lattice
Written by: Deepayan Sarkar
Publisher: Springer New York
Electronic ISBN: 978-0-387-75969-2
Print ISBN: 978-0-387-75968-5
DOI: https://doi.org/10.1007/978-0-387-75969-2