Interactive and Dynamic Graphics for Data Analysis

With R and GGobi

Authors: Dianne Cook, Deborah F. Swayne

Publisher: Springer New York

Book series: Use R!

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

This book is about using interactive and dynamic plots on a computer screen as part of data exploration and modeling, both alone and as a partner with static graphics and non-graphical computational methods. The area of interactive and dynamic data visualization emerged within statistics as part of research on exploratory data analysis in the late 1960s, and it remains an active subject of research today, as its use in practice continues to grow. It now makes substantial contributions within computer science as well, as part of the growing fields of information visualization and data mining, especially visual data mining.

The material in this book includes:

• An introduction to data visualization, explaining how it differs from other types of visualization.
• A description of our toolbox of interactive and dynamic graphical methods.
• An approach for exploring missing values in data.
• An explanation of the use of these tools in cluster analysis and supervised classification.
• An overview of additional material available on the web.
• A description of the data used in the analyses and exercises.

The book’s examples use the software R and GGobi. R (Ihaka & Gentleman 1996, R Development Core Team 2006) is a free software environment for statistical computing and graphics; it is most often used from the command line, provides a wide variety of statistical methods, and includes high-quality static graphics. R arose in the Statistics Department of the University of Auckland and is now developed and maintained by a global collaborative effort.

Table of Contents

Frontmatter

1. Introduction

In this technological age, we live in a sea of information. We face the problem of gleaning useful knowledge from masses of words and numbers stored in computers. Fortunately, the computing technology that produces this deluge also gives us some tools to transform heterogeneous information into knowledge. We now rely on computers at every stage of this transformation: structuring and exploring information, developing models, and communicating knowledge.

In this book we teach a methodology that makes visualization central to the process of abstracting knowledge from information. Computers give us great power to represent information in pictures, but even more, they give us the power to interact with these pictures. If these are pictures of data, then interaction gives us the feeling of having our hands on the data itself and helps us to orient ourselves in the sea of information. By generating and manipulating many pictures, we make comparisons among different views of the data, we pose queries about the data and get immediate answers, and we discover large patterns and small features of interest. These are essential facets of data exploration, and they are important for model development and diagnosis. In this first chapter we sketch the history of computer-aided data visualization and the role of data visualization in the process of data analysis.

2. The Toolbox

The tools used to perform the analyses described in this book come largely from two “toolboxes.” One is R, which we use extensively for data management and manipulation, modeling, and static plots. Since R is well documented elsewhere, both in books (Dalgaard 2002, Venables & Ripley 2002, Murrell 2005) and on the web (R-project.org), we will say very little about it here.

Instead, we emphasize the less familiar tools drawn from GGobi, our other major toolbox: a set of direct manipulations that we apply to a set of plot types. With these plot types and manipulations we can construct and interact with multiple views, linked so that an action in one can affect them all. The examples described throughout the book are based on these tools.
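
The idea of linked views, where an action in one plot affects all the others, can be sketched as a shared selection that every view listens to. This is a minimal illustration in plain Python and is not GGobi's API; the class names `LinkedSelection` and `View`, the method `brush`, and the view names are hypothetical and invented for the example.

```python
class LinkedSelection:
    """Shared brushing state: the set of row indices currently highlighted."""

    def __init__(self):
        self._selected = set()
        self._listeners = []

    def subscribe(self, callback):
        """Register a function to call whenever the selection changes."""
        self._listeners.append(callback)

    def brush(self, rows):
        """Replace the selection and notify every subscribed view."""
        self._selected = set(rows)
        for callback in self._listeners:
            callback(self._selected)


class View:
    """A stand-in for one plot; it only tracks which rows it would highlight."""

    def __init__(self, name, selection):
        self.name = name
        self.highlighted = set()
        self._selection = selection
        selection.subscribe(self._on_brush)

    def _on_brush(self, rows):
        # Copy the shared selection so each view owns its own state.
        self.highlighted = set(rows)

    def brush(self, rows):
        """A user action in this view is routed through the shared selection."""
        self._selection.brush(rows)


selection = LinkedSelection()
scatterplot = View("scatterplot", selection)
histogram = View("histogram", selection)

# Brushing rows 1 and 4 in the scatterplot highlights them in both views.
scatterplot.brush({1, 4})
for view in (scatterplot, histogram):
    print(view.name, sorted(view.highlighted))
```

Routing every action through one shared object, rather than having views call each other, keeps the sketch symmetric: any view can start a brush and adding another view needs no change to the existing ones.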

3. Missing Values

Values are often missing in data, for several reasons. Measuring instruments fail, samples are lost or corrupted, patients do not show up to scheduled appointments, and measurements may be deliberately censored if they are known to be untrustworthy above or below certain thresholds. When this happens, it is always necessary to evaluate the nature and the distribution of the gaps, to see whether a remedy must be applied before further analysis of the data. If too many values are missing, or if gaps on one variable occur in association with other variables, ignoring them may invalidate the results of any analysis that is performed. This sort of association is not at all uncommon and may be directly related to the test conditions of the study. For example, when measuring instruments fail, they often do so under conditions of stress, such as high temperature or humidity. As another example, a lot of the missing values in smoking cessation studies occur for those people who begin smoking again and silently withdraw from the study, perhaps out of discouragement or embarrassment.
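
The check described above, whether gaps on one variable occur in association with another variable, can be sketched in a few lines. This is a minimal illustration in plain Python rather than the book's R and GGobi tooling; the records, the field names `temperature` and `reading`, the helper `missing_rate_by_group`, and the threshold of 30.0 are all hypothetical and invented for the example.

```python
# Hypothetical instrument log: None marks a missing reading.
records = [
    {"temperature": 18.0, "reading": 4.1},
    {"temperature": 21.5, "reading": 3.9},
    {"temperature": 35.0, "reading": None},
    {"temperature": 19.0, "reading": 4.3},
    {"temperature": 38.5, "reading": None},
    {"temperature": 36.0, "reading": 5.0},
    {"temperature": 22.0, "reading": None},
    {"temperature": 20.0, "reading": 4.0},
]


def missing_rate_by_group(rows, value_key, group_key, threshold):
    """Share of missing `value_key` entries below and at/above a threshold on `group_key`."""
    counts = {"low": [0, 0], "high": [0, 0]}  # [missing, total] per group
    for row in rows:
        group = "high" if row[group_key] >= threshold else "low"
        counts[group][1] += 1
        if row[value_key] is None:
            counts[group][0] += 1
    return {
        group: (missing / total if total else float("nan"))
        for group, (missing, total) in counts.items()
    }


rates = missing_rate_by_group(records, "reading", "temperature", threshold=30.0)
for group, rate in rates.items():
    print(f"{group} temperature: {rate:.0%} of readings missing")
```

A large difference between the two rates is a hint that the gaps are associated with temperature, which is the situation in which simply ignoring them may distort a later analysis.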

4. Supervised Classification

When you browse your email, you can usually tell right away whether a message is spam. Still, you probably do not enjoy spending your time identifying spam and have come to rely on a filter to do that task for you, either deleting the spam automatically or filing it in a different mailbox. An email filter is based on a set of rules applied to each incoming message, tagging it as spam or “ham” (not spam). Such a filter is an example of a supervised classification algorithm. It is formulated by studying a training sample of email messages that have been manually classified as spam or ham. Information in the header and text of each message is converted into a set of numerical variables such as the size of the email, the domain of the sender, or the presence of the word “free.” These variables are used to define rules that determine whether an incoming message is spam or ham.
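
The step from a manually classified training sample to a rule can be sketched with the simplest possible learner, a single threshold on one variable. This is a deliberately tiny stand-in for a real filter, written in plain Python rather than the book's R and GGobi tooling; the training messages, the variable names `size_kb`, `has_free`, and `num_links`, and the helpers `fit_stump` and `classify` are hypothetical and invented for the example.

```python
# Hypothetical training sample: each message is already converted to numerical
# variables and labeled by hand as spam (True) or ham (False).
training = [
    ({"size_kb": 2.0, "has_free": 1, "num_links": 6}, True),
    ({"size_kb": 3.5, "has_free": 1, "num_links": 1}, True),
    ({"size_kb": 1.0, "has_free": 1, "num_links": 8}, True),
    ({"size_kb": 40.0, "has_free": 0, "num_links": 1}, False),
    ({"size_kb": 12.0, "has_free": 0, "num_links": 0}, False),
    ({"size_kb": 2.5, "has_free": 0, "num_links": 2}, False),
    ({"size_kb": 8.0, "has_free": 0, "num_links": 1}, False),
]


def classify(rule, features):
    """Apply a rule (variable, cutoff, spam_if_above) to one message."""
    variable, cutoff, spam_if_above = rule
    return (features[variable] > cutoff) == spam_if_above


def fit_stump(sample):
    """Pick the single threshold rule that misclassifies the fewest training messages."""
    best_errors, best_rule = None, None
    for variable in sample[0][0]:
        values = sorted({features[variable] for features, _ in sample})
        # Candidate cutoffs are midpoints between consecutive distinct values.
        for cutoff in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for spam_if_above in (True, False):
                rule = (variable, cutoff, spam_if_above)
                errors = sum(
                    classify(rule, features) != is_spam
                    for features, is_spam in sample
                )
                if best_errors is None or errors < best_errors:
                    best_errors, best_rule = errors, rule
    return best_rule


rule = fit_stump(training)
incoming = {"size_kb": 5.0, "has_free": 1, "num_links": 3}
print("learned rule:", rule)
print("incoming message is", "spam" if classify(rule, incoming) else "ham")
```

On this invented sample the presence of the word “free” separates the two classes without error, so the search settles on that variable; real training data is rarely this clean and a practical filter combines many such rules.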

5. Cluster Analysis

The aim of unsupervised classification, or cluster analysis, is to organize observations into similar groups. Cluster analysis is a commonly used, appealing, and conceptually intuitive statistical method. Some of its uses include market segmentation, where customers are grouped into clusters with similar attributes for targeted marketing; gene expression analysis, where genes with similar expression patterns are grouped together; and the creation of taxonomies of animals, insects, or plants. A cluster analysis results in a simplification of a dataset for two reasons: first, because the dataset can be summarized by a description of each cluster, and second, because each cluster, which is now relatively homogeneous, can be analyzed separately. Thus, it can be used to effectively reduce the size of massive amounts of data.
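
The two simplifications named above, summarizing the dataset by a description of each cluster and then treating each cluster separately, can be sketched with k-means, one common clustering method. The chapter summary does not say which algorithms the book uses, so k-means is an assumption made here for illustration, written in plain Python rather than the book's R and GGobi tooling; the customer data, the starting centers, and the function name `kmeans` are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 20), (2, 25), (1.5, 22), (10, 80), (11, 85), (9, 90)]


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
            clusters[nearest].append(point)
        # Each new center is the coordinate-wise mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters


# Starting centers are chosen by hand here so the run is reproducible.
centers, clusters = kmeans(customers, [customers[0], customers[3]])

# The dataset is now summarized by one description (size and mean) per cluster.
for center, cluster in zip(centers, clusters):
    print(f"{len(cluster)} customers around visits={center[0]:.1f}, spend={center[1]:.1f}")
```

Six records reduce to two cluster descriptions here; each list in `clusters` can then be analyzed on its own as a relatively homogeneous group.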

6. Miscellaneous Topics

Analysts often encounter data that cannot be fully analyzed using the methods presented in the preceding chapters. In this chapter we introduce, however briefly, some of these different kinds of data and the additional methods needed to analyze them, beginning with a section on inference. A more complete treatment of each topic will be found on the book web site.

7. Datasets

Description: Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law the restaurant offered seating in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule.

Backmatter

Title: Interactive and Dynamic Graphics for Data Analysis
Authors: Dianne Cook, Deborah F. Swayne
Publisher: Springer New York
Electronic ISBN: 978-0-387-71762-3
Print ISBN: 978-0-387-71761-6
DOI: https://doi.org/10.1007/978-0-387-71762-3