
2006 | Book

Graphics of Large Datasets

Visualizing a Million

Written by: Antony Unwin, Martin Theus, Heike Hofmann

Publisher: Springer New York

Book series: Statistics and Computing


About this book

Graphics are great for exploring data, but how can they be used for looking at the large datasets that are commonplace today? This book shows ways of visualizing large datasets, whether large in numbers of cases, large in numbers of variables, or large in both. Data visualization is useful for data cleaning, exploring data, identifying trends and clusters, spotting local patterns, evaluating modeling output, and presenting results. It is essential for exploratory data analysis and data mining. Data analysts, statisticians, computer scientists, indeed anyone who has to explore a large dataset of their own, should benefit from reading this book.

New approaches to graphics are needed to visualize the information in large datasets and most of the innovations described in this book are developments of standard graphics. There are considerable advantages in extending displays which are well-known and well-tried, both in understanding how best to make use of them in your work and in presenting results to others. It should also make the book readily accessible for readers who already have a little experience of drawing statistical graphics. All ideas are illustrated with displays from analyses of real datasets and the authors emphasize the importance of interpreting displays effectively. Graphics should be drawn to convey information and the book includes many insightful examples.

From the reviews:

"Anyone interested in modern techniques for visualizing data will be well rewarded by reading this book. There is a wealth of important plotting types and techniques." Paul Murrell for the Journal of Statistical Software, December 2006

"This fascinating book looks at the question of visualizing large datasets from many different perspectives. Different authors are responsible for different chapters and this approach works well in giving the reader alternative viewpoints of the same problem. Interestingly the authors have cleverly chosen a definition of 'large dataset'. Essentially they focus on datasets with the order of a million cases. As the authors point out there are now many examples of much larger datasets but by limiting to ones that can be loaded in their entirety in standard statistical software they end up with a book that has great utility to the practitioner rather than just the theorist. Another very attractive feature of the book is the many colour plates, showing clearly what can now routinely be seen on the computer screen. The interactive nature of data analysis with large datasets is hard to reproduce in a book but the authors make an excellent attempt to do just this." P. Marriott for the Short Book Reviews of the ISI

Table of Contents

Frontmatter

Introduction

1. Introduction
Antony Unwin

Basics

Frontmatter
2. Statistical Graphics
Martin Theus
3. Scaling Up Graphics
3.7 Summary
The design and implementation of statistical graphics should pay attention to the challenges from big datasets. For many users, this has not been an issue up till now and so some statistical and graphics packages can have problems with graphics of more than 10,000 cases.
However, most of the plots used in statistical graphics can be scaled up to be usable with large datasets. Areal plots for categorical data are quite robust against large data, whereas glyph-based plots have more serious problems. Modifications like α-blending or binning, interactions like (logical) zooming and panning, or interactive reordering and grouping are of great assistance when dealing with large datasets.
In general, all statistical graphics that summarize the data, and plot some version of these summaries, will scale up to large datasets. Barcharts, for instance, plot the breakdown of a categorical variable, which is a sufficient summary to fully describe the data. Binned scatterplots show an approximation of the underlying scatterplot and have a complexity that depends on the (constant) size of the binning grid rather than on the size of the dataset.
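The point about binned scatterplots can be illustrated with a minimal sketch. It is not taken from the book: the function name `bin_scatter`, its parameters, and the 50 × 50 grid are assumptions chosen for illustration. The sketch reduces any number of points to a fixed grid of counts, so the amount of data handed to the drawing step depends on the grid size and not on the number of cases.

```python
import random


def bin_scatter(xs, ys, bins=50, x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Count points per cell of a bins x bins grid (hypothetical helper).

    The returned grid always has bins * bins cells, however many points
    are passed in; only the counts inside the cells grow.
    """
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi):
            continue  # points outside the plotting window are skipped
        # Map the coordinate to a cell index; the upper edge falls in the last cell.
        i = min(int((x - x_lo) / (x_hi - x_lo) * bins), bins - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * bins), bins - 1)
        counts[i][j] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1_000, 1_000_000):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        grid = bin_scatter(xs, ys)
        cells = sum(len(row) for row in grid)
        total = sum(sum(row) for row in grid)
        print(f"n={n}: {cells} cells to draw, {total} points counted")
```

Both runs yield the same 2,500 cells to draw; a plotting layer would then map each count to a colour or symbol size. Binning itself still takes one pass over the data, so only the rendering cost becomes independent of the dataset size.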
Martin Theus
4. Interacting with Graphics
Antony Unwin

Applications

Frontmatter
5. Multivariate Categorical Data — Mosaic Plots
Heike Hofmann
6. Rotating Plots
Dianne Cook, Leslie Miller
7. Multivariate Continuous Data — Parallel Coordinates
7.7 Summary
This chapter has introduced a smooth modified version of the parallel coordinate plot. The modifications are based on a parameter transformation process and its geometric structure. The mathematics behind the new plot has been explained with views that show how patterns may be detected in a dataset. The smooth curves have several significant features, including a norm-reducing property and orthogonal crossings of the axes.
Although not explicitly mentioned, the analysis of the datasets in the two examples needed a lot of interactions with the software. Actions like reordering and rescaling of axes (cf. Section 4.4.3) or the application of density estimation procedures are necessary steps towards a meaningful and presentable visualization.
Rida Moustafa, Ed Wegman
8. Networks
Graham Wills
9. Trees
Simon Urbanek
10. Transactions
Bárbara González-Arévalo, Félix Hernández-Campos, Steve Marron, Cheolwoo Park
11. Graphics of a Large Dataset
Antony Unwin, Martin Theus
Backmatter
Metadata
Title
Graphics of Large Datasets
Written by
Antony Unwin
Martin Theus
Heike Hofmann
Copyright year
2006
Publisher
Springer New York
Electronic ISBN
978-0-387-37977-7
Print ISBN
978-0-387-32906-2
DOI
https://doi.org/10.1007/0-387-37977-0