
About this book

High-dimensional spaces arise as a way of modelling datasets with many attributes. Such a dataset can be directly represented in a space spanned by its attributes, with each record represented as a point in the space with its position depending on its attribute values. Such spaces are not easy to work with because of their high dimensionality: our intuition about space is not reliable, and measures such as distance do not provide as clear information as we might expect.
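The unreliability of distance measures mentioned above is often called distance concentration: as dimensionality grows, the nearest and farthest points from any query tend to lie at nearly the same distance. The following minimal sketch (not from the book; the function name and uniform-random setup are illustrative assumptions) makes this visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=1000):
    """Ratio of farthest to nearest Euclidean distance from one query
    point to a cloud of uniform random points in [0, 1]^dim."""
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return d.max() / d.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  max/min distance ratio: {distance_contrast(dim):.2f}")
```

The ratio is large in two dimensions but shrinks toward 1 as the dimensionality grows, so "nearest" carries progressively less information.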

There are three main areas where large, high-dimensional datasets arise naturally: data collected by online retailers, preference sites, social media sites, and customer relationship databases, where there are large but sparse records available for each individual; data derived from text and speech, where the attributes are words and so the corresponding datasets are wide and sparse; and data collected for security, defense, law enforcement, and intelligence purposes, where the datasets are large and wide. Such datasets are usually understood either by finding the set of clusters they contain or by looking for the outliers, but these strategies conceal subtleties that are often ignored. In this book the author suggests new ways of thinking about high-dimensional spaces using two models: a skeleton that relates the clusters to one another; and boundaries in the empty space between clusters that provide new perspectives on outliers and on outlying regions.

The book will be of value to practitioners, graduate students and researchers.

Table of contents

Frontmatter

Chapter 1. Introduction

Abstract
Many organizations collect large amounts of data: businesses about their customers, governments about their citizens and visitors, scientists about physical systems, and economists about financial systems. Collecting such data is the easy part; extracting useful knowledge from it is often much harder.
David B. Skillicorn

Chapter 2. Basic Structure of High-Dimensional Spaces

Abstract
Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple, raises a number of challenging problems in practice.
David B. Skillicorn

Chapter 3. Algorithms

Abstract
This chapter describes many of the algorithms that play a role in constructing models of high-dimensional spaces. Those familiar with knowledge discovery may want to skip some or all of the chapter. However, several of the standard algorithms are used in slightly specialized ways when applied to high-dimensional datasets.
David B. Skillicorn

Chapter 4. Spaces with a Single Center

Abstract
The intuitive view of the natural space created from large data is that it somehow looks as shown in Fig. 4.1. In the center are the most common, normal, or typical records and they all resemble each other to some extent; outside this are records that are more scattered and less normal or typical, resembling each other less; and outside this are records that are much more scattered, much less frequent, and very untypical. The reason that this structure seems intuitively appealing is that, as records become inherently more unusual (further from the center), they also become less alike (because of the different directions).
David B. Skillicorn

Chapter 5. Spaces with Multiple Centers

Abstract
The assumption that, in a natural geometry, spaces derived from data have a single center with a single cluster is implicitly an assumption that there is one underlying process responsible for generating the data, and that the spatial variation around some notional center is caused by some variation overlaying this process. Often, perhaps most of the time, it is much more plausible that there are multiple, interacting processes generating the data, and so at least multiple clusters. Each of these clusters might have a notional center with some variation around it, but there is typically also some relationship among the clusters themselves. In other words, the skeleton for such data must describe both the clusters and the connections. The analysis is significantly more complex, but more revealing.
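One plausible reading of such a skeleton (a simplification, not the book's own construction) is a minimum spanning tree over the cluster centers, which records which clusters are most directly related. A small sketch using Prim's algorithm, with hypothetical example centers:

```python
import numpy as np

def skeleton_mst(centroids):
    """Prim's algorithm: minimum spanning tree over cluster centroids.
    Returns a list of (i, j) edges connecting the clusters."""
    n = len(centroids)
    # Pairwise Euclidean distances between centroids
    dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge from the tree to a centroid not yet in it
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (
                    best is None or dist[i, j] < dist[best[0], best[1]]
                ):
                    best = (i, j)
        edges.append(best)
        in_tree.add(best[1])
    return edges

# Three cluster centers: two close together, one far away
centers = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
print(skeleton_mst(centers))  # → [(0, 1), (1, 2)]
```

The tree makes the inter-cluster relationships explicit: the distant cluster attaches through its nearest neighbour rather than directly to every other cluster.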
David B. Skillicorn

Chapter 6. Representation by Graphs

Abstract
A geometric space has the advantage that the similarity between any pair of points is independent of the presence and placement of any other points, no matter what the particular measure of similarity might be. This is computationally attractive, which is why it has been the basis of everything discussed so far.
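By contrast, a graph representation makes each point's relationships depend on the other points present. A common construction (used here only as an illustration; the chapter's own constructions may differ) is the k-nearest-neighbour graph:

```python
import numpy as np

def knn_graph(points, k=2):
    """Adjacency sets of a k-nearest-neighbour graph: each point is
    linked to its k closest other points by Euclidean distance."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return {i: set(np.argsort(row)[:k]) for i, row in enumerate(d)}

pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
print(knn_graph(pts, k=1))
```

Unlike raw pairwise distances, the edge set changes if points are added or removed, which is exactly the context-dependence that makes graph representations more expressive but computationally less convenient.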
David B. Skillicorn

Chapter 7. Using Models of High-Dimensional Spaces

Abstract
We have shown how to think about high-dimensional spaces as multi-centered spaces, and we have introduced the algorithmic basis for constructing a skeleton for such a space. We have also seen how to divide up a space into qualitative regions that allow outliers and small clusters to be assessed and interpreted in terms of what their impact on existing models should be.
David B. Skillicorn

Chapter 8. Including Contextual Information

Abstract
In all of the previous chapters, we have assumed, perhaps implicitly, that understanding high-dimensional spaces was something that happened in isolation, and only once for each particular dataset. Nothing could be further from the truth. The process of exploring and understanding a dataset is always iterative, and the results of each round, and the deeper understanding that comes from it, inform the strategy and tactics of the next round.
David B. Skillicorn

Chapter 9. Conclusions

Abstract
Clustering is the process of understanding the structure implicit in a dataset, as a way of understanding more deeply the system that the data describes. This is an inherently messy process, because of the ambiguity of what is meant by “understanding”. It is also a complex process, because of the properties of significant real-world systems, and so the properties of the data about them.
David B. Skillicorn

Backmatter
