Skip to main content
Top

2018 | Book

Modern Algorithms of Cluster Analysis

insite
SEARCH

About this book

This book provides the reader with a basic understanding of the formal concepts of the cluster, clustering, partition, cluster analysis etc.

The book explains feature-based, graph-based and spectral clustering methods and discusses their formal similarities and differences. Understanding the related formal concepts is particularly vital in the epoch of Big Data; due to the volume and characteristics of the data, it is no longer feasible to predominantly rely on merely viewing the data when facing a clustering problem.

Usually clustering involves choosing similar objects and grouping them together. To facilitate the choice of similarity measures for complex and big data, various measures of object similarity, based on quantitative (like numerical measurement results) and qualitative features (like text), as well as combinations of the two, are described, as well as graph-based similarity measures for (hyper) linked objects and measures for multilayered graphs. Numerous variants demonstrating how such similarity measures can be exploited when defining clustering cost functions are also presented.

In addition, the book provides an overview of approaches to handling large collections of objects in a reasonable time. In particular, it addresses grid-based methods, sampling methods, parallelization via Map-Reduce, usage of tree-structures, random projections and various heuristic approaches, especially those used for community detection.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
This chapter characterises the scope of this book. It explains the reasons why one should be interested in cluster analysis, lists major application areas, basic theoretical and practical problems, and highlights open research issues and challenges faced by people applying clustering methods.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Chapter 2. Cluster Analysis
Abstract
This chapter outlines the major steps of cluster analysis. It starts with an informal introduction to clustering, its tools, methodology and applications. Then it proceeds with formalising the problem of data clustering. Diverse measures of object similarity, based on quantitative (like numerical measurement results) and on qualitative features (like text) as well as on their mixtures, are described. Various variants, how such similarity measures can be exploited when defining clustering cost functions are presented. Afterwards, major brands of clustering algorithms are explained, including hierarchical, partitional, relational, graph-based, density-based and kernel methods. The underlying data representations and interrelationships between various methodologies are discussed. Also possibilities of combining several algorithms for analysis of the same data (ensemble clustering) are presented. Finally, the issues related to easiness/hardness of the data clustering tasks are recalled.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Chapter 3. Algorithms of Combinatorial Cluster Analysis
Abstract
While Chap. 2 presented the broadness of the spectrum of clustering methods, this chapter focusses on clustering of objects that are embedded in some metric space and gives an insight into the potential variations of clustering algorithms driven by peculiarities of data at hand. It concentrates essentially around k-means clustering algorithm and its variants, which serve as an illustration on the kinds of issues the algorithms need to accommodate to. After presenting the basic variants of k-means, implementational issues are discussed, including initialisation and efficiency (e.g. k-means++). Then accommodations to special forms of data, e.g. text data (spherical k-means), or data with clusters separated by complicated boundary lines (kernel k-means), or ones residing in non-Euclidean space (k-medoids, k-modes) are presented. Subsequently non-sharp separation areas between clusters are discussed. First the EM algorithm is presented, then fuzzy k-means, with fuzzified variants of the aforementioned algorithms. k-means application presumes apriorical knowledge of the number of clusters. To overcome this limitation, so-called affinity propagation algorithm is presented. Finally approaches to handling data of non-typical relation between the number of objects and the number of dimensions, that is residing (predominantly) in a narrow section of the feature space are discussed (projective clustering, random projection, k q-flats, manifold clustering), or with too large data sets (subsampling), or clustering changing over time. The latter issue is closely connected to clustering under partial pre-labelling of objects. Also methods for simultaneous clustering of both objects and their features (co-clustering) are briefly outlined.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Chapter 4. Cluster Quality Versus Choice of Parameters
Abstract
This chapter is devoted to actions to be performed in order to get maximum insights into the data by application of clustering algorithms. For data preprocessing stage, methods for choosing the appropriate set of features and algorithms for selection of the proper number of clusters are presented. For post-processing of cluster analysis algorithm results, criteria evaluating the quality of the obtained clusters, both for the output of a single clustering algorithm and in case of applying multiple ones. Multiple internal and external quality measures are suggested.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Chapter 5. Spectral Clustering
Abstract
This chapter discusses clustering methods based on similarities between pairs of objects. Such a knowledge does not imply that the entire objects are embedded in a metric space. Instead, the local knowledge supports a graphical representation displaying relationships among the objects from a given data set. The problem of data clustering transforms then into the problem of graph partitioning, and this partitioning is acquired by analysing eigenvectors of the graph Laplacian, a basic tool used in spectral graph theory. We explain how various forms of graph Laplacian are used in various graph partitioning criteria, and how these translate into particular algorithms. There is a strong and fascinating relationship between graph Laplacian and random walk on a graph. Particularly, it allows to formulate a number of other clustering criteria, and to formulate another data clustering algorithms. We briefly review these problems. It should be noted that the eigenvectors deliver so-called spectral representation of data items. Unfortunately, this representation is fixed for a given data set, and adding or deleting some items destroys it. Thus we discuss recently invented methods of out of sample spectral clustering allowing to overcome this disadvantage. Although spectral methods are successful in extracting non-convex groups in data, the process of forming graph Laplacian is memory consuming and computing its eigenvectors is time consuming. Thus we discuss various local methods in which only relevant part of the graph are considered. Moreover, we mention a number of methods allowing fast and approximate computation of the eigenvectors.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Chapter 6. Community Discovery and Identification in Empirical Graphs
Abstract
This chapter discusses a specific, yet quite important application area of cluster analysis that is of community detection in large empirical graphs. The chapter introduces a number of alternative understandings of the term community, which lead to a diversity of community detection algorithms. Then it presents specific object similarity measures as well as community quality measures (modularity and its derivatives) which require special adaptation or creation of new clustering algorithms to address community detection. In particular, we give some insights into Newman, Louvain, FastGreedy algorithms developed specially for community detection as well as other, like kernel k-means or spectral methods, which were adopted for this purpose. Finally issues of overlapping community and multi-layered community detection are discussed.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Chapter 7. Data Sets
Abstract
In this chapter we present short characteristic of the data sets used throughout this book in illustrative examples.
Sławomir T. Wierzchoń, Mieczysław A. Kłopotek
Backmatter
Metadata
Title
Modern Algorithms of Cluster Analysis
Authors
Prof. Dr. Slawomir Wierzchoń
Mieczyslaw Kłopotek
Copyright Year
2018
Electronic ISBN
978-3-319-69308-8
Print ISBN
978-3-319-69307-1
DOI
https://doi.org/10.1007/978-3-319-69308-8

Premium Partner