
2006 | Book

Grouping Multidimensional Data

Recent Advances in Clustering

Editors: Jacob Kogan, Charles Nicholas, Marc Teboulle

Publisher: Springer Berlin Heidelberg


About this book

Clustering is one of the most fundamental and essential data analysis techniques. Clustering can be used as an independent data mining task to discern intrinsic characteristics of data, or as a preprocessing step with the clustering results then used for classification, correlation analysis, or anomaly detection.

Kogan and his co-editors have put together recent advances in clustering large and high-dimensional data. Their volume addresses new topics and methods which are central to modern data analysis, with particular emphasis on linear algebra tools, optimization methods, and statistical techniques. The contributions, written by leading researchers from both academia and industry, cover theoretical basics as well as application and evaluation of algorithms, and thus provide an excellent state-of-the-art overview.

The level of detail, the breadth of coverage, and the comprehensive bibliography make this book a perfect fit for researchers and graduate students in data mining and in many other important related application areas.

Table of Contents

Frontmatter
The Star Clustering Algorithm for Information Organization
Summary
We present the star clustering algorithm for static and dynamic information organization. The offline star algorithm can be used for clustering static information systems, and the online star algorithm can be used for clustering dynamic information systems. These algorithms organize a data collection into a number of clusters that are naturally induced by the collection via a computationally efficient cover by dense subgraphs. We further show a lower bound on the accuracy of the clusters produced by these algorithms as well as demonstrate that these algorithms are computationally efficient. Finally, we discuss a number of applications of the star clustering algorithm and provide results from a number of experiments with the Text Retrieval Conference data.
J.A. Aslam, E. Pelekhov, D. Rus
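The summary above describes a greedy cover of a thresholded similarity graph by dense star-shaped subgraphs. A minimal Python sketch of the offline variant, assuming a precomputed pairwise-similarity matrix sim and a similarity threshold sigma (both names are illustrative, not taken from the chapter):

```python
import numpy as np

def offline_star(sim, sigma):
    """Offline star clustering sketch: cover the sigma-thresholded
    similarity graph with stars (a center plus its satellites)."""
    n = sim.shape[0]
    # Thresholded graph: edge (i, j) iff similarity >= sigma, i != j.
    adj = [set(np.nonzero(sim[i] >= sigma)[0]) - {i} for i in range(n)]
    covered = set()
    clusters = []
    while len(covered) < n:
        # Greedily pick the uncovered vertex with the most uncovered
        # neighbors as the next star center.
        center = max((v for v in range(n) if v not in covered),
                     key=lambda v: len(adj[v] - covered))
        satellites = adj[center] - covered
        clusters.append((center, satellites))
        covered |= {center} | satellites
    return clusters
```

The online variant discussed in the chapter maintains such a cover incrementally as documents are inserted or deleted, rather than recomputing it from scratch.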
A Survey of Clustering Data Mining Techniques
Summary
Clustering is the division of data into groups of similar objects. In clustering, some details are disregarded in exchange for data simplification. Clustering can be viewed as a data modeling technique that provides for concise summaries of the data. Clustering is therefore related to many disciplines and plays an important role in a broad range of applications. The applications of clustering usually deal with large datasets and data with many attributes. Exploration of such data is a subject of data mining. This survey concentrates on clustering algorithms from a data mining perspective.
P. Berkhin
Similarity-Based Text Clustering: A Comparative Study
Summary
Clustering of text documents enables unsupervised categorization and facilitates browsing and search. Any clustering method has to embed the objects to be clustered in a suitable representational space that provides a measure of (dis)similarity between any pair of objects. While several clustering methods and the associated similarity measures have been proposed in the past for text clustering, there is no systematic comparative study of the impact of similarity measures on the quality of document clusters, possibly because most popular cost criteria for evaluating cluster quality do not readily translate across qualitatively different measures. This chapter compares popular similarity measures (Euclidean, cosine, Pearson correlation, extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hypergraph partitioning, generalized k-means, weighted graph partitioning), on a variety of high-dimensional sparse vector data sets representing text documents as bags of words. Performance is measured based on mutual information with a human-imposed classification. Our key findings are that in the quasiorthogonal space of word frequencies: (i) Cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) Graph partitioning tends to be superior especially when balanced clusters are desired; (iv) Performance curves generally do not cross.
J. Ghosh, A. Strehl
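The similarity measures compared in this chapter have short closed forms. A sketch of three of them, assuming dense NumPy vectors (the standard definitions; the chapter evaluates them on sparse bag-of-words data):

```python
import numpy as np

def cosine(x, y):
    # Cosine similarity: angle-based, invariant to vector length,
    # which is why it suits documents of differing sizes.
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # Pearson correlation is cosine similarity of the centered vectors.
    return cosine(x - x.mean(), y - y.mean())

def ext_jaccard(x, y):
    # Extended (Tanimoto) Jaccard similarity for real-valued vectors.
    d = x @ y
    return d / (x @ x + y @ y - d)
```

Note the scale invariance of cosine (a vector and its double are maximally similar), one reason Euclidean distance behaves differently on word-frequency vectors.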
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning
Summary
We present a method to cluster data sets too large to fit in memory, based on a Low-Memory Factored Representation (LMFR). The LMFR represents the original data in a factored form with much less memory, while preserving the individuality of each of the original samples. The scalable clustering algorithm Principal Direction Divisive Partitioning (PDDP) can use the factored form in a natural way to obtain a clustering of the original dataset.
The resulting algorithm is the PieceMeal PDDP (PMPDDP) method. The scalability of PMPDDP is demonstrated with a complexity analysis and experimental results. A discussion on the practical use of this method by a casual user is provided.
D. Littau, D. Boley
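At the heart of PDDP is a single splitting step: center the data, project it onto the leading principal direction, and split on the sign of the projection. A minimal sketch of that step (the full algorithm applies it recursively, and PMPDDP feeds it the factored LMFR form rather than a dense matrix):

```python
import numpy as np

def pddp_split(X):
    """One PDDP-style split: project mean-centered rows of X onto the
    leading principal direction and partition on the sign."""
    centered = X - X.mean(axis=0)
    # The leading right singular vector of the centered data matrix
    # is the principal direction.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    return np.nonzero(proj <= 0)[0], np.nonzero(proj > 0)[0]
```

The sign of a singular vector is arbitrary, so which side is "left" or "right" can flip; only the induced partition matters.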
Clustering with Entropy-Like k-Means Algorithms
Summary
The aim of this chapter is to demonstrate that many results attributed to the classical k-means clustering algorithm with the squared Euclidean distance can be extended to many other distance-like functions. We focus on entropy-like distances based on Bregman [88] and Csiszar [119] divergences, which have previously been shown to be useful in various optimization and clustering contexts. Further, the chapter reviews various versions of the classical k-means and BIRCH clustering algorithms with squared Euclidean distance and considers modifications of these algorithms with the proposed families of distance-like functions. Numerical experiments with some of these modifications are reported.
M. Teboulle, P. Berkhin, I. Dhillon, Y. Guan, J. Kogan
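A useful property behind such extensions is that for any Bregman divergence the cluster mean remains the optimal center, so the k-means iteration keeps its familiar shape. A hedged sketch with the generalized Kullback-Leibler divergence on positive data (an illustrative instance, not the chapter's full family of entropy-like distances; the clipping constant is an implementation convenience):

```python
import numpy as np

def kl_divergence(x, c):
    # Generalized KL divergence, a Bregman divergence on the
    # nonnegative orthant: d(x, c) = sum x log(x/c) - x + c.
    x = np.clip(x, 1e-12, None)
    c = np.clip(c, 1e-12, None)
    return float(np.sum(x * np.log(x / c) - x + c))

def bregman_kmeans(X, k, n_iter=20, seed=0):
    """k-means iteration with a Bregman divergence in place of the
    squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.array([min(range(k),
                               key=lambda j: kl_divergence(x, centers[j]))
                           for x in X])
        # For any Bregman divergence the optimal center is the mean.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers
```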
Sampling Methods for Building Initial Partitions
Summary
The initialization of iterative clustering algorithms is a difficult yet important problem in the practice of data mining. In this chapter, we discuss two new approaches for building such initial partitions. The first approach applies a procedure for selecting appropriate samples in the spirit of the Cross-Entropy (CE) method, and the second is based on a sequential summarizing schema. In the first approach, we use a sequential sample clustering procedure instead of the simulation step of the CE method. In this context, we state several facts related to the Projection Pursuit methodology for exploring the structure of a high-dimensional data set. In addition we review several external and internal approaches for cluster validity testing. Experimental results for cluster initializations obtained via the CE method and the first of the presented methods are reported for a real data set.
Z. Volkovich, J. Kogan, C. Nicholas
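A much-simplified stand-in for the sampling idea above: cluster a small random sample, then seed the full partition from the sample's centers. This sketch uses plain k-means on the sample and is only illustrative of the general schema, not of the CE-based procedure itself:

```python
import numpy as np

def sample_init(X, k, sample_size, seed=0):
    """Build an initial partition by clustering a small random sample
    (a simplified stand-in for the chapter's sampling schemes)."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), sample_size, replace=False)]
    # A few plain k-means iterations on the sample only.
    centers = sample[rng.choice(sample_size, k, replace=False)]
    for _ in range(10):
        d = ((sample[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([sample[labels == j].mean(0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    # Initial partition of the full set: nearest sample-derived center.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1), centers
```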
TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections
Summary
A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that are presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdms from text collections and for the incremental modification of these tdms by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem-solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
D. Zeimpekis, E. Gallopoulos
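TMG itself is a MATLAB toolbox; the following Python sketch only illustrates its core operation, building a sparse term-by-document matrix under a term-weighting policy (here tf-idf, one of the policies the chapter explores). The dict-of-coordinates representation stands in for MATLAB's sparse matrix type:

```python
import math
from collections import Counter

def build_tdm(docs):
    """Build a term-document matrix with tf-idf weighting
    (term rows, document columns) in sparse coordinate form."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokens for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokens for t in set(toks))
    tdm = {}
    for j, toks in enumerate(tokens):
        for t, tf in Counter(toks).items():
            # tf-idf: term frequency scaled by inverse document frequency.
            tdm[(index[t], j)] = tf * math.log(n / df[t])
    return vocab, tdm
```

Stemming, stop-word removal, and the incremental updates (additions/deletions of documents) handled by TMG are omitted here.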
Criterion Functions for Clustering on High-Dimensional Data
Summary
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and high-quality document clustering algorithms play an important role toward this goal as they have been shown to provide both an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters as well as to greatly improve the retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of new and novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms that have relatively low computational requirements are those that treat the clustering problem as an optimization process, which seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
This chapter provides empirical and theoretical comparisons of the performance of a number of widely used criterion functions in the context of partitional clustering algorithms for high-dimensional datasets. The comparisons consist of a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis of the criterion functions shows that their relative performance depends on: (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
Y. Zhao, G. Karypis
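One widely used member of this family is the I2 criterion, which a partitional algorithm maximizes: the sum over documents of the cosine similarity to their cluster centroid. For unit-length document vectors this reduces to the sum of composite-vector norms. A minimal sketch of evaluating it on a given partition:

```python
import numpy as np

def i2_criterion(X, labels):
    """Evaluate the I2 clustering criterion: sum over clusters of the
    norm of the composite of the unit-normalized member vectors."""
    # Normalize rows to unit length (documents as unit vectors).
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    total = 0.0
    for r in np.unique(labels):
        # ||composite|| equals the sum of cosines to the cluster centroid.
        composite = Xn[labels == r].sum(axis=0)
        total += np.linalg.norm(composite)
    return total
```

A partition that keeps similar documents together scores higher than one that splits them, which is exactly what an optimization-based partitional algorithm exploits.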
Backmatter
Metadata
Title
Grouping Multidimensional Data
Editors
Jacob Kogan
Charles Nicholas
Marc Teboulle
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-28349-2
Print ISBN
978-3-540-28348-5
DOI
https://doi.org/10.1007/3-540-28349-8
