Advances in K-means Clustering

A Data Mining Thinking

Author: Junjie Wu

Publisher: Springer Berlin Heidelberg

Book series: Springer Theses

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Nearly everyone in the fields of data mining and business intelligence knows the K-means algorithm. But ever-emerging data with extremely complicated characteristics bring new challenges to this "old" algorithm. This book addresses these challenges and makes novel contributions: it establishes theoretical frameworks for K-means distances and K-means based consensus clustering, identifies the "dangerous" uniform effect and zero-value dilemma of K-means, adapts the right measures for cluster validity, and integrates K-means with SVMs for rare class analysis. The book not only enriches clustering and optimization theory, but also provides good guidance for the practical use of K-means, especially for important tasks such as network intrusion detection and credit fraud prediction. The thesis on which this book is based won the "2010 National Excellent Doctoral Dissertation Award", the highest honor in China, granted to no more than 100 PhD theses per year.

Table of Contents

Frontmatter

Chapter 1. Cluster Analysis and K-means Clustering: An Introduction

Abstract

The term “data mining” was coined in the late 1980s to describe the activity of extracting interesting patterns from data. Since then, data mining and knowledge discovery has become one of the hottest topics in both academia and industry. It provides valuable business and scientific intelligence hidden in large amounts of historical data.

Junjie Wu

Chapter 2. The Uniform Effect of K-means Clustering

Abstract

This chapter studies the uniform effect of K-means clustering. As a well-known and widely used partitional clustering method, K-means has attracted great research interest for a very long time. Researchers have identified some data characteristics that may strongly impact the performance of K-means clustering, including the size of the data, the sparseness of the data, noise and outliers in the data, types of attributes and data sets, and scales of attributes.

Junjie Wu

Chapter 3. Generalizing Distance Functions for Fuzzy c-Means Clustering

Abstract

Fuzzy \(c\)-means (FCM) is a well-known partitional clustering method that allows an object to belong to two or more clusters, with a membership grade between zero and one for each. Recently, owing to the rich information conveyed by the membership grade matrix, FCM has been widely used in many real-world application domains where well-separated clusters are typically not available. In addition, the simple centroid-based iterative procedure of FCM is widely recognized as very appealing when dealing with large volumes of data.
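To make the idea of membership grades concrete, here is a minimal sketch of the textbook FCM membership update, assuming squared Euclidean distance and a fuzzifier of 2. The function names and the toy centroids are illustrative assumptions, not taken from the book, whose chapter concerns more general distance functions.

```python
def squared_euclidean(x, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def fcm_memberships(x, centroids, fuzzifier=2.0):
    """Return the membership grade of point x in every cluster.

    Standard FCM update: grades are proportional to
    (1 / squared distance) ** (1 / (fuzzifier - 1)) and sum to one.
    The fuzzifier value is an assumption for this sketch.
    """
    d2 = [squared_euclidean(x, c) for c in centroids]
    # A point that coincides with a centroid belongs fully to it
    # (shared equally if several centroids coincide with the point).
    if any(d == 0.0 for d in d2):
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    exponent = 1.0 / (fuzzifier - 1.0)
    inverse = [(1.0 / d) ** exponent for d in d2]
    total = sum(inverse)
    return [v / total for v in inverse]


# Hypothetical toy data: two centroids on a line, one point between them.
centroids = [[0.0, 0.0], [4.0, 0.0]]
u = fcm_memberships([1.0, 0.0], centroids)
print(u)
```

The point lies three times closer to the first centroid, so it receives most, but not all, of its membership from the first cluster; this partial assignment is what distinguishes FCM from hard K-means.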

Junjie Wu

Chapter 4. Information-Theoretic K-means for Text Clustering

Abstract

Information-theoretic clustering aims to exploit information-theoretic measures as the clustering criteria. A common practice in this area is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While research efforts devoted to Info-Kmeans have shown promising results, a remaining challenge is to deal with high-dimensional sparse data such as text corpora. Indeed, the centroids may contain many zero-value features for high-dimensional text vectors, which leads to infinite KL-divergence values and creates a dilemma in assigning objects to centroids during the iteration process of Info-Kmeans. To meet this challenge, we propose a Summation-based Incremental Learning (SAIL) algorithm for Info-Kmeans clustering in this chapter. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence with the incremental computation of the Shannon entropy, which successfully avoids the zero-value dilemma. To improve the clustering quality, we further introduce the Variable Neighborhood Search (VNS) meta-heuristic and propose the V-SAIL algorithm, which is then accelerated by a multithreading scheme in PV-SAIL. Experimental results on various real-world text collections show that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. V-SAIL and PV-SAIL also help to improve the clustering quality at a low computational cost.
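The zero-value dilemma and the entropy-based reformulation can be illustrated with a small sketch. It assumes equally weighted, normalized term-frequency vectors and uses a standard identity (the within-cluster KL sum equals the cluster size times the centroid entropy, minus the sum of member entropies). The book's exact objective and the incremental update used by SAIL may differ in detail; the toy vectors and function names are hypothetical.

```python
import math


def kl_divergence(x, m):
    """KL(x || m) for two discrete distributions given as lists."""
    total = 0.0
    for xi, mi in zip(x, m):
        if xi == 0.0:
            continue  # 0 * log(0 / m) is taken as 0
        if mi == 0.0:
            return math.inf  # x has mass where the centroid has none
        total += xi * math.log(xi / mi)
    return total


def entropy(p):
    """Shannon entropy (natural log); zero entries contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def centroid(points):
    """Arithmetic mean of equally weighted distributions."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


# Hypothetical sparse "documents" over a four-term vocabulary.
cluster = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.75, 0.0, 0.0]]
outsider = [0.0, 0.5, 0.5, 0.0]
m = centroid(cluster)

# The dilemma: the centroid has a zero where the outsider does not,
# so the distance needed for the assignment step is infinite.
dist_to_other = kl_divergence(outsider, m)

# The same within-cluster objective, written with entropies only.
kl_sum = sum(kl_divergence(x, m) for x in cluster)
entropy_form = len(cluster) * entropy(m) - sum(entropy(x) for x in cluster)
print(dist_to_other, kl_sum, entropy_form)
```

Because the entropy form never divides by a centroid component, evaluating a candidate move of an object between clusters requires only the updated cluster sums, which is the kind of computation that stays finite on sparse data.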

Junjie Wu

Chapter 5. Selecting External Validation Measures for K-means Clustering

Abstract

Cluster validity is a long-standing challenge in the clustering literature. While many evaluation measures have been developed for cluster validity, these measures often provide inconsistent information about clustering performance, and it remains unclear in practice which measures are most suitable. Our study in this chapter fills this crucial void by giving an organized study of sixteen external validation measures for K-means clustering. Specifically, we first propose a filtering criterion based on the uniform effect of K-means, and apply it to identify defective measures.

Junjie Wu

Chapter 6. K-means Based Local Decomposition for Rare Class Analysis

Abstract

Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, rare class analysis remains a critical challenge, because no natural way has been developed for handling imbalanced class distributions. This chapter fills this crucial void by developing a method for Classification using lOcal clusterinG (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes.
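The local decomposition step described above can be sketched as follows. This is only an illustration under stated assumptions: the `size_threshold` and `k_per_large_class` parameters, the deterministic initialization, and the toy data are hypothetical, and the classifier that would be trained on the resulting sub-class labels is omitted.

```python
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain Lloyd-style K-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def local_decomposition(points, labels, k_per_large_class, size_threshold):
    """Split every class with at least size_threshold points into sub-classes."""
    new_labels = list(labels)
    for cls, size in Counter(labels).items():
        if size < size_threshold:
            continue  # rare classes are left untouched
        idx = [i for i, y in enumerate(labels) if y == cls]
        sub = kmeans([points[i] for i in idx], k_per_large_class)
        for i, s in zip(idx, sub):
            new_labels[i] = f"{cls}_{s}"
    return new_labels


# Hypothetical imbalanced data: six "normal" points in two groups, two "fraud".
points = [[0.0, 0.0], [10.0, 0.0], [0.5, 0.0], [10.5, 0.0],
          [1.0, 0.0], [11.0, 0.0], [5.0, 5.0], [5.5, 5.0]]
labels = ["normal"] * 6 + ["fraud"] * 2
new_labels = local_decomposition(points, labels, k_per_large_class=2, size_threshold=4)
print(Counter(new_labels))
```

After relabeling, the rare class is compared against sub-classes of similar size rather than against one dominant class; predictions for sub-classes would then be mapped back to their parent class.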

Junjie Wu

Chapter 7. K-means Based Consensus Clustering

Abstract

Consensus clustering, also known as cluster ensemble or clustering aggregation, aims to find a single clustering from multi-source basic clusterings on the same group of data objects. It has been widely recognized that consensus clustering has merits in generating better clusterings, finding bizarre clusters, handling noise, outliers and sample variations, and integrating solutions from multiple distributed sources of data or attributes.

Junjie Wu

Backmatter

Title: Advances in K-means Clustering
Author: Junjie Wu
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-29807-3
Print ISBN: 978-3-642-29806-6
DOI: https://doi.org/10.1007/978-3-642-29807-3