Information Sciences

Volume 181, Issue 16, 15 August 2011, Pages 3397-3410
Minimum spanning tree based split-and-merge: A hierarchical clustering method

https://doi.org/10.1016/j.ins.2011.04.013

Abstract

Most clustering algorithms become ineffective when provided with unsuitable parameters or applied to datasets composed of clusters with diverse shapes, sizes, and densities. To alleviate these deficiencies, we propose a novel split-and-merge hierarchical clustering method in which a minimum spanning tree (MST) and an MST-based graph are employed to guide the splitting and merging process. In the splitting process, vertices with high degrees in the MST-based graph are selected as initial prototypes, and K-means is used to split the dataset. In the merging process, subgroup pairs are filtered so that only neighboring pairs are considered for merging. The proposed method requires no parameters other than the number of clusters. Experimental results demonstrate its effectiveness on both synthetic and real datasets.

Introduction

Clustering plays an important role in data mining, pattern recognition, and machine learning. It aims at grouping N data points into K clusters so that data points within the same cluster are similar, while data points in different clusters are dissimilar. From the machine learning point of view, clustering is unsupervised learning, as it classifies a dataset without any a priori knowledge. A large number of clustering algorithms [20], [37], [46] have been presented in the literature since K-means [32], one of the most popular clustering algorithms, was published. These algorithms can be grouped into hierarchical clustering [12], [15], [16], [19], [27], partitional clustering [2], [6], [17], [24], [32], [35], density-based clustering [10], [18], grid-based clustering [1], [43], model-based clustering [11], [33], and graph-based clustering [29], [38], [40], [45], [47], [49]. Hierarchical and partitional clustering are the two most common groups [30].

Generally, a hierarchical clustering algorithm partitions a dataset into various clusters by an agglomerative or a divisive approach based on a dendrogram. Agglomerative clustering starts by considering each point as a cluster and iteratively combines the two most similar clusters in terms of an objective function. In contrast, divisive clustering starts with a single cluster containing all data points; it iteratively selects a cluster and partitions it into two subclusters. The main advantage of hierarchical clustering is that it produces a nested tree of partitions and can therefore be more informative than non-hierarchical clustering. Furthermore, with the dendrogram, the optimal number of clusters can be determined. However, hierarchical clustering has a relatively high computational cost. Single linkage [39] and complete linkage [26] are two well-known examples of hierarchical clustering algorithms, and they take O(N² log N) time.
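
To make the agglomerative approach concrete, the following minimal sketch (assuming Python with NumPy and SciPy, which the paper itself does not prescribe) builds a single-linkage dendrogram on a toy dataset and cuts it into K clusters.

    # Minimal sketch: single-linkage agglomerative clustering with SciPy.
    # The toy data and parameter values are illustrative assumptions.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 2))   # toy dataset: N = 100 points in 2-D
    K = 3                               # desired number of clusters

    Z = linkage(X, method='single')     # (N-1) x 4 linkage matrix (the dendrogram)
    labels = fcluster(Z, t=K, criterion='maxclust')  # cut the dendrogram into K clusters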

Partitional clustering splits a dataset in a single stage using an objective function. K-means is one of the most popular examples of partitional clustering; it employs the mean squared error as its objective function. Its main advantage is efficiency: its computational complexity is O(NKId), where I is the number of iterations needed for convergence and d is the dimensionality of the dataset. Since K and d are usually far smaller than N, the algorithm runs in linear time on low-dimensional data. However, there is no universal objective function that can discover all the different intrinsic structures of datasets. Therefore, partitional clustering produces inaccurate results when the objective function does not capture the intrinsic structure of the data. This is why partitional clustering algorithms are incapable of dealing with clusters of arbitrary shapes, different sizes, and densities.
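
As an illustration of the complexity argument, the following NumPy sketch of Lloyd's K-means iteration performs one O(NKd) assignment step per pass, giving O(NKId) in total; the random initialization and the convergence test are illustrative choices, not those of the paper.

    # Minimal K-means (Lloyd's algorithm) sketch; each iteration costs O(NKd).
    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]  # random initial prototypes
        for _ in range(n_iter):                                  # at most I iterations
            # assignment step: nearest center for every point, O(NKd)
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = d2.argmin(axis=1)
            # update step: recompute each prototype as the mean of its points
            new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                    else centers[k] for k in range(K)])
            if np.allclose(new_centers, centers):                # converged
                break
            centers = new_centers
        return labels, centers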

Clustering algorithms that combine the advantages of hierarchical and partitional clustering have been proposed in the literature [5], [23], [25], [28], [30], [31]. Such hybrid algorithms analyze the dataset in two stages. In the first stage, the dataset is split into a number of subsets according to a partitioning criterion. In the second stage, the produced subsets are merged according to a similarity measure. Different split and merge approaches have been designed in several hybrid algorithms. CSM [30] first applies K-means to partition the dataset into K′ subsets, where K′ is an input parameter. Afterwards, single linkage, which uses a dedicated cohesion function as the similarity measure, is applied to iteratively merge the K′ subsets until K subsets are obtained. Since K-means may produce different partitions in different runs of the split stage, the final results can be unstable.
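
The generic split-then-merge pattern can be sketched roughly as follows, assuming SciPy: the dataset is over-partitioned by K-means into K′ subsets whose centroids are then merged by plain single linkage. CSM's dedicated cohesion function is not reproduced here, so this is only an outline of the idea.

    # Rough sketch of the generic two-stage (split, then merge) scheme.
    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.cluster.hierarchy import linkage, fcluster

    def split_and_merge(X, K, K_prime=20, seed=0):
        # split stage: over-partition the data into K' subsets with K-means
        centroids, sub_labels = kmeans2(X, K_prime, minit='++', seed=seed)
        # merge stage: single linkage on the K' centroids, merged down to K groups
        Z = linkage(centroids, method='single')
        group_of_subset = fcluster(Z, t=K, criterion='maxclust')
        return group_of_subset[sub_labels]   # final cluster label of each point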

CHAMELEON [23] is another example of a hybrid clustering algorithm. It constructs a K-nearest-neighbor graph and employs a graph-cut scheme to partition the graph into K′ subsets; relative inter-connectivity and relative closeness are then defined to merge the subsets. Liu et al. [31] proposed a multi-prototype clustering algorithm, which can also be considered a hybrid method. The method uses a convergence mechanism and repeatedly performs split and merge operations until the prototypes remain unchanged; however, many empirical parameters are involved. Kaukoranta et al. [25] proposed a split-and-merge algorithm whose objective function is to minimize the mean squared error.

A minimum spanning tree (MST) is a useful graph structure that has been employed to capture perceptual grouping [21]. Zahn defined several criteria of edge inconsistency for detecting clusters of different shapes [49]; however, for datasets consisting of differently shaped clusters, the method lacks an adaptive selection among the criteria. Xu et al. [47] proposed three MST-based algorithms: removing long MST edges, a center-based iterative algorithm, and a representative-based global optimal algorithm. However, for a specific dataset, users do not know in advance which of the three algorithms is suitable.
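
The simplest MST-based scheme, deleting the longest edges, can be sketched as below with SciPy's sparse-graph routines; the inconsistency criteria of Zahn [49] and the iterative variants of Xu et al. [47] are not reproduced, so this is only an illustration of the basic idea.

    # Sketch: build the MST of the complete Euclidean graph and cut its K-1 longest edges.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def mst_clustering(X, K):
        W = squareform(pdist(X))                                  # complete weighted graph
        mst = minimum_spanning_tree(csr_matrix(W)).tocoo()        # N-1 MST edges
        keep = np.ones(len(mst.data), dtype=bool)
        keep[np.argsort(mst.data)[::-1][:K - 1]] = False          # drop the K-1 longest edges
        forest = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                            shape=mst.shape)
        _, labels = connected_components(forest, directed=False)  # one component per cluster
        return labels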

In this paper, we propose a minimum spanning tree based split-and-merge method (SAM). It works on numerical data and assumes that the graph can be calculated in a vector space. In the splitting stage, three rounds of MSTs are computed and combined to construct a neighborhood graph called the 3-MST graph. The vertices with high degrees in this graph are selected as the initial prototypes, and K-means is then applied. In the merging stage, subset pairs are filtered so that only pairs that are neighbors with respect to the MST are considered for merging.
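
A simplified outline of the splitting stage is sketched below, assuming SciPy: three successive MSTs are computed (each after removing the previous MST's edges), their union approximates the 3-MST graph, and the highest-degree vertices serve as initial prototypes for K-means. The hair-pruning step and the exact prototype-selection rule are omitted, so this is an outline rather than the full SAM algorithm.

    # Simplified outline of the splitting stage (3-MST graph + high-degree prototypes).
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.cluster.vq import kmeans2

    def three_mst_degrees(X):
        remaining = squareform(pdist(X))          # complete weighted graph (dense)
        degree = np.zeros(len(X), dtype=int)
        for _ in range(3):                        # three rounds of MSTs
            mst = minimum_spanning_tree(csr_matrix(remaining)).tocoo()
            degree[mst.row] += 1                  # accumulate vertex degrees in the 3-MST graph
            degree[mst.col] += 1
            remaining[mst.row, mst.col] = 0       # remove used edges before the next round
            remaining[mst.col, mst.row] = 0
        return degree

    def split_stage(X, K_prime):
        degree = three_mst_degrees(X)
        prototypes = X[np.argsort(degree)[::-1][:K_prime]]   # K' highest-degree vertices
        _, labels = kmeans2(X, prototypes, minit='matrix')   # K-means seeded with them
        return labels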

The rest of the paper is organized as follows: In Section 2, the proposed algorithm is described. The experimental results are presented in Section 3, and conclusions are drawn in Section 4.

Section snippets

Problem formulation

Given a set of data points $X = \{x_1, x_2, \ldots, x_i, \ldots, x_N\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{id})^{\mathrm{T}} \in \mathbb{R}^d$ is a feature vector, the goal of clustering is to partition $X$ into $K$ clusters $C_1, C_2, \ldots, C_K$ such that $C_i \neq \emptyset$, $C_i \cap C_j = \emptyset$, and $X = C_1 \cup C_2 \cup \cdots \cup C_K$ for $i, j = 1, \ldots, K$, $i \neq j$. An undirected graph has been employed to represent a dataset for clustering [38], [49]. Let $G(X) = (V, E)$ denote the undirected complete graph of $X$, whose edge weights are given by a function $w: E \to \mathbb{R}$, where $V = X$, $E = \{(x_i, x_j) \mid x_i, x_j \in X, i \neq j\}$, and $w(x_i, x_j) = \lVert x_i - x_j \rVert$ is the Euclidean distance between $x_i$ and $x_j$.
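
For concreteness, the complete weighted graph $G(X)$ with Euclidean edge weights can be computed as in the short sketch below (NumPy/SciPy assumed; the toy data are illustrative).

    # Complete graph G(X) = (V, E) with w(x_i, x_j) = ||x_i - x_j||.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    X = np.random.default_rng(0).standard_normal((6, 2))  # toy dataset in R^2
    W = squareform(pdist(X, metric='euclidean'))           # W[i, j] = w(x_i, x_j); W[i, i] = 0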

Experimental results

The clustering performance of the proposed method is evaluated on six synthetic and four real datasets. The first four synthetic datasets DS1–DS4 are taken from the literature [22], [4], [14], [13], and the remaining two, DS5 and DS6, are from [42]. These datasets are illustrated in Fig. 5. The four real-world datasets are taken from the UCI repository [50]: IRIS, WINE, Breast Cancer Wisconsin (WBC), and Breast Cancer Wisconsin Diagnostic (WDBC). The descriptions of these datasets are shown in

Conclusion

The proposed method employs minimum spanning trees in different stages. Before the dataset is split into clusters, the hairs (leaf vertices together with their connecting edges) of the first MST, computed on the whole dataset, are pruned.
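
A minimal sketch of the hair-pruning step is given below, assuming SciPy: leaf vertices (degree 1) of the MST are identified and set aside together with their connecting edges; how the pruned points are later reassigned is omitted.

    # Sketch of hair pruning: remove degree-1 vertices (leaves) of the first MST.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree

    def prune_hairs(X):
        mst = minimum_spanning_tree(csr_matrix(squareform(pdist(X)))).tocoo()
        degree = np.bincount(np.concatenate([mst.row, mst.col]), minlength=len(X))
        hairs = np.flatnonzero(degree == 1)   # leaf vertices and their incident edges are pruned
        core = np.flatnonzero(degree != 1)    # points kept for the splitting stage
        return core, hairs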

In the splitting stage, more than the desired K clusters are created by K-means. Three MSTs on an iteratively refined graph are computed and combined to determine the initial prototypes for K-means, because randomly selected initial prototypes would lead to

Acknowledgments

The authors thank the anonymous reviewers for their constructive comments and suggestions which helped improve the quality of this paper. The work of C. Zhong was supported by Zhejiang Provincial Natural Science Foundation of China, No. Y1090851, and the Center for International Mobility (CIMO). The work of D. Miao was supported by the National Natural Science Foundation of China, No. 60970061, No. 61075056, and the Research Fund for the Doctoral Program of Higher Education, No. 20060247039.

References (50)

  • R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining...
  • J.C. Bezdek et al., Some new indexes of cluster validity, IEEE Trans. Syst. Man Cybern. B Cybern. (1998)
  • D. Cheng et al., A divide-and-merge methodology for clustering, ACM Trans. Database Syst. (2006)
  • Y. Chu et al., Density conscious subspace clustering for high-dimensional data, IEEE Trans. Knowl. Data Eng. (2010)
  • Y. Chu et al., Reducing redundancy in subspace clustering, IEEE Trans. Knowl. Data Eng. (2009)
  • T.H. Cormen et al., Introduction to Algorithms (2001)
  • M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial data...
  • M. Figueiredo et al., Unsupervised learning of finite mixture models, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • P. Fränti et al., Fast agglomerative clustering using a k-nearest neighbor graph, IEEE Trans. Pattern Anal. Mach. Intell. (2006)
  • L. Fu et al., FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics (2007)
  • A. Gionis et al., Clustering aggregation, ACM Trans. Knowl. Disc. Data (2007)
  • S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in: Proceedings of the 1998...
  • S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, in: Proceedings of the...
  • T. Hastie et al., The Elements of Statistical Learning, Data Mining, Inference and Prediction (2001)
  • A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Knowledge...