Neurocomputing

Volume 148, 19 January 2015, Pages 487-497

An information theoretic approach to hierarchical clustering combination

https://doi.org/10.1016/j.neucom.2014.07.014

Abstract

In hierarchical clustering, a set of patterns is partitioned into a sequence of nested groups represented as a dendrogram. The dendrogram is a tree in which each internal node corresponds to the merging of two (or more) clusters, so that each partition is nested into the next. This hierarchical representation has properties that are useful for the visualization and interpretation of clustering results. On the one hand, different hierarchical clustering algorithms usually produce different dendrograms; on the other hand, clustering combination methods have received considerable interest in recent years and yield superior results for clustering problems.

This paper proposes a framework for combining hierarchical clustering results that preserves the structural content of the input hierarchies. In this method, a description matrix is first created for each hierarchy; the description matrices of the input hierarchies are then aggregated to form a consensus matrix, from which the final hierarchy is derived. Within this framework, we use two new measures for aggregating the description matrices, namely the Rényi and Jensen–Shannon divergences. An experimental and comparative analysis of the proposed framework shows the effectiveness of these two aggregators in hierarchical clustering combination.

Introduction

Clustering is a process of forming groups (clusters) of similar patterns from a given set of inputs. A clustering algorithm seeks to group patterns such that patterns belonging to the same cluster are “similar” to each other, while patterns from different clusters are “dissimilar”. Clustering is used extensively as a fundamental data analysis tool in different fields such as data mining, image processing, machine learning, and bioinformatics [1], [2].

Many approaches to data clustering have been presented [3], [4]. The two main groups of clustering algorithms are hierarchical and nonhierarchical (partitional) algorithms. In partitional algorithms, the number of clusters, k, is usually assumed to be known in advance. The inputs to a partitional algorithm are the data, a distance metric, and the number of clusters, k; the output is a model of the data from which the memberships of patterns in the different clusters can be derived.

The popular k-means algorithm belongs to the family of partitional clustering algorithms [4]. It starts with a random set of centroids and assigns each pattern to its closest centroid. Then, repeatedly, a new central point (new centroid) is computed for each group from its current members, and patterns are reassigned to their closest centroids as necessary. The algorithm terminates when no pattern reassignments are needed or when a preset iteration or time budget is exhausted.
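To make this procedure concrete, the following minimal NumPy sketch implements the loop just described; the initialization scheme, the fixed iteration budget, and all names are our own illustrative choices rather than details taken from [4].

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: random initial centroids, then alternating
    assignment / centroid-update steps until no pattern is reassigned."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen input patterns as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Assign each pattern to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no reassignments needed: stop
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```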

Hierarchical clustering algorithms work either by merging the nearest clusters in a bottom-up fashion (agglomerative clustering) or by splitting clusters in a top-down fashion (divisive clustering) [5]. In Agglomerative Hierarchical Clustering (AHC), each pattern is first assigned to a cluster containing only that pattern; the two closest clusters are then merged into a new group, and this process continues until a single cluster containing all of the patterns is reached. In the top-down case, a cluster containing all patterns is created first and is then divided into two clusters according to the separation between patterns; this process continues until every final cluster contains only one pattern. The relationship between the input patterns and the output of a hierarchical clustering algorithm, which is a hierarchy of clusters, is well represented by a tree known as a dendrogram. Dendrograms offer a view of the data distribution at different levels of abstraction, which makes hierarchical clustering algorithms an ideal choice for data exploration and visualization. Furthermore, in applications where the number of clusters is not known in advance, the dendrogram provides a visual aid for the user to decide on the number of clusters.
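As an illustration of the agglomerative procedure and the resulting dendrogram, the following short sketch uses SciPy's hierarchical clustering routines; the toy data and the choice of single linkage are our own.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: 20 two-dimensional patterns in two loose groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2))])

# Agglomerative clustering: start from singleton clusters and repeatedly
# merge the two closest clusters ("single" = nearest-neighbor linkage).
Z = linkage(X, method="single")

# The dendrogram shows the full merge sequence, so the number of clusters
# can be chosen after inspecting the tree.
dendrogram(Z)
plt.show()
```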

It has been shown that the performance of data clustering is improved by combining the results of several clustering algorithms [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. These approaches are called ensemble methods. Ensemble methods generally consist of two steps. In the first step, multiple clustering results, collectively called the ensemble, are created. In the second step, these results are combined, using a consensus function called an aggregator [23], into a single integrated model of the input dataset.

Many clustering combination techniques have been introduced, with a comprehensive body of work on partitional clusterings [3], [4], [22] and only a few on hierarchical clusterings. These methods are discussed below. A categorization of the different clustering combination methods, according to the consensus function used, is presented in the following.

Many consensus functions, drawing on a variety of mathematical tools, have been introduced for partitional clustering ensembles [22], including information theory, fuzzy clustering, genetic algorithms, relabeling and voting, co-association matrices, graphs and hypergraphs, the Mirkin distance, finite mixture models, locally adaptive clustering, kernels, and non-negative matrix factorization [22].

Most clustering combination methods are based on partitional base clusterers, i.e., the inputs to the combination algorithm are nonhierarchical. To combine a set of hierarchical clusterings with these methods, the result of each clustering technique must first be converted to a nonhierarchical one. In this conversion, only one level of each hierarchy, which is itself a clustering of the data, is preserved, and the structural content of the hierarchy is lost. After this conversion, any combination method, such as stacked clustering [24], may be used to produce a consensus hierarchy. Such a method uses the information of only one level of the primary hierarchical clusterings, while useful information in the other levels, which might improve the quality of the combination, is ignored. The objective of this paper is to propose a new framework for hierarchical clustering combination that preserves the structural content of the input hierarchies.
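The information loss just described is easy to see in code: cutting a dendrogram at a single level yields one flat partition, and every other level of the merge structure is discarded. A small SciPy sketch (the toy data and linkage choice are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Z = linkage(X, method="average")   # the full hierarchy: 19 nested merges

# Converting to a partitional result keeps exactly one level of the tree;
# all coarser and finer merges encoded in Z are thrown away.
flat_2 = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
flat_5 = fcluster(Z, t=5, criterion="maxclust")   # a different cut

# A partitional consensus function would see only one such label vector,
# not the nested structure that relates flat_2 and flat_5.
```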

Some consensus functions have been introduced for hierarchical clustering ensembles, among them a mean-aggregator-based method [25], a fuzzy-similarity-relation-based method [26], [27], and a boosting-based method [28]. These methods are described further in Section 3.

In the area of supervised learning there exists a problem similar to hierarchical clustering combination, namely combining the Decision Trees (DTs) of an ensemble of classifiers. Different methods for combining decision trees have been proposed; boosting of DTs [29] and random forests [30] are two examples. In these methods, the outputs of the base trees are combined to produce the output for each new instance, so they do not construct a DT from the ensemble. The only exceptions, to the best of our knowledge, are the works [31], [32], [33], which use Fourier analysis to aggregate the trees of an ensemble into a single informative DT [31]. Nevertheless, all of the aforementioned DT combination methods use pattern labels and therefore cannot be used in an unsupervised scenario.

In this paper, we propose a hierarchical clustering combination (HCC) method in which the hierarchical clustering results are combined into one representative consensus clustering. In this method, two new aggregators, the Rényi and Jensen–Shannon divergences, are used for combining the description matrices. The rest of this paper is organized as follows. Section 2 introduces the HCC problem in its general framework, which comprises the two main steps of ensemble methods, i.e., creation and combination: the hierarchies are first created using clustering methods; the hierarchy produced by each clusterer is then converted to a description matrix, and these matrices serve as intermediate structures on which the combination is performed. Section 3 introduces different description matrices of a given dendrogram and then describes the method used to recover the final hierarchy from the consensus matrix. Section 4 presents the theory and a description of the consensus method proposed for aggregating the descriptor matrices. Section 5 describes the experimental setup: the evaluation method and datasets (Section 5.1), a behavior analysis of the Rényi aggregator within the HCC framework (Section 5.2), and how the set of experiments was performed (Section 5.3). Section 6 discusses the experimental results, and Section 7 compares the performance of the developed techniques with the state of the art. Finally, the main conclusions of this work are summarized in Section 8. Derivations of the formulas used in the aggregation step are presented in Appendices A and B.

Section snippets

Hierarchical clustering combination

The problem of hierarchical clustering combination may be stated as follows:

Dendrogram descriptor matrix and hierarchy recovering method

A dendrogram associated with a hierarchical clustering of N input patterns can be represented by a description matrix of size N×N. A description matrix expresses the relative position of a given pair of terminal vertices (i.e., a pair of patterns) in the dendrogram. If T^(k) = {t_ij^(k)} is the description matrix of the kth hierarchy H^(k), then t_ij^(k) represents how different the two patterns i and j are within the kth dendrogram. Various descriptors are presented to reveal different structural
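As a concrete example of one such descriptor, the cophenetic distance records, for every pair of patterns, the dendrogram level at which the pair is first merged; we assume this corresponds to the CD descriptor referred to in Section 7. A short SciPy sketch with toy data of our own:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))        # N = 8 patterns
Z = linkage(X, method="single")

# Cophenetic distance: the merge height at which patterns i and j first
# join the same cluster. Arranged as an N x N matrix, it is one possible
# description matrix T of the dendrogram.
T = squareform(cophenet(Z))        # condensed vector -> square N x N matrix
print(T.shape)                     # (8, 8); T[i, j] grows with dissimilarity
```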

Descriptor matrix aggregation

In clustering algorithms, it is generally desired that each pattern be grouped into the same cluster as its nearest neighbor. Note that row (column) i of a dendrogram description matrix shows the differences between pattern i and all other patterns. Given the pattern z_i, if we randomly select a pattern z_j, then the probability that z_j is not the nearest neighbor of z_i is obtained by

p_ij = t_ij / ∑_{j=1}^{N} t_ij,

where N is the number of patterns in the dendrogram and t_ij is the (i,j)th entry in the
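In matrix form, this probability is simply a row normalization of the description matrix; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def row_normalize(T):
    """p[i, j] = t_ij / sum_j t_ij: per the text, the probability that a
    randomly chosen pattern j is not the nearest neighbor of pattern i.
    Rows of T must have a positive sum (true whenever some off-diagonal
    entries are positive, as with the descriptors above)."""
    T = np.asarray(T, dtype=float)
    return T / T.sum(axis=1, keepdims=True)
```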

Experimental setup

Several experiments have been conducted to evaluate the proposed framework; this section describes how they were planned. Section 5.1 presents the evaluation method used for measuring clustering quality and introduces the datasets used in the experiments. Section 5.2 analyzes the behavior of the proposed aggregator within the HCC technique. Finally, Section 5.3 explains how the set of experiments was performed. The

Experimental results

Here, we perform a statistical analysis to find the most effective values of the independent variables Descriptor-type, Aggregator, Recovery-method, and Missing-value, i.e., the values that lead to a higher value of the dependent variable, CPCC. In this analysis the variables are called factors, their values are called the levels of the factors, and the CPCC is called the response variable.
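For reference, the CPCC is the cophenetic correlation coefficient: the correlation between the original pairwise dissimilarities and the cophenetic distances induced by a hierarchy. The following sketch shows one way to compute it with SciPy; the toy data are our own, and the paper's exact evaluation protocol may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

d = pdist(X)                        # original pairwise distances
Z = linkage(X, method="average")    # a hierarchy over the same patterns

# CPCC: correlation between the original distances and the cophenetic
# distances induced by the hierarchy (closer to 1 is better).
cpcc, coph_d = cophenet(Z, d)
print(f"CPCC = {cpcc:.3f}")
```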

First, we use the analysis of variance (ANOVA) to identify the significant factors and interaction

Comparison of HCC approaches

Here, we compare the information-theory-based aggregation approach with other HCC methods. We set the variables of the proposed approach to the optimal factor settings F1={0}, F2={Single}, F3={CD}, F4={α(−32)} and compare the results. It should be mentioned that the combination method proposed in [25] is a mean-aggregator-based method; it is therefore a special case of our information-theory-based aggregator with β=1 (see Table 1). The results are thus compared with MATCH and BobHic. The average
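The text states only that the mean aggregator of [25] is the β=1 special case of the information-theory-based aggregator. As an illustrative stand-in consistent with that statement, but otherwise an assumption on our part, the following sketch aggregates description matrices entrywise with a generalized (power) mean parameterized by β:

```python
import numpy as np

def aggregate(Ts, beta=1.0):
    """Hypothetical entrywise power-mean aggregator over K description
    matrices Ts of shape (K, N, N). beta = 1 gives the plain mean
    aggregator of [25]; other values of beta are our illustrative
    stand-in for the adjustable, divergence-derived operators."""
    eps = 1e-12
    Ts = np.asarray(Ts, dtype=float)
    if beta == 0.0:
        # Limit beta -> 0: entrywise geometric mean.
        return np.exp(np.log(np.maximum(Ts, eps)).mean(axis=0))
    if beta < 0:
        Ts = np.maximum(Ts, eps)   # avoid 0 ** negative on the diagonal
    return np.mean(Ts ** beta, axis=0) ** (1.0 / beta)
```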

Conclusion

In this paper, we presented a framework for the hierarchical clustering combination problem. In this framework, each input dendrogram is represented by a dendrogram description matrix; the description matrices of the base clusterers are then combined by various aggregators, where the aggregators are derived using information-theoretic methods. This framework offers a unified view of several existing algorithms and leads to a class of new operators, adjustable by a parameter. We presented the

References (44)

  • S. Theodoridis et al., Pattern Recognition (2003)
  • A. Weingessel, E. Dimitriadou, K. Hornik, An Ensemble Method for Clustering, in: Presented at the DSC Working Papers, ...
  • A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W.F. Punch, Adaptive Clustering Ensembles, in: Proceedings of the 17th ...
  • A. Fred et al., Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Mach. Intell. (2005)
  • A. Fred, Finding consistent clusters in data partitions, Multiple Classifier Syst. (2001)
  • J. Chang, D.M. Blei, Mixtures of Clusterings by Boosting, in: Learning Workshop, Hilton Clearwater, ...
  • D. Frossyniotis et al., A multi-clustering fusion algorithm, Methods Appl. Artif. Intell. (2002)
  • R. Ghaemi et al., A survey: clustering ensembles techniques, World Academy Sci., Eng. Technol. (2009)
  • A. Gionis et al., Clustering aggregation, ACM Trans. Knowl. Discov. Data (TKDD) (2007)
  • R. Maclin et al., Popular ensemble methods: an empirical study, J. Artif. Intell. Res. (2011)
  • B. Minaei-Bidgoli, A. Topchy, W.F. Punch, A comparison of resampling methods for clustering ensembles, in: Proceedings ...
  • B. Minaei-Bidgoli, A. Topchy, W.F. Punch, Ensembles of partitions via data resampling, in: Proceedings of the ...

Elaheh Rashedi received the Bachelor degree from University of Tehran, Tehran, Iran, in 2008 and the Master degree in electrical and computer engineering from the Isfahan University of Technology, Isfahan, Iran, in 2011. She is currently a Ph.D. student in the Computer Science department at Wayne State University. Her current research interests include pattern recognition, multiple classifier systems, data fusion and learning methods.

Abdolreza Mirzaei was born in Isfahan, Iran. He received the B.S. (first-class honors) degree in computer engineering from Isfahan University in 2001, the M.Sc. degree in artificial intelligence from Iran University of Science and Technology, Tehran, Iran, in 2003, and the Ph.D. degree in artificial intelligence from Amirkabir University of Technology, Tehran, in 2009. He is currently with the Department of Electrical and Computer Engineering, Isfahan University of Technology. His research interests include statistical and structural classification methods, digital image processing, computer vision, multiple classifier systems, and learning methods.

Mohammad Rahmati received his Ph.D. degree in electrical and computer engineering from the University of Kentucky, Lexington, KY, in 1993. He is currently an Associate Professor with the Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran. His research interests include pattern recognition, image and video processing, computer vision, and data mining. Dr. Rahmati is a member of the IEEE Signal Processing Society.
