Ensemble clustering using factor graph
Introduction
The ensemble clustering technique has been receiving increasing attention in recent years due to its ability to combine multiple clusterings into a probably better and more robust clustering [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Many ensemble clustering approaches have been developed in the past few decades [11]. Despite the great success, there are still three limitations to most of the existing ensemble clustering methods. First, they generally take the cluster number of the final clustering as input and lack the ability to automatically estimate the cluster number. Second, most of them treat each base clustering equally and overlook the different reliability of the base clusterings. Third, many of the existing methods work at the object-level and does not scale well for large ensembles. In this paper, we refer to the ensemble clustering approaches with the ability to automatically estimate the cluster number of the final clustering as the automatic approaches, and the approaches without the ability of automatic cluster number estimate as the non-automatic approaches. Recently some efforts have been made to address one or two of the three limitations [12], [13], [14]. However, to the best of our knowledge, none of the existing ensemble clustering approaches is capable of dealing with all of the three limitations in a unified model.
To overcome the aforementioned three limitations, in this paper, we propose a novel ensemble clustering approach termed ensemble clustering using factor graph (ECFG). We introduce the concept of super-object as a compact and adaptive representation for ensemble data. Instead of using the original data objects, we use the super-objects as the primitive objects, which greatly reduces the problem size and facilitates the overall process. We formulate the ensemble clustering problem into a probabilistic framework. The clustering results are represented by a set of binary decisions. Each binary decision indicates whether the corresponding two super-objects are in the same cluster or not. We assume that each base clustering is associated with a probability of making the correct decisions, which can be viewed as the reliability of a base clustering. The consensus clustering and the reliability of each base clustering are estimated iteratively by solving an instance of binary linear programming (BLP) problem. However, the BLP problem is NP-hard. The high computational complexity is the most significant hurdle for it. In this work, we present an efficient solver for this instance of BLP problem based on the factor graph technique [15]. The factor graph is a powerful tool for solving optimization problems and has many successful applications in the field of pattern recognition and machine learning [16], [17], [18], [19]. In our work, the constrained objective function is represented by a factor graph. Then the max-product belief propagation [15] is applied, which generates the solution insensitive to initialization and converged to the neighborhood maximum.
The main contributions of this paper are summarized as follows:
- 1.
We introduce the concept of super-object, which serves as a compact and adaptive representation for ensemble data and significantly facilitates the computation of the consensus process.
- 2.
We cast the ensemble clustering problem into a BLP problem and propose an efficient solver for the BLP problem based on factor graph.
- 3.
We propose a novel ensemble clustering approach termed ECFG, which has three advantages: (i) it can automatically estimate the cluster number of the final clustering, (ii) the reliability of each base clustering can be estimated and exploited. (iii) it is efficient w.r.t. both large data sizes and large ensemble sizes.
- 4.
Experimental results on multiple real-world datasets have shown that our approach significantly outperforms the state-of-the-art approaches in terms of both clustering accuracy and efficiency (see Section 4).
The remainder of this paper is organized as follows. We review the related work in Section 2. The proposed ensemble clustering approach termed ECFG is introduced in Section 3. The experimental results are reported in Section 4. We conclude this paper in Section 5.
Section snippets
Related work
The purpose of ensemble clustering is to combine multiple base clusterings into a more accurate and robust clustering. With regard to the difference of the input information of the ensemble clustering system, there are two formulations of the ensemble clustering problem. In the first formulation, the ensemble clustering system uses only the information of the multiple clusterings as input and has no access to the original data features [1], [5], [7], [20], [21]. In the other formulation, the
Ensemble clustering using factor graph
In this section, we introduce the proposed approach termed ensemble clustering using factor graph (ECFG). The formulation of the ensemble clustering problem is presented in Section 3.1. We introduce the concept of super-object in Section 3.2 and describe the overall algorithm of ECFG in Section 3.3. To solve the BLP problem in Section 3.3, we introduce a novel optimization method based on factor graph in Section 3.4.
Experiments
In this section, we conduct experiments on multiple real-world datasets to evaluate the proposed ECFG approach. We compare the proposed approach against seven ensemble clustering approaches, namely, COMUSA [27], DICLENS [13], COMUSACL [14], COMUSACL-DEW [14], EAC [1], WEAC [10], and WCT [3]. Among these baseline approaches, COMUSA, DICLENS, COMUSACL, and COMUSACL-DEW are automatic ensemble clustering approaches, while the other three are non-automatic approaches.
Conclusions
In this paper, we propose a novel ensemble clustering approach termed ECFG. Compared to the existing approaches, our approach is distinguished in three aspects: (i) the cluster number is automatically obtained; (ii) the reliability of each base clustering can be estimated and exploited in the ensemble clustering process; (iii) our approach is efficient in terms of both large data sizes and large ensemble sizes. In this paper, the super-object representation is introduced to facilitate the
Conflict of interest
None declared.
Acknowledgments
This work was supported by National Science & Technology Pillar Program (No. 2012BAK16B06), NSFC (61173084 & 61502543), the GuangZhou Program (Grant no. 201508010032), the CCF-Tencent Open Research Fund, and the PhD Start-up Fund of Natural Science Foundation of Guangdong Province, China (No. 2014A030310180).
Dong Huang received his B.S. degree in computer science in 2009 from South China University of Technology, China. He received his M.Sc. degree in computer science in 2011 and his Ph.D. degree in computer science in 2015, both from Sun Yat-sen University, China. He joined South China Agricultural University in 2015 as an assistant professor with College of Mathematics and Informatics. His research interests include data mining and pattern recognition.
References (32)
- et al.
Weighted partition consensus via kernels
Pattern Recognit.
(2010) - et al.
Hybrid cluster ensemble framework based on the random combination of data transformation operators
Pattern Recognit.
(2012) - et al.
Ensemble clustering by means of clustering embedding in vector spaces
Pattern Recognit.
(2014) - et al.
Hybrid clustering solution selection strategy
Pattern Recognit.
(2014) - et al.
Combining multiple clusterings via crowd agreement estimation and multi-granularity link analysis
Neurocomput.
(2015) - et al.
An efficient and scalable family of algorithms for combining clusterings
Eng. Appl. Artif. Intell.
(2013) - et al.
Fast affinity propagation clusteringa multilevel approach
Pattern Recognit.
(2012) - et al.
Clustering aggregation by probability accumulation
Pattern Recognit.
(2009) - et al.
Combining multiple clusterings using similarity graph
Pattern Recognit.
(2011) - et al.
Combining multiple clusterings using evidence accumulation
IEEE Trans. Pattern Anal. Mach. Intell.
(2005)
A link-based approach to the cluster ensemble problem
IEEE Trans. Pattern Anal. Mach. Intell.
A survey of clustering ensemble algorithms
Int. J. Pattern Recognit. Artif. Intell.
Cited by (131)
Parameter-free ensemble clustering with dynamic weighting mechanism
2024, Pattern RecognitionWeighted ensemble clustering with multivariate randomness and random walk strategy
2024, Applied Soft ComputingClustering approximation via a fusion of multiple random samples
2024, Information FusionBi-level ensemble method for unsupervised feature selection
2023, Information FusionProFeat: Unsupervised image clustering via progressive feature refinement
2022, Pattern Recognition LettersCitation Excerpt :In contrast, our ensemble-based approach does not rely on consensus labels, and is efficient and scalable. Our ensemble approach does not generate explicit consensus cluster labels in contrast to previous cluster ensemble approaches such as [10–12,14]. As a result, our algorithm requires only batch-level graph construction without partitioning unlike previous approaches, so is computationally highly efficient and scalable.
Clustering ensemble based on approximate accuracy of the equivalence granularity[Formula presented]
2022, Applied Soft Computing
Dong Huang received his B.S. degree in computer science in 2009 from South China University of Technology, China. He received his M.Sc. degree in computer science in 2011 and his Ph.D. degree in computer science in 2015, both from Sun Yat-sen University, China. He joined South China Agricultural University in 2015 as an assistant professor with College of Mathematics and Informatics. His research interests include data mining and pattern recognition.
Jianhuang Lai received his M.Sc. degree in applied mathematics in 1989 and his Ph.D. in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an assistant professor, where currently he is a professor with the Department of Automation of School of Information Science and Technology and dean of School of Information Science and Technology. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, wavelet and its applications. He has published over 100 scientific papers in the international journals and conferences on image processing and pattern recognition, e.g. IEEE TPAMI, IEEE TKDE, IEEE TNN, IEEE TIP, IEEE TSMC (Part B), Pattern Recognition, ICCV, CVPR and ICDM. Prof. Lai serves as a standing member of the Image and Graphics Association of China and also serves as a standing director of the Image and Graphics Association of Guangdong.
Chang-Dong Wang received the B.S. degree in applied mathematics in 2008, M.Sc. degree in computer science in 2010 and Ph.D. degree in computer science in 2013, all from Sun Yat-sen University, China. He is a visiting student at University of Illinois at Chicago from January 2012 to November 2012. He joined Sun Yat-sen University in 2013 as an assistant professor with School of Mobile Information Engineering. His current research interests include machine learning and data mining. He has published over 20 scientific papers in international journals and conferences such as IEEE TPAMI, IEEE TKDE, IEEE TSMC-C, Pattern Recognition, KAIS, Neurocomputing, ICDM and SDM. His ICDM 2010 paper won the Honorable Mention for Best Research Paper Awards.