Elsevier

Pattern Recognition

Volume 50, February 2016, Pages 131-142
Pattern Recognition

Ensemble clustering using factor graph

https://doi.org/10.1016/j.patcog.2015.08.015Get rights and content

Highlights

  • Introduce the super-object representation to facilitate the consensus process.

  • Probabilistically formulate the ensemble clustering problem into a BLP problem.

  • Propose an efficient solver for the BLP problem based on factor graph.

  • The cluster number of the consensus clustering is estimated automatically.

  • Our method achieves the state-of-the-art performance in effectiveness and efficiency.

Abstract

In this paper, we propose a new ensemble clustering approach termed ensemble clustering using factor graph (ECFG). Compared to the existing approaches, our approach has three main advantages: (1) the cluster number is obtained automatically and need not to be specified in advance; (2) the reliability of each base clustering can be estimated in an unsupervised manner and exploited in the consensus process; (3) our approach is efficient for processing ensembles with large data sizes and large ensemble sizes. In this paper, we introduce the concept of super-object, which serves as a compact and adaptive representation for the ensemble data and significantly facilitates the computation. Through the probabilistic formulation, we cast the ensemble clustering problem into a binary linear programming (BLP) problem. The BLP problem is NP-hard. To solve this optimization problem, we propose an efficient solver based on factor graph. The constrained objective function is represented as a factor graph and the max-product belief propagation is utilized to generate the solution insensitive to initialization and converged to the neighborhood maximum. Extensive experiments are conducted on multiple real-world datasets, which demonstrate the effectiveness and efficiency of our approach against the state-of-the-art approaches.

Introduction

The ensemble clustering technique has been receiving increasing attention in recent years due to its ability to combine multiple clusterings into a probably better and more robust clustering [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Many ensemble clustering approaches have been developed in the past few decades [11]. Despite the great success, there are still three limitations to most of the existing ensemble clustering methods. First, they generally take the cluster number of the final clustering as input and lack the ability to automatically estimate the cluster number. Second, most of them treat each base clustering equally and overlook the different reliability of the base clusterings. Third, many of the existing methods work at the object-level and does not scale well for large ensembles. In this paper, we refer to the ensemble clustering approaches with the ability to automatically estimate the cluster number of the final clustering as the automatic approaches, and the approaches without the ability of automatic cluster number estimate as the non-automatic approaches. Recently some efforts have been made to address one or two of the three limitations [12], [13], [14]. However, to the best of our knowledge, none of the existing ensemble clustering approaches is capable of dealing with all of the three limitations in a unified model.

To overcome the aforementioned three limitations, in this paper, we propose a novel ensemble clustering approach termed ensemble clustering using factor graph (ECFG). We introduce the concept of super-object as a compact and adaptive representation for ensemble data. Instead of using the original data objects, we use the super-objects as the primitive objects, which greatly reduces the problem size and facilitates the overall process. We formulate the ensemble clustering problem into a probabilistic framework. The clustering results are represented by a set of binary decisions. Each binary decision indicates whether the corresponding two super-objects are in the same cluster or not. We assume that each base clustering is associated with a probability of making the correct decisions, which can be viewed as the reliability of a base clustering. The consensus clustering and the reliability of each base clustering are estimated iteratively by solving an instance of binary linear programming (BLP) problem. However, the BLP problem is NP-hard. The high computational complexity is the most significant hurdle for it. In this work, we present an efficient solver for this instance of BLP problem based on the factor graph technique [15]. The factor graph is a powerful tool for solving optimization problems and has many successful applications in the field of pattern recognition and machine learning [16], [17], [18], [19]. In our work, the constrained objective function is represented by a factor graph. Then the max-product belief propagation [15] is applied, which generates the solution insensitive to initialization and converged to the neighborhood maximum.

The main contributions of this paper are summarized as follows:

  • 1.

    We introduce the concept of super-object, which serves as a compact and adaptive representation for ensemble data and significantly facilitates the computation of the consensus process.

  • 2.

    We cast the ensemble clustering problem into a BLP problem and propose an efficient solver for the BLP problem based on factor graph.

  • 3.

    We propose a novel ensemble clustering approach termed ECFG, which has three advantages: (i) it can automatically estimate the cluster number of the final clustering, (ii) the reliability of each base clustering can be estimated and exploited. (iii) it is efficient w.r.t. both large data sizes and large ensemble sizes.

  • 4.

    Experimental results on multiple real-world datasets have shown that our approach significantly outperforms the state-of-the-art approaches in terms of both clustering accuracy and efficiency (see Section 4).

The remainder of this paper is organized as follows. We review the related work in Section 2. The proposed ensemble clustering approach termed ECFG is introduced in Section 3. The experimental results are reported in Section 4. We conclude this paper in Section 5.

Section snippets

Related work

The purpose of ensemble clustering is to combine multiple base clusterings into a more accurate and robust clustering. With regard to the difference of the input information of the ensemble clustering system, there are two formulations of the ensemble clustering problem. In the first formulation, the ensemble clustering system uses only the information of the multiple clusterings as input and has no access to the original data features [1], [5], [7], [20], [21]. In the other formulation, the

Ensemble clustering using factor graph

In this section, we introduce the proposed approach termed ensemble clustering using factor graph (ECFG). The formulation of the ensemble clustering problem is presented in Section 3.1. We introduce the concept of super-object in Section 3.2 and describe the overall algorithm of ECFG in Section 3.3. To solve the BLP problem in Section 3.3, we introduce a novel optimization method based on factor graph in Section 3.4.

Experiments

In this section, we conduct experiments on multiple real-world datasets to evaluate the proposed ECFG approach. We compare the proposed approach against seven ensemble clustering approaches, namely, COMUSA [27], DICLENS [13], COMUSACL [14], COMUSACL-DEW [14], EAC [1], WEAC [10], and WCT [3]. Among these baseline approaches, COMUSA, DICLENS, COMUSACL, and COMUSACL-DEW are automatic ensemble clustering approaches, while the other three are non-automatic approaches.

Conclusions

In this paper, we propose a novel ensemble clustering approach termed ECFG. Compared to the existing approaches, our approach is distinguished in three aspects: (i) the cluster number is automatically obtained; (ii) the reliability of each base clustering can be estimated and exploited in the ensemble clustering process; (iii) our approach is efficient in terms of both large data sizes and large ensemble sizes. In this paper, the super-object representation is introduced to facilitate the

Conflict of interest

None declared.

Acknowledgments

This work was supported by National Science & Technology Pillar Program (No. 2012BAK16B06), NSFC (61173084 & 61502543), the GuangZhou Program (Grant no. 201508010032), the CCF-Tencent Open Research Fund, and the PhD Start-up Fund of Natural Science Foundation of Guangdong Province, China (No. 2014A030310180).

Dong Huang received his B.S. degree in computer science in 2009 from South China University of Technology, China. He received his M.Sc. degree in computer science in 2011 and his Ph.D. degree in computer science in 2015, both from Sun Yat-sen University, China. He joined South China Agricultural University in 2015 as an assistant professor with College of Mathematics and Informatics. His research interests include data mining and pattern recognition.

References (32)

  • N. Iam-On et al.

    A link-based approach to the cluster ensemble problem

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • J. Yi, T. Yang, R. Jin, A.K. Jain, Robust ensemble clustering by matrix completion, in: ICDM, 2012, pp....
  • D. Huang, J.-H. Lai, C.-D. Wang, Exploiting the wisdom of crowd: a multi-granularity approach to clustering ensemble,...
  • Y. Ren, C. Domeniconi, G. Zhang, G. Yu, Weighted-object ensemble clustering, in: ICDM, 2013, pp....
  • S. Vega-Pons et al.

    A survey of clustering ensemble algorithms

    Int. J. Pattern Recognit. Artif. Intell.

    (2011)
  • T. Li, C. Ding, Weighted consensus clustering, in: SDM, 2008, pp....
  • Cited by (131)

    • ProFeat: Unsupervised image clustering via progressive feature refinement

      2022, Pattern Recognition Letters
      Citation Excerpt :

      In contrast, our ensemble-based approach does not rely on consensus labels, and is efficient and scalable. Our ensemble approach does not generate explicit consensus cluster labels in contrast to previous cluster ensemble approaches such as [10–12,14]. As a result, our algorithm requires only batch-level graph construction without partitioning unlike previous approaches, so is computationally highly efficient and scalable.

    View all citing articles on Scopus

    Dong Huang received his B.S. degree in computer science in 2009 from South China University of Technology, China. He received his M.Sc. degree in computer science in 2011 and his Ph.D. degree in computer science in 2015, both from Sun Yat-sen University, China. He joined South China Agricultural University in 2015 as an assistant professor with College of Mathematics and Informatics. His research interests include data mining and pattern recognition.

    Jianhuang Lai received his M.Sc. degree in applied mathematics in 1989 and his Ph.D. in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an assistant professor, where currently he is a professor with the Department of Automation of School of Information Science and Technology and dean of School of Information Science and Technology. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, wavelet and its applications. He has published over 100 scientific papers in the international journals and conferences on image processing and pattern recognition, e.g. IEEE TPAMI, IEEE TKDE, IEEE TNN, IEEE TIP, IEEE TSMC (Part B), Pattern Recognition, ICCV, CVPR and ICDM. Prof. Lai serves as a standing member of the Image and Graphics Association of China and also serves as a standing director of the Image and Graphics Association of Guangdong.

    Chang-Dong Wang received the B.S. degree in applied mathematics in 2008, M.Sc. degree in computer science in 2010 and Ph.D. degree in computer science in 2013, all from Sun Yat-sen University, China. He is a visiting student at University of Illinois at Chicago from January 2012 to November 2012. He joined Sun Yat-sen University in 2013 as an assistant professor with School of Mobile Information Engineering. His current research interests include machine learning and data mining. He has published over 20 scientific papers in international journals and conferences such as IEEE TPAMI, IEEE TKDE, IEEE TSMC-C, Pattern Recognition, KAIS, Neurocomputing, ICDM and SDM. His ICDM 2010 paper won the Honorable Mention for Best Research Paper Awards.

    View full text