Elsevier

Knowledge-Based Systems

Volume 200, 20 July 2020, 106028
Knowledge-Based Systems

A robust density peaks clustering algorithm with density-sensitive similarity

https://doi.org/10.1016/j.knosys.2020.106028Get rights and content

Abstract

Density peaks clustering (DPC) algorithm is proposed to identify the cluster centers quickly by drawing a decision-graph without any prior knowledge. Meanwhile, DPC obtains arbitrary clusters with fewer parameters and no iteration. However, DPC has some shortcomings to be addressed before it is widely applied. Firstly, DPC is not suitable for manifold datasets because these datasets have multiple density peaks in one cluster. Secondly, the cut-off distance parameter has a great influence on the algorithm, especially on small-scale datasets. Thirdly, the method of decision-graph will cause uncertain cluster centers, which leads to wrong clustering. To address these issues, we propose a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS) to find accurate cluster centers on the manifold datasets. With density-sensitive similarity, the influence of the parameters on the clustering results is reduced. In addition, a novel density clustering index (DCI) instead of the decision-graph is designed to automatically determine the number of cluster centers. Extensive experimental results show that RDPC-DSS outperforms DPC and other state-of-the-art algorithms on the manifold datasets.

Introduction

Clustering is the most important method of unsupervised learning, which is a rich conceptual and algorithmic framework for data analysis and interpretation [1], [2], [3]. The general target of clustering is to minimize the dissimilarity between instances in the same cluster and maximize them in different clusters [4], [5], [6]. At present, clustering has been widely applied in data mining, pattern recognition, image segmentation, genetic disease detection, and so on [7], [8], [9]. However, the performance of the most well-known algorithms such as k-means, AP algorithms, SC algorithms, etc., will be limited when dealing with arbitrary shape datasets without any prior knowledge [10], [11], [12].

Recently, density peaks clustering (DPC) algorithm has attracted widely attention, which is able to detect non-spherical clusters without any prior knowledge [13]. DPC is conducted based on two assumptions: (1) cluster centers have higher local-density than their neighbors; (2) cluster centers are posited far from each other. Based on these two assumptions, DPC requires neither iterative process nor more input parameters to identify cluster centers by drawing a decision-graph and plays a prominent role in data mining, such as social network, genetic disease, biology and so on [14], [15], [16], [17]. However, DPC has some shortcomings to be solved before it is widely applied: (1) DPC is not satisfactory for multi-peaks manifold datasets; (2) The selection of input parameter dc is sensitive, especially on small-scale datasets; (3) The method of drawing the decision-graph will cause the cluster centers to be uncertain sometimes [18], [19]. All these shortcomings will lead to unsatisfactory clustering results when processing manifold datasets.

To address the shortcoming of multiple density peaks in DPC, many studies based on divide and conquer strategies have been proposed. Liang et al. [20] proposed a 3DC algorithm, which selects cluster centers recursively rather than selecting all the possible cluster centers. The algorithm is terminated according to the concept of density-reachable in DBSCAN. Xu et al. [21] proposed a FDPC algorithm, which utilizes DPC and SVM to split and merge clusters to obtain accurate clustering results. Zhang et al. [22] proposed an E_CFSFDP algorithm, which first initializes the clusters as more as possible based on DPC. After that, it brings out a variant of KNN graph to improve the CHAMELEON to merge the sub-clusters to form right clusters. Bie et al. [23] proposed a fuzzy-CFSFDP algorithm, which first finds all clusters and merges the extra clusters adaptively. Gao et al. [24] proposed an automatic algorithm called ICFS, which includes pre-clustering, merging and splitting phases. All these improved algorithms need to introduce redundant parameters to help merge sub-clusters. To deal with the parameter sensitivity, most improved algorithms introduce KNN to redefine the local-density. Du et al. [25] introduced a DPC-KNN algorithm, which uses the concept of KNN to redefine the local-density. DPC-KNN takes the global distribution of the dataset into account and is more feasible and effective than DPC. Xie et al. [26] proposed a FKNN-DPC algorithm, which improves DPC through a novel local-density calculation method along with a novel allocation strategy. FKNN-DPC calculates the local-density based on the KNN and is independent of the cutoff distance dc. The proposed algorithm is superior to DPC in identifying clusters of any shape and scale. Liu et al. [27] proposed a SNN-DPC algorithm based on SNN concept, which improves the performance of DPC on multi-scale, cross-winding and variable-density datasets. A new measure of local-density and relative-distance is proposed based on shared neighbors, which can be more objectively adapted to the local environment. All these improved algorithms delete parameter dc by redefining the KNN-based local-density. However, the parameter k which represents the number of nearest neighbors should be pre-specified. How to choose the suitable k need to be further researched. Moreover, the above algorithms do not consider global-consistency when faced with multiple density peaks and rely on manually identifying cluster centers.

In order to overcome the shortcomings of DPC, this paper proposed a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS). Density-sensitive similarity is defined firstly for Spectral Clustering, which considers the global-consistency to obtain desirable clusters on manifold datasets [28]. RDPC-DSS introduces the density-sensitive similarity instead of the conventional Euclidean distance to obtain the desirable clusters on the manifold datasets. In addition, local-density based on density-sensitive similarity is calculated to reduce the influence of parameters on the clustering results. As far as we know, Silhouette Coefficient is a clustering index used to reflect the compactness and separation between data instances [29]. However, the computational complexity of Silhouette Coefficient is high. In order to avoid the high computational complexity, we design a novel density cluster index (DCI) based on the DPC characteristics to determine the accurate number of cluster centers. To summarize, the major contributions of the paper are:

  • We propose a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS) for manifold datasets. RDPC-DSS introduces density-sensitive similarity to improve the effectiveness of DPC on multi-peaks manifold datasets and eliminate the sensitivity of cut-off distance parameter.

  • We design a novel density cluster index (DCI) based on the Silhouette Coefficient but the complexity of DCI is lower than the Silhouette Coefficient. The characteristics of DPC and the definition of clustering are jointly considered in DCI. RDPC-DSS based on DCI directly detects clusters automatically overcomes the decision-graphs issue of the traditional DPC algorithm.

  • Our experimental results show that the RDPC-DSS can obtain better clustering accuracy than DPC on several multi-peaks manifold datasets. Besides, with density-sensitive similarity, the RDPC-DSS based on DCI can outperform several state-of-the-art analogous algorithms in terms of automatically obtaining optimal clustering.

The rest chapters are organized as follows: main principles of DPC, density-sensitive similarity and Silhouette Coefficient are introduced in Section 2. A detailed description of RDPC-DSS is make in Section 3. Section 4 designs experiments to illustrate the effectiveness of RDPC-DSS and conducts comparison of other state-of-art algorithms on several varied datasets. Finally, concluding remarks and further challenges are summarized.

Section snippets

Related works

This section simply introduces the main ideas of DPC algorithm, density-sensitive similarity and Silhouette Coefficient.

RDPC-DSS algorithm

To improve DPC, we propose a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS). Firstly, density-sensitive similarity is taken instead of Euclidean distance to deal with manifold datasets. In addition, it will eliminate the effect of dc on the algorithm results. Secondly, a novel density clustering index (DCI) is designed to obtain satisfactory cluster centers automatically according to the characteristics of DPC and the definition of clustering.

Experiments and results

We illustrate the effectiveness of the proposed RDPC-DSS algorithm on six synthetic manifold datasets and seven real-world datasets. It includes three famous image datasets: COIL dataset [45], MNIST dataset [46] and Olivetti face dataset [47]. The description of these datasets is shown in Table 1, Table 2. The number of clusters in these datasets is various. DPC and D-SC, FKNN-DPC, SNN-DPC are applied to compare the performance with RDPC-DSS. D-SC algorithm is the Spectral Clustering with

Conclusions and future works

In this paper, we proposed a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS) to improve the ability of DPC to process manifold datasets. Density-sensitive similarity measures the similarity instead of Euclidean distance firstly, which considers the global-consistency of the data structure. Moreover, the calculation of local-density and relative-distance based on the density-sensitive similarity will eliminate the sensitivity of parameter dc. In addition, a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by Outstanding Innovation Scholarship for Doctoral Candidate of CUMT (2019YCBS055).

References (49)

  • WangH. et al.

    A study of graph-based system for multi-view clustering

    Knowl.-Based Syst.

    (2019)
  • ZhangY.Y. et al.

    A multitask multiview clustering algorithm in heterogeneous situations based on LLE and LE

    Knowl.-Based Syst.

    (2019)
  • VinhN. et al.

    Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance

    J. Mach. Learn. Res.

    (2010)
  • IencoD. et al.

    Fuzzy extensions of the dbscan clustering algorithm

    Soft Comput.

    (2018)
  • DengT. et al.

    Low-rank local tangent space embedding for subspace clustering

    Inform. Sci.

    (2020)
  • LaiT. et al.

    Efficient robust model fitting for multistructure data using global greedy search

    IEEE Trans. Cybern.

    (2019)
  • ShiY. et al.

    Face clustering: representation and pairwise constraints

    IEEE Trans. Inf. Forensics Secur.

    (2018)
  • IvannikovaE. et al.

    Revealing community structures by ensemble clustering using group diffusion

    Inf. Fusion

    (2018)
  • ZengX. et al.

    Color perception algorithm of medical images using density peak based hierarchical clustering

    Biomed. Signal Process. Control

    (2019)
  • JacquesJ. et al.

    Model-based co-clustering for functional data

    Neurocomputing

    (2018)
  • FreyB. et al.

    Clustering by passing messages between data points

    Science

    (2007)
  • ZhaoW. et al.

    K-means: a revisit

    Neurocomputing

    (2018)
  • RodriguezA. et al.

    Clustering by fast search and find of density peaks

    Science

    (2014)
  • MasudM. et al.

    I-nice: A new approach for identifying the number of clusters and initial cluster centres

    Inform. Sci.

    (2018)
  • WangM. et al.

    An improved density peaks-based clustering method for social circle discovery in social networks

    Neurocomputing

    (2016)
  • DingS. et al.

    An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood

    Knowl.-Based Syst.

    (2017)
  • MengH. et al.

    Indoor positioning of RBF neural network based on improved fast clustering algorithm combined with LM algorithm

    IEEE Access

    (2019)
  • LiX. et al.

    Evolutionary multiobjective clustering and its applications to patient stratification

    IEEE Trans. Cybern.

    (2019)
  • XuX. et al.

    DPCG: an efficient density peaks clustering algorithm based on grid

    Int. J. Mach. Learn. Cybern.

    (2018)
  • LiangZ. et al.

    Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering

    Pattern Recognit. Lett.

    (2016)
  • XuX. et al.

    A feasible density peaks clustering algorithm with a merging strategy

    Soft Comput.

    (2018)
  • ZhangW. et al.

    Extended fast search clustering algorithm: Widely density clusters, no density peaks

    Comput. Sci. Inf. Technol.

    (2015)
  • BieR. et al.

    Adaptive fuzzy clustering by fast search and find of density peaks

    Pers. Ubiquitous Comput.

    (2016)
  • GaoJ. et al.

    Icfs: An improved fast search and find of density peaks clustering algorithm

  • Cited by (56)

    View all citing articles on Scopus
    View full text