A robust density peaks clustering algorithm with density-sensitive similarity
Introduction
Clustering is the most important method of unsupervised learning, which is a rich conceptual and algorithmic framework for data analysis and interpretation [1], [2], [3]. The general target of clustering is to minimize the dissimilarity between instances in the same cluster and maximize them in different clusters [4], [5], [6]. At present, clustering has been widely applied in data mining, pattern recognition, image segmentation, genetic disease detection, and so on [7], [8], [9]. However, the performance of the most well-known algorithms such as k-means, AP algorithms, SC algorithms, etc., will be limited when dealing with arbitrary shape datasets without any prior knowledge [10], [11], [12].
Recently, density peaks clustering (DPC) algorithm has attracted widely attention, which is able to detect non-spherical clusters without any prior knowledge [13]. DPC is conducted based on two assumptions: (1) cluster centers have higher local-density than their neighbors; (2) cluster centers are posited far from each other. Based on these two assumptions, DPC requires neither iterative process nor more input parameters to identify cluster centers by drawing a decision-graph and plays a prominent role in data mining, such as social network, genetic disease, biology and so on [14], [15], [16], [17]. However, DPC has some shortcomings to be solved before it is widely applied: (1) DPC is not satisfactory for multi-peaks manifold datasets; (2) The selection of input parameter is sensitive, especially on small-scale datasets; (3) The method of drawing the decision-graph will cause the cluster centers to be uncertain sometimes [18], [19]. All these shortcomings will lead to unsatisfactory clustering results when processing manifold datasets.
To address the shortcoming of multiple density peaks in DPC, many studies based on divide and conquer strategies have been proposed. Liang et al. [20] proposed a 3DC algorithm, which selects cluster centers recursively rather than selecting all the possible cluster centers. The algorithm is terminated according to the concept of density-reachable in DBSCAN. Xu et al. [21] proposed a FDPC algorithm, which utilizes DPC and SVM to split and merge clusters to obtain accurate clustering results. Zhang et al. [22] proposed an E_CFSFDP algorithm, which first initializes the clusters as more as possible based on DPC. After that, it brings out a variant of KNN graph to improve the CHAMELEON to merge the sub-clusters to form right clusters. Bie et al. [23] proposed a fuzzy-CFSFDP algorithm, which first finds all clusters and merges the extra clusters adaptively. Gao et al. [24] proposed an automatic algorithm called ICFS, which includes pre-clustering, merging and splitting phases. All these improved algorithms need to introduce redundant parameters to help merge sub-clusters. To deal with the parameter sensitivity, most improved algorithms introduce KNN to redefine the local-density. Du et al. [25] introduced a DPC-KNN algorithm, which uses the concept of KNN to redefine the local-density. DPC-KNN takes the global distribution of the dataset into account and is more feasible and effective than DPC. Xie et al. [26] proposed a FKNN-DPC algorithm, which improves DPC through a novel local-density calculation method along with a novel allocation strategy. FKNN-DPC calculates the local-density based on the KNN and is independent of the cutoff distance . The proposed algorithm is superior to DPC in identifying clusters of any shape and scale. Liu et al. [27] proposed a SNN-DPC algorithm based on SNN concept, which improves the performance of DPC on multi-scale, cross-winding and variable-density datasets. A new measure of local-density and relative-distance is proposed based on shared neighbors, which can be more objectively adapted to the local environment. All these improved algorithms delete parameter by redefining the KNN-based local-density. However, the parameter which represents the number of nearest neighbors should be pre-specified. How to choose the suitable need to be further researched. Moreover, the above algorithms do not consider global-consistency when faced with multiple density peaks and rely on manually identifying cluster centers.
In order to overcome the shortcomings of DPC, this paper proposed a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS). Density-sensitive similarity is defined firstly for Spectral Clustering, which considers the global-consistency to obtain desirable clusters on manifold datasets [28]. RDPC-DSS introduces the density-sensitive similarity instead of the conventional Euclidean distance to obtain the desirable clusters on the manifold datasets. In addition, local-density based on density-sensitive similarity is calculated to reduce the influence of parameters on the clustering results. As far as we know, Silhouette Coefficient is a clustering index used to reflect the compactness and separation between data instances [29]. However, the computational complexity of Silhouette Coefficient is high. In order to avoid the high computational complexity, we design a novel density cluster index (DCI) based on the DPC characteristics to determine the accurate number of cluster centers. To summarize, the major contributions of the paper are:
- •
We propose a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS) for manifold datasets. RDPC-DSS introduces density-sensitive similarity to improve the effectiveness of DPC on multi-peaks manifold datasets and eliminate the sensitivity of cut-off distance parameter.
- •
We design a novel density cluster index (DCI) based on the Silhouette Coefficient but the complexity of DCI is lower than the Silhouette Coefficient. The characteristics of DPC and the definition of clustering are jointly considered in DCI. RDPC-DSS based on DCI directly detects clusters automatically overcomes the decision-graphs issue of the traditional DPC algorithm.
- •
Our experimental results show that the RDPC-DSS can obtain better clustering accuracy than DPC on several multi-peaks manifold datasets. Besides, with density-sensitive similarity, the RDPC-DSS based on DCI can outperform several state-of-the-art analogous algorithms in terms of automatically obtaining optimal clustering.
The rest chapters are organized as follows: main principles of DPC, density-sensitive similarity and Silhouette Coefficient are introduced in Section 2. A detailed description of RDPC-DSS is make in Section 3. Section 4 designs experiments to illustrate the effectiveness of RDPC-DSS and conducts comparison of other state-of-art algorithms on several varied datasets. Finally, concluding remarks and further challenges are summarized.
Section snippets
Related works
This section simply introduces the main ideas of DPC algorithm, density-sensitive similarity and Silhouette Coefficient.
RDPC-DSS algorithm
To improve DPC, we propose a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS). Firstly, density-sensitive similarity is taken instead of Euclidean distance to deal with manifold datasets. In addition, it will eliminate the effect of on the algorithm results. Secondly, a novel density clustering index (DCI) is designed to obtain satisfactory cluster centers automatically according to the characteristics of DPC and the definition of clustering.
Experiments and results
We illustrate the effectiveness of the proposed RDPC-DSS algorithm on six synthetic manifold datasets and seven real-world datasets. It includes three famous image datasets: COIL dataset [45], MNIST dataset [46] and Olivetti face dataset [47]. The description of these datasets is shown in Table 1, Table 2. The number of clusters in these datasets is various. DPC and D-SC, FKNN-DPC, SNN-DPC are applied to compare the performance with RDPC-DSS. D-SC algorithm is the Spectral Clustering with
Conclusions and future works
In this paper, we proposed a robust density peaks clustering algorithm with density-sensitive similarity (RDPC-DSS) to improve the ability of DPC to process manifold datasets. Density-sensitive similarity measures the similarity instead of Euclidean distance firstly, which considers the global-consistency of the data structure. Moreover, the calculation of local-density and relative-distance based on the density-sensitive similarity will eliminate the sensitivity of parameter . In addition, a
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by Outstanding Innovation Scholarship for Doctoral Candidate of CUMT (2019YCBS055).
References (49)
- et al.
A study of graph-based system for multi-view clustering
Knowl.-Based Syst.
(2019) - et al.
A multitask multiview clustering algorithm in heterogeneous situations based on LLE and LE
Knowl.-Based Syst.
(2019) - et al.
Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance
J. Mach. Learn. Res.
(2010) - et al.
Fuzzy extensions of the dbscan clustering algorithm
Soft Comput.
(2018) - et al.
Low-rank local tangent space embedding for subspace clustering
Inform. Sci.
(2020) - et al.
Efficient robust model fitting for multistructure data using global greedy search
IEEE Trans. Cybern.
(2019) - et al.
Face clustering: representation and pairwise constraints
IEEE Trans. Inf. Forensics Secur.
(2018) - et al.
Revealing community structures by ensemble clustering using group diffusion
Inf. Fusion
(2018) - et al.
Color perception algorithm of medical images using density peak based hierarchical clustering
Biomed. Signal Process. Control
(2019) - et al.
Model-based co-clustering for functional data
Neurocomputing
(2018)
Clustering by passing messages between data points
Science
K-means: a revisit
Neurocomputing
Clustering by fast search and find of density peaks
Science
I-nice: A new approach for identifying the number of clusters and initial cluster centres
Inform. Sci.
An improved density peaks-based clustering method for social circle discovery in social networks
Neurocomputing
An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood
Knowl.-Based Syst.
Indoor positioning of RBF neural network based on improved fast clustering algorithm combined with LM algorithm
IEEE Access
Evolutionary multiobjective clustering and its applications to patient stratification
IEEE Trans. Cybern.
DPCG: an efficient density peaks clustering algorithm based on grid
Int. J. Mach. Learn. Cybern.
Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering
Pattern Recognit. Lett.
A feasible density peaks clustering algorithm with a merging strategy
Soft Comput.
Extended fast search clustering algorithm: Widely density clusters, no density peaks
Comput. Sci. Inf. Technol.
Adaptive fuzzy clustering by fast search and find of density peaks
Pers. Ubiquitous Comput.
Icfs: An improved fast search and find of density peaks clustering algorithm
Cited by (56)
SFKNN-DPC: Standard deviation weighted distance based density peak clustering algorithm
2024, Information SciencesAn overview on density peaks clustering
2023, NeurocomputingK-DGHC: A hierarchical clustering method based on K-dominance granularity
2023, Information SciencesAn improved density peaks clustering algorithm based on natural neighbor with a merging strategy
2023, Information SciencesA Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data
2023, Pattern RecognitionVDPC: Variational density peak clustering algorithm
2023, Information Sciences