Published in: Complex & Intelligent Systems 1/2023

Open Access 24.06.2022 | Original Article

A novel density deviation multi-peaks automatic clustering algorithm

Authors: Wei Zhou, Limin Wang, Xuming Han, Milan Parmar, Mingyang Li

Abstract

The density peaks clustering (DPC) algorithm is a classical and widely used clustering method. However, the DPC algorithm requires manual selection of cluster centers, relies on a single way of calculating density, and cannot effectively handle low-density points. To address the above issues, we propose a novel density deviation multi-peaks automatic clustering method (AmDPC) in this paper. Firstly, we propose a new local density and use the deviation to measure the relationship between data points and the cut-off distance (\(d_c\)). Secondly, we divide the density deviation equally into multiple density levels and extract the points with higher distances in each density level. Finally, for the multi-peak points with higher distances at low-density levels, we merge them according to the size difference of the density deviation. We finally achieve overall automatic clustering by processing the low-density points. To verify the performance of the method, we test it on synthetic datasets, real-world datasets, and the Olivetti Face dataset, respectively. The simulation experiments indicate that the AmDPC method can handle low-density points more effectively and has certain effectiveness and robustness.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Databases have been widely used in information systems. The development of internet technology and artificial intelligence imposes higher standards on the collection and processing of information [1]. Machine learning is the data processing method closest to artificial intelligence systems [2]. Data mining technology is an invaluable means of discovering the hidden worth of data [3]. Cluster analysis is a key method of data mining, which can categorize data with unknown relationships according to the similarity between data, and then analyze the valuable information hidden in the unknown data [4]. In recent years, many scholars have proposed different types of clustering algorithms and solved many practical problems [5]. Cluster analysis techniques have been widely adopted in the fields of pattern recognition [6], stock prediction [7], bioengineering [8], and transportation [9], and have become an indispensable tool for mining the value of information in present society.
Density-based clustering includes the density peaks clustering (DPC) algorithm [10], the density distribution function clustering (DENCLUE) algorithm [11], the density spatial clustering (DBSCAN) algorithm, and so on [12]. The density peaks clustering (DPC) algorithm is the most typical clustering method of recent years. Compared with the k-means algorithm [13], the DPC method does not need to predetermine the number of clusters and can effectively identify clustering centers in various complex datasets. Similar to the DBSCAN algorithm, DPC is also good at identifying noise points. Moreover, the DPC algorithm does not need iteration to obtain the best clustering results. Relative to the affinity propagation (AP) algorithm [14], the DPC algorithm is also well suited for medium- to large-scale data clustering problems.
In recent years, the DPC algorithm has become a research hotspot for scholars at home and abroad. However, the algorithm requires the locations of clustering centers to be set manually, and the selection of clustering centers lacks an accurate scientific basis and is subjective [15, 16]. Ding et al. [17] proposed an automatic clustering algorithm that detects clustering centers by taking the lower value of a judgment metric after normalizing the density and distance ranges. The DPC algorithm assigns data points mainly based on the identified clustering centers, which is very effective for clustering high-density points. However, it struggles to handle data in low-density regions. Wang et al. [18] proposed a systematic clustering method based on low-density representative points, which can effectively identify clusters of different shapes and sizes. Jiang et al. [19] exploited the sensitivity of DBSCAN to noise points for the halo points of the DPC algorithm to make anomaly identification more accurate and efficient.
For data with different density distributions, Chen et al. [20] proposed a domain adaptive density clustering (DADC) algorithm. The DADC algorithm can effectively handle data with different structural characteristics through strategies such as domain-adaptive density measurement, automatic extraction of initial cluster centers, and merging of fragmented clusters. Du et al. [21] proposed a robust border-peeling clustering algorithm. This method used Cauchy kernel density estimation and a bidirectional association strategy to share neighborhood information and improve clustering performance on complex datasets with non-uniform distributions. Qian et al. [22] proposed an adaptive density variation clustering algorithm based on local density information and a nearest neighbor metric. This algorithm can handle clusters of arbitrary shape and size with varying density, and has some robustness to outliers. The above methods can all work with data with different density distributions, but they lack systematic description and analysis for non-convex datasets with low-density points. The DPC method usually utilizes a Gaussian kernel function to compute the local density. However, the Gaussian kernel function considers the entire data space when calculating the local density and may encounter difficulties in distinguishing overlapping clusters and dealing with low-density points [23–25]. Milan et al. [26] used residuals instead of the Gaussian kernel function for local density calculation to further explore the potential relationship between different data points.
To address the above issues, we present a novel density deviation multi-peaks auto-clustering method. We first utilize density deviation instead of the original local density calculation, so that the relationship between data points and the cut-off distance \(d_c\) can be examined more carefully. In the decision graph of density deviation, we take the rightmost area as the high-density level, the leftmost area as the low-density level, and the rest as the medium-density level. We select the points with higher distances in the different density levels as sub-centers, respectively. The multi-peak points in the high-density region are considered to be clustering centroids and can be clustered directly. The multi-peak points in the medium-density and low-density levels are called low-density clustering centroids. We judge whether they belong to one class according to the distance between the peaks of the low-density centroids and then obtain the final clustering result. To validate the usefulness of the method, we conduct a clustering performance analysis on synthetic and real-world datasets, respectively. The simulation experiments indicate that the AmDPC method has a better clustering effect compared with other state-of-the-art methods.
The rest of this paper is organized as follows. Section “Related work” presents the technical details of the DPC algorithm. Section “A novel density deviation multi-peaks automatic clustering algorithm” focuses on the key ideas and clustering details of the AmDPC algorithm. Section “Performance evaluation and experimental analysis” shows the experimental specifics and the comparative analysis results of the algorithm. Section “Conclusion” summarizes the contributions of the algorithm and gives future perspectives.

Related work

Rodriguez and Laio presented an efficient density peaks clustering method in Science in 2014. This method has only one parameter and can rapidly locate the clustering centers of data with arbitrary structure. The overall idea of the DPC algorithm is as follows: (i) a cluster center is a point of locally maximal density, surrounded by neighbors with lower local density that belong to the same cluster, and different cluster centers are far away from each other; (ii) there is an obvious separation between different clusters, and the cluster centers are relatively far apart from each other [27, 28]. Therefore, data points that are denser and more distant from other sample points of comparable or higher density can act as clustering centers.
$$\begin{aligned} d_c= S_{rnd(\frac{p}{100}*M)} \end{aligned}$$
(1)
In Eq. (1), we give the computation of the cut-off distance, where \(d_c\) represents the cut-off distance, S indicates the set of distances between data points sorted in ascending order, p signifies the user-selected percentage parameter that determines which member of the set S is selected as \(d_c\), rnd(o) returns the integer value closest to o, and M denotes the size of the set S.
Assume that for a dataset \(S=\left\{ x_1, x_2, ..., x_N\right\} \), N denotes the number of sampling points, and its local-density is defined as follows:
$$\begin{aligned} d(x_i,x_j)=\Vert x_i-x_j \Vert \end{aligned}$$
(2)
$$\begin{aligned} \rho _{i}=\sum \limits _{j}e^{-\Big (\frac{\Vert x_i-x_j\Vert }{d_{c}}\Big )^2} \end{aligned}$$
(3)
Here \(\Vert \cdot \Vert \) denotes the Euclidean distance between data points. In Eq. (3), we show the formula for the local density under the Gaussian kernel function. Based on experience, we generally choose 1%–2% of the maximum spacing among all sample points as the cut-off distance \(d_c\).
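As an illustration of Eqs. (1)–(3), the following is a minimal NumPy sketch (with our own, hypothetical function names, not taken from the paper) that computes the cut-off distance and the Gaussian-kernel local density for a small data array X:

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix d_ij of Eq. (2)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def cutoff_distance(D, p=2.0):
    """Cut-off distance d_c of Eq. (1): the entry at position rnd(p/100 * M)
    of the ascending list S of pairwise distances (upper triangle only)."""
    s = np.sort(D[np.triu_indices(len(D), k=1)])   # set S, ascending
    idx = int(round(p / 100.0 * len(s)))            # rnd(p/100 * M)
    return s[max(idx - 1, 0)]

def gaussian_local_density(D, dc):
    """Local density rho_i of Eq. (3); the j = i term adds a constant 1 to
    every point and does not change the density ranking."""
    return np.exp(-(D / dc) ** 2).sum(axis=1)
```

Here X would be an n × d array of samples, and the percentage p plays the same role as the parameter p discussed above.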
The DPC algorithm defines two important quantities: first, the local density of each sample, and second, the minimum distance from a data point to any sample with a greater local density.
The latter, denoted \(\delta _i\), is the distance from sample i to the nearest sample that is denser than it. The formula for \(\delta _i\) is given as follows:
$$\begin{aligned} \delta _i= \left\{ \begin{matrix} \min \limits _{j:\rho _j>\rho _i} d(x_i,x_j),&{} if\exists {j}\quad s.t.\quad \rho _j>\rho _i \\ \max \limits _{j} d(x_i,x_j),&{}otherwise \end{matrix}\right. \end{aligned}$$
(4)
The decision graph plots each sample point with the density \(\rho \) on the horizontal axis and the parameter \(\delta \) on the vertical axis (see Fig. 1a). The data points at the top right of the decision graph, which are obviously separated from most of the samples, are often selected as the best clustering centers; points in this position have both a high density and a large distance to any denser point. The strategy of selecting cluster centers through the decision graph is the core of the DPC method. Finally, according to the locations of the clustering centers, DPC assigns each remaining point to the cluster of its nearest neighbor with higher density to obtain the final clustering results [29, 30].
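Continuing the sketch above (still with our own, hypothetical naming), \(\delta _i\) of Eq. (4) and a simple decision-graph ranking can be computed as follows:

```python
import numpy as np

def min_distance_to_denser(D, rho):
    """delta_i of Eq. (4): distance to the nearest point with higher density;
    the globally densest point receives its maximum distance to any point."""
    n = len(rho)
    delta = np.empty(n)
    for i in range(n):
        denser = np.where(rho > rho[i])[0]
        delta[i] = D[i, denser].min() if denser.size else D[i].max()
    return delta

# One common way to read the decision graph programmatically is to rank the
# points by rho * delta and take the k largest as centers (k chosen by the user).
def top_k_centers(rho, delta, k):
    return np.argsort(rho * delta)[-k:]
```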
The DPC algorithm has very efficient clustering performance. The algorithm does not depend on the order of data input; the data can be shuffled before input without affecting the clustering result. The method of selecting clustering centers is relatively intuitive and stable on some datasets, and the clustering results are reliable within the applicable range. Thanks to the density ranking, density-based clustering is superior to distance-based clustering when dealing with non-spherical clusters. However, the algorithm also has certain shortcomings. The DPC method has a single way of calculating local density, which cannot effectively measure the relationship between data points. The selection of cluster centers lacks a solid theoretical basis, and the algorithm cannot handle data with low-density points.
Figure 1 presents the decision graph and the clustering result of the DPC method on the Pathbased dataset. In Fig. 1a, we give the true clustering results for the Pathbased dataset. Since the true clustering result of this dataset has three classes, we select the three points in the red rectangular box at the top right of the decision graph as the clustering centers. In Fig. 1c, we can see that the red points in the low-density region do not obtain the best clustering result. The above analysis shows that the DPC algorithm fails to handle the low-density regions of the dataset efficiently.

A novel density deviation multi-peaks automatic clustering algorithm

In this section, we put forward an automatic clustering algorithm named the density deviation multi-peaks automatic clustering algorithm (AmDPC). Firstly, we propose a density deviation theoretical framework that can further explore the density relationship between different data points. Then, we divide the density deviations of the data equally into multiple regions. For the different levels of density deviation, we select several peak points as sub-clustering centroids, respectively, and name these sub-clustering centroids multi-peak points. We set the size of the low-density threshold according to the decision graph to merge the low-density points and process individual high-density points.

Density deviation theory

The local-density calculation most frequently used in the DPC algorithm is the Gaussian kernel function. However, this method cannot measure the potential relationship and dispersion between the data, nor can it further explore the degree of deviation between data points within a certain range. Therefore, a more reasonable local-density framework is needed to express the relationship between the data. The deviation reflects the difference between the mean and the true value; it measures both the dispersion of the data distribution and the deviation of each unit's value in the statistical aggregate. Hence, in this paper, we introduce the deviation into the local density calculation. The deviation between the distance of two data points and the cut-off distance is defined as follows:
$$\begin{aligned} de=\frac{|d_{ij}-d_c |}{d_c} \end{aligned}$$
(5)
We use de to denote the degree of deviation between the cut-off distance and the distance between data points. In Eq. (5), \(d_c\) denotes the cut-off distance of the AmDPC algorithm (see Eq. (1)) and \(d_{ij}\) represents the distance between data points (see Eq. (2)). To capture the degree of deviation of data points from the cut-off distance within a known range, we introduce the number of neighboring data points \(N_i\) into Eq. (5), yielding Eq. (6).
$$\begin{aligned} de_i=\frac{|d_{ij}-d_c |}{d_{c}\times N_i} \end{aligned}$$
(6)
According to Eq. (6), we can better measure the relationship between data points when different numbers of neighboring data points \(N_i\) are selected. Analogous to the Gaussian kernel used to measure the local density of a sample, we define the deviation density as follows:
$$\begin{aligned} \rho _{d_i}=\sum \limits _{j}e^{-\Big (\frac{d_{ij}-d_c}{d_{c}\times N_i}\Big )^2} \end{aligned}$$
(7)
In Eq. (7), we give the concept of deviation density by accumulating over the neighboring data points. Compared to the local density, the deviation density provides a better measure of the potential relationship between the data points. To capture the relationship between the local density and the deviation density of the sample, we give the following definition.
Definition 1
(Density deviation) The absolute value of the difference between the overall deviation density and the local density in the statistical sample. The specific computation formula for density deviation is shown below:
$$\begin{aligned} d_{\rho _i}=\left|\sum \limits _{j}e^{-\Big (\frac{d_{ij}-d_c}{d_{c}\times N_i}\Big )^2}-\sum \limits _{j}e^{-\Big (\frac{\Vert x_i-x_j \Vert }{d_{c}}\Big )^2} \right|\end{aligned}$$
(8)
In this subsection, we focus on the concept of density deviation. Compared with local density, density deviation can describe the relationship between data points and cut-off distance in more detail. This calculation can capture the degree of density deviation between samples as a whole.
Compared with the local-density of the DPC method, the density deviation can better capture the potential relationship between different data points within a certain range. The density deviation can also further measure the density dispersion of the overall sample, which is of higher value for the measurement of density properties.
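The following sketch (again with our own, hypothetical function names) shows how the deviation density of Eq. (7) and the density deviation of Eq. (8) could be computed from the pairwise distance matrix D, the cut-off distance dc, and the neighbor-count parameter Ni:

```python
import numpy as np

def deviation_density(D, dc, Ni):
    """Deviation density of Eq. (7): a Gaussian kernel applied to the
    normalized deviation (d_ij - d_c) / (d_c * N_i)."""
    return np.exp(-((D - dc) / (dc * Ni)) ** 2).sum(axis=1)

def density_deviation(D, dc, Ni, rho_local):
    """Density deviation d_rho_i of Eq. (8): the absolute difference between
    the deviation density and the Gaussian-kernel local density rho_local."""
    return np.abs(deviation_density(D, dc, Ni) - rho_local)
```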

Low-density points acquisition

Based on the above analysis, we use density deviation instead of local density to measure the relationship among data points. Hence, we obtain the new clustering decision graph (see Fig. 2b). From Fig. 1 we can see that the DPC method cannot effectively handle data points in the low-density region, and the correct clustering results cannot be obtained by the initial data point assignment method alone. Therefore, we need to set up another way of assigning clustering centroids. In Fig. 1c, the cluster centroid in the low-density region labeled with blue number 1 is not processed efficiently. Considering the above, we propose an alternative method for finding clustering centers.
The DPC method selects clustering centers manually with a rectangular box, which introduces subjectivity, and it lacks a reasonable approach for clustering centroids in low-density regions. To observe low-density points more flexibly, we divide the density deviation range of the decision graph equally into multiple regions (Eq. (9)).
$$\begin{aligned} \hat{d_{\rho _i}}=\frac{d_{\rho max}-d_{\rho min}}{\lambda } \end{aligned}$$
(9)
In Eq. (9), to prevent the value of \(d_{\rho min}\) from being too small and affecting the selection of clustering centers in low-density regions, we uniformly set any \(d_{\rho min}\) smaller than 0.001 to 0.1. The \(\lambda \) indicates the number of regions to be split. Hence, we obtain the density intervals of the decision graph as shown in Eq. (10).
$$\begin{aligned} (d_{\rho min},\hat{d_{\rho _i}})\cup (\hat{d_{\rho _i}},2\times \hat{d_{\rho _i}})\cdots \cup ((\lambda -1)\times \hat{d_{\rho _i}},d_{\rho max})\qquad \end{aligned}$$
(10)
$$\begin{aligned} \delta _{C_{\lambda avg}}=mean(\delta _{\hat{d_{\rho _i}}}),~~~d_{\rho min}+(\lambda -1)\times \hat{d_{\rho _i}}<d_{\rho _i}<d_{\rho max} \end{aligned}$$
(11)
$$\begin{aligned} C_{\lambda }=\frac{\sum \limits _{\delta _{C_{\lambda avg}} \le \delta \le \delta _{C_{\lambda max}}} \delta }{Num_{_{C_{\lambda avg}}}}=mean(\delta _{C_{\lambda avg}} \le \delta \le \delta _{C_{\lambda max}}) \end{aligned}$$
(12)
In Fig. 2b, \(C_3\) is the region of the highest density deviation points. In this paper, we use points with both higher density deviation and higher distance as high-density clustering centers. In Eq. (11), \(\delta _{C_{\lambda avg}}\) denotes the average distance of the points in the region of highest density deviation. In Eq. (12), \(C_\lambda \) denotes the cluster centroid partition line within the maximum density deviation level; \(Num_{_{C_{\lambda avg}}}\) denotes the number of data points whose distance is greater than \(\delta _{C_{\lambda avg}}\) in the highest density deviation level; \(mean(\delta _{C_{\lambda avg}} \le \delta \le \delta _{C_{\lambda max}})\) denotes the mean distance of the data points between \(\delta _{C_{\lambda avg}}\) and \(\delta _{C_{\lambda max}}\) in the highest density deviation level. With the above equations, we can automatically find appropriate high-density-deviation clustering centers in data with different structures.
After numerous experiments, we found that when \(\lambda \) takes the value of 3, we can get the sub-clustering centroids of almost all data. So in this paper, we uniformly set \(\lambda \) to the constant 3. In Fig. 2b, the red line indicates that we divide the density deviation into three intervals equally in the decision graph. The vertical coordinates indicate the distance between different data points. To get more low-density points, we set peaks splitting lines in the distance coordinates of each of the three intervals. In Fig. 2b, \(C_1\), \(C_2\) and \(C_3\), are the peaks split lines of the three intervals. The points above the split line are collectively called multi-peak points (also called sub-clustering centroids). In Fig. 2c, we extract the multi-peak points separately to provide a reference for the processing of density levels.
We name the multi-peak points in the intervals \(C_1\), \(C_2\) and \(C_3\) as \(P_1\), \(P_2\) and \(P_3\), respectively. In the clustering process, different orderings of the multi-peak point intervals can produce different clustering results. The specific formula is shown below:
$$\begin{aligned} \left\{ \begin{matrix} P=\left\{ P_1, P_2,P_3 \right\} , &{} w=ascend\\ P=\left\{ P_3, P_2,P_1 \right\} ,&{} w=descend \end{matrix}\right. \end{aligned}$$
(13)
In Eq. (13), w together with ascend and descend is used to control the ordering of the sub-cluster centroids. As shown in Fig. 2c, we set two density level threshold partition lines A and B (\(A<B\)), respectively. We name the multi-peak points with density deviation in the range (\(d_{\rho min}\), A) low-density points, those in the range (A, B) medium-density points, and those in (B, \(d_{\rho max}\)) high-density points. In summary, we divide the density deviations into different density levels by the threshold partition lines.
$$\begin{aligned} A=\frac{4\rho _{cmin}+\rho _{cmax}}{3(N_i-1)} \end{aligned}$$
(14)
$$\begin{aligned} B=\frac{\rho _{cmin}+3\rho _{cmax}}{2N_i} \end{aligned}$$
(15)
In Fig. 2c, we can see that the sub-cluster centers in the low-density region are more concentrated and their density values are close to the lowest density points, while the sub-cluster centers in the high-density region are also concentrated and their density values are close to the highest density point. Therefore, we design more reasonable threshold partition lines based on this data feature. In Eqs. (14) and (15), \(\rho _{cmax}\) denotes the maximum density value among the sub-cluster centroids and \(\rho _{cmin}\) denotes the minimum density value among the sub-cluster centers. \(N_i\) is the number of neighboring data points.
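As a rough sketch of this step (our own simplification, not the paper's reference code), the density-deviation axis can be split into \(\lambda \) equal levels, the multi-peak points extracted per level, and the thresholds A and B of Eqs. (14) and (15) computed as follows; note that the per-level splitting line is approximated here by the mean \(\delta \) of the level, whereas the paper uses Eqs. (11) and (12):

```python
import numpy as np

def multi_peak_points(d_rho, delta, lam=3):
    """Split the density-deviation range into lam equal levels (Eqs. (9)-(10))
    and keep, per level, the points whose distance exceeds the level's
    splitting line (approximated here by the level's mean delta)."""
    d_min, d_max = d_rho.min(), d_rho.max()
    d_min = 0.1 if d_min < 0.001 else d_min          # numerical guard from Eq. (9)
    width = (d_max - d_min) / lam                    # Eq. (9)
    peaks = []
    for k in range(lam):
        lo = d_min + k * width
        hi = d_max if k == lam - 1 else lo + width
        level = np.where((d_rho >= lo) & (d_rho <= hi))[0]
        if level.size:
            split = delta[level].mean()              # stand-in for the line C_k
            peaks.append(level[delta[level] > split])
    return peaks                                     # [P_1, ..., P_lam]

def density_thresholds(d_rho_centers, Ni):
    """Threshold partition lines A and B of Eqs. (14) and (15), computed from
    the minimum and maximum density values of the sub-cluster centroids."""
    rc_min, rc_max = d_rho_centers.min(), d_rho_centers.max()
    A = (4 * rc_min + rc_max) / (3 * (Ni - 1))
    B = (rc_min + 3 * rc_max) / (2 * Ni)
    return A, B
```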

Automatic clustering of different density levels

We can obtain different density deviation levels by dividing the density range. Then, we merge and assign the data according to the different density levels. For the division and merging of multi-peak points, we distinguish two cases for clustering: (i) the threshold partition lines A and B both lie between the maximum and minimum values of the density deviation; (ii) the threshold partition line B is smaller than the minimum density deviation \(d_{\rho min}\). We refer to these two clustering modes as low-density multi-peak points clustering and high-density uniform clustering, respectively.
Low-density multi-peak points clustering: First, we find the ordinal numbers of the locations of the density deviation multi-peak points at the different density levels (see Fig. 2c). Then the ordinal numbers of these multi-peak points are sorted from lowest to highest according to the density hierarchy. We first cluster the data points in the highest density region, then merge the multi-peak points in the low-density region (\(d_{\rho min}\), A), and finally combine the points in the medium-density region (A, B). The final result of automatic clustering without human intervention is thus obtained (see Fig. 2d).
High-density uniform clustering: The previous mode is designed for datasets with low-density regions. However, if all cluster centers in the decision graph have high density deviations, the above data point assignment is no longer applicable. Therefore, we propose an allocation method for high-density uniform clustering. If the threshold partition line B is smaller than the minimum density deviation \(d_{\rho min}\), it means that the multi-peak points we obtained are all high-density clustering centroids. In this case, we assign each data point to the nearest point that is denser than itself and finally obtain the automatic clustering result.
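The assignment rule used in high-density uniform clustering follows the nearest-denser-neighbor idea of DPC; a minimal sketch (our own naming, assuming the chosen centers are given as index positions) could look like this:

```python
import numpy as np

def assign_by_nearest_denser(D, d_rho, center_idx):
    """Assign every non-center point to the label of its nearest neighbor
    with a higher density deviation, processing points from dense to sparse."""
    labels = np.full(len(d_rho), -1)
    for c, i in enumerate(center_idx):
        labels[i] = c                                 # seed the cluster centers
    for i in np.argsort(-d_rho):                      # densest points first
        if labels[i] != -1:
            continue
        denser = np.where(d_rho > d_rho[i])[0]
        if denser.size == 0:                          # global density peak
            labels[i] = 0
        else:
            j = denser[np.argmin(D[i, denser])]       # nearest denser point
            labels[i] = labels[j]                     # already labeled earlier
    return labels
```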
To contrast and evaluate the different clustering results on the same dataset, we set the cut-off distance \(d_c\) in Figs. 1 and 2 to the same value and observe the effects of the other parameters on the clustering results. To observe the processing of low-density points more clearly, we give the true clustering results for the Pathbased dataset in Fig. 2a. In Fig. 2d, the hexagrams are the multi-peak clustering centroids and the red numbers are the location ordinal numbers of the density deviations of the data points. In Fig. 2d, we can see that the data points in the low-density region are effectively clustered, and the Pathbased dataset obtains a better clustering result than with the DPC algorithm (see Fig. 1c).
In summary, for different types of data, the AmDPC method can divide and process data according to the decision graph properties of the data points. For low-density data, the AmDPC algorithm has very good reliability and generalizability. The entire procedure of AmDPC is shown in Algorithm 1.

Performance evaluation and experimental analysis

In this section, we conduct simulation experiments using the Olivetti Face dataset and synthetic and real-world datasets to validate the clustering performance of the AmDPC method. We use performance evaluation metrics such as normalized mutual information (NMI), accuracy (ACC), adjusted rand index (ARI), and F-measure (FM) to evaluate the clustering performance. To gain a clearer understanding of the clustering ability of the AmDPC algorithm, we compare the above evaluation metrics with those of state-of-the-art clustering methods such as DBSCAN [12], AP [14], DPC [10], McDPC [31], REDPC [26], and FHC-LDP [32], respectively. In Table 1, we list the eleven synthetic datasets and seven real-world datasets used for the simulation tests of the method.
Table 1  Datasets characteristics description

Datasets    Instances  Dimensions  Cluster numbers
Pathbased   300        2           3
Halfkernel  399        2           2
2circles    600        2           2
Flame       240        2           2
D1          87         2           3
D2          85         2           3
Zelink1     600        2           3
Smile2      1000       2           4
Jain        373        2           2
Complex8    2551       2           8
Complex9    3031       2           9
Heart       303        13          2
Liver       345        6           2
Sonar       208        60          2
Breast      277        9           2
Pima        768        8           2
Glass       214        9           6
Letter      20000      16          26

Clustering performance evaluation indexes

Before analyzing the clustering results, we will present the clustering performance evaluation metrics. We use four popular clustering performance metrics (NMI, ARI, ACC, FM). The values of these metrics range between -1 and 1, with the closer to 1 indicating better clustering performance.
NMI is a widely used external evaluation metric for clustering. For a set of true labels A and clustered labels B, the label values of A are extracted to form the vector U and the label values of B are extracted to form the vector V. Hence, the NMI of U and V can be described as:
$$\begin{aligned} NMI (A,B)=2\frac{I(U,V)}{H(U)+H(V)} \end{aligned}$$
(16)
$$\begin{aligned} I(U,V)=\sum _{u\in U}\sum _{v\in V} p(u,v)\log \left( \frac{p(u,v)}{p(u)p(v)}\right) \end{aligned}$$
(17)
$$\begin{aligned} H(U)=-\sum _{1}^{n}p(u_i)\log _2 p(u_i) \end{aligned}$$
(18)
$$\begin{aligned} H(V)=-\sum _{1}^{n}p(v_i)\log _2 p(v_i) \end{aligned}$$
(19)
where p(u) denotes the proportion of u in A, p(v) represents the probability of v in B, and p(u, v) denotes the joint probability of u and v. In clustering algorithm evaluation, the NMI value between the real labels A and the clustering labels B is often used to measure the effectiveness of a method.
ACC is a very classical metric for measuring clustering accuracy. Assume a dataset with n samples, where \(r_i\) and \(s_i\) denote the clustering label and the true label of sample i, respectively. ACC can then be defined as:
$$\begin{aligned} ACC=\frac{\sum _{i=1}^{n}\delta (s_i,map(r_i))}{n} \end{aligned}$$
(20)
$$\begin{aligned} \delta (x,y)=\left\{ \begin{matrix} 1 &{}\quad if x=y\\ 0 &{}\quad otherwise \end{matrix}\right. \end{aligned}$$
(21)
The \(map (r_i)\) is the permutation mapping function that matches the clustering labels and true labels.
Suppose the true labels of a dataset are P and the labels of the clustering result are Q. Let \(c_1\) be the number of sample pairs that fall into the same cluster in both P and Q, \(c_2\) the number of sample pairs that fall into the same cluster in P but different clusters in Q, and \(c_3\) the number of sample pairs that belong to different clusters in P but the same cluster in Q. The FM index is then calculated as shown in Eq. (22).
$$\begin{aligned} FM=\sqrt{\frac{c_1}{c_1+c_2}\cdot \frac{c_1}{c_1+c_3}} \end{aligned}$$
(22)
The rand index (RI) values of different methods are generally high, which makes it difficult to distinguish their clustering capability. Hence, we typically utilize ARI instead of RI to test a method's ability. The ARI is defined below.
$$\begin{aligned} ARI=\frac{RI-E[RI]}{max(RI)-E[RI]} \end{aligned}$$
(23)
$$\begin{aligned} RI=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(24)
In Eq. (24), the RI utilizes a pairwise way to measure true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP).
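For reference, all four metrics can be computed with standard libraries; the sketch below (with our own helper names) uses scikit-learn for NMI, ARI, and FM, and the Hungarian algorithm from SciPy for the label mapping \(map(r_i)\) required by ACC in Eq. (20):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_rand_score, fowlkes_mallows_score)

def clustering_accuracy(y_true, y_pred):
    """ACC of Eq. (20): accuracy under the best one-to-one mapping between
    predicted and true labels, found with the Hungarian algorithm."""
    true_ids, pred_ids = np.unique(y_true), np.unique(y_pred)
    overlap = np.zeros((len(pred_ids), len(true_ids)), dtype=int)
    for i, r in enumerate(pred_ids):
        for j, s in enumerate(true_ids):
            overlap[i, j] = np.sum((y_pred == r) & (y_true == s))
    rows, cols = linear_sum_assignment(-overlap)      # maximize matched pairs
    return overlap[rows, cols].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"NMI": normalized_mutual_info_score(y_true, y_pred),
            "ACC": clustering_accuracy(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "FM":  fowlkes_mallows_score(y_true, y_pred)}
```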

Experiment results on synthetic datasets

In Table 2, we give the clustering results of the seven state-of-the-art methods on the synthetic datasets under the different evaluation metrics. To highlight the performance of the algorithms, we bold the results of the best evaluation metrics. To see the clustering of the synthetic datasets by the different methods more intuitively, we give visualization results in Figs. 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13 for all datasets in Table 2. In Table 2, we give the values of the best clustering metrics corresponding to the optimal parameters. In Table 2, the Pathbased, Halfkernel, 2circles, Zelink1, Smile2, and Jain datasets are all non-convex datasets with low-density points. From the table, we can see that the AmDPC algorithm obtains the best clustering results on all these non-convex datasets. Therefore, we conclude that the AmDPC algorithm handles low-density data points better than the other algorithms and can handle non-convex datasets well.
Tables 2 and 3 show the parameter values corresponding to the clustering results of the different methods. Among the seven state-of-the-art methods, the DBSCAN parameters are the neighbourhood radius (Eps) and the minimum number of points (MinPts); the AP parameters are the preference (Preference, the median value of the similarity matrix) and the damping factor (\(\lambda \)); and the DPC parameter is the neighbor percentage (p). The McDPC algorithm uses the parameters (\( \gamma \), \( \theta \), \( \lambda \)) and the neighbor percentage (p). The REDPC parameters are the neighborhood size (\( N_i \)) and the neighbor percentage (p). The FHC-LDP method uses the number of neighbors (K) and the number of clusters (C). The parameters of the AmDPC method are the number of neighbors (\(N_i\)), the peak splitting lines (\( C_1 \), \( C_2 \)), and the neighbor percentage (p). Among them, the parameters \( C_1 \) and \( C_2 \) modulate the selection of low-density clustering centers. We tune these four parameters of the AmDPC algorithm jointly. All parameters correspond to the best cluster evaluation index values.
From Table 2, we can see that the AmDPC algorithm obtains the best clustering metric values on the first nine synthetic datasets compared to the other state-of-the-art methods. The AmDPC method is more effective than the other algorithms on non-convex datasets, especially the Pathbased and Halfkernel datasets. On the Pathbased dataset, the FHC-LDP algorithm cannot effectively handle low-density points and obtains lower metric values than the AmDPC algorithm. On the D2 dataset, the AP, REDPC, and AmDPC algorithms achieve the best clustering results relative to the other methods. In Fig. 8, we can see that the REDPC method identifies the outlier in the D2 dataset. We can further conclude that the AmDPC algorithm can very accurately assign outliers located at low-density points to the corresponding cluster (see Table 2). In general, the AmDPC algorithm produces very good clustering results for non-convex datasets with low-density points.
Table 2  The performance of synthetic datasets

Datasets    Algorithms  Par                 NMI     ACC     ARI     FM
Pathbased   DBSCAN      1.8/3               0.7005  0.8067  0.5918  0.7342
            AP          -563.78/0.9         0.4917  0.6333  0.2085  0.6525
            DPC         1                   0.5530  0.7467  0.4679  0.6654
            McDPC       0.12/0.8/3.5/0.5    1.0000  1.0000  1.0000  1.0000
            REDPC       8/5                 0.5791  0.7600  0.5090  0.6810
            FHC-LDP     8/3                 0.7144  0.7333  0.7547  0.5629
            AmDPC       2/2.6/4.5/1         1.0000  1.0000  1.0000  1.0000
Halfkernel  DBSCAN      3/3                 1.0000  1.0000  1.0000  1.0000
            AP          -954.24/0.8         0.3534  0.4340  0.2480  0.5161
            DPC         7                   0.4641  0.8250  0.4220  0.7286
            McDPC       0.5/5/7/4           1.0000  1.0000  1.0000  1.0000
            REDPC       6/8                 0.5186  0.8250  0.5126  0.7533
            FHC-LDP     8/2                 1.0000  1.0000  1.0000  1.0000
            AmDPC       2/4/3.9/8           1.0000  1.0000  1.0000  1.0000
2circles    DBSCAN      1.8/3               1.0000  1.0000  1.0000  1.0000
            AP          -136.72/0.8         0.3482  0.2600  0.1811  0.4384
            DPC         4                   0.0136  0.5633  0.0146  0.5321
            McDPC       0.2/2/3/2           1.0000  1.0000  1.0000  1.0000
            REDPC       1/3                 1.0000  1.0000  1.0000  1.0000
            FHC-LDP     8/2                 1.0000  1.0000  1.0000  1.0000
            AmDPC       3/2.5/2.5/8         1.0000  1.0000  1.0000  1.0000
Flame       DBSCAN      0.93/4              0.9259  0.9792  0.9659  0.9840
            AP          -88.89/0.8          0.5476  0.5958  0.4273  0.6693
            DPC         5                   1.0000  1.0000  1.0000  1.0000
            McDPC       0.3/0.09/3/8        1.0000  1.0000  1.0000  1.0000
            REDPC       40/2                1.0000  1.0000  1.0000  1.0000
            FHC-LDP     15/2                1.0000  1.0000  1.0000  1.0000
            AmDPC       4/4/4/6             1.0000  1.0000  1.0000  1.0000
D1          DBSCAN      0.6/1               0.8370  0.7701  0.8063  0.8722
            AP          -21.52/0.85         0.8512  0.8541  0.7215  0.8587
            DPC         17                  1.0000  1.0000  1.0000  1.0000
            McDPC       0.5/0.9/0.9/10      1.0000  1.0000  1.0000  1.0000
            REDPC       1/2                 1.0000  1.0000  1.0000  1.0000
            FHC-LDP     4/3                 1.0000  1.0000  1.0000  1.0000
            AmDPC       4/1.3/1.5/6         1.0000  1.0000  1.0000  1.0000
D2          DBSCAN      0.75/5              0.8974  0.9412  0.9135  0.9346
            AP          -48.67/0.75         1.0000  1.0000  1.0000  1.0000
            DPC         2                   0.9655  0.9882  0.9679  0.9756
            McDPC       1.9/0.09/1.5/21     0.9832  0.9882  0.9838  0.9877
            REDPC       8/2                 1.0000  1.0000  1.0000  1.0000
            FHC-LDP     5/4                 0.9655  0.9882  0.9679  0.9756
            AmDPC       25/3/2.5/0.6        1.0000  1.0000  1.0000  1.0000
Zelink1     DBSCAN      0.06/2              1.0000  1.0000  1.0000  1.0000
            AP          -3.48/0.95          0.3578  0.3779  0.2122  0.4229
            DPC         5                   0.1814  0.4749  0.0802  0.4154
            McDPC       0.6/2.8/0.05/10     1.0000  1.0000  1.0000  1.0000
            REDPC       1/2                 1.0000  1.0000  1.0000  1.0000
            FHC-LDP     15/3                1.0000  1.0000  1.0000  1.0000
            AmDPC       4/0.046/0.06/2      1.0000  1.0000  1.0000  1.0000
Smile2      DBSCAN      0.05/2              1.0000  1.0000  1.0000  1.0000
            AP          -13.55/0.85         0.5125  0.5860  0.3749  0.5493
            DPC         88                  1.0000  1.0000  1.0000  1.0000
            McDPC       0.01/0.21/0.18/88   1.0000  1.0000  1.0000  1.0000
            REDPC       1/8                 1.0000  1.0000  1.0000  1.0000
            FHC-LDP     15/4                1.0000  1.0000  1.0000  1.0000
            AmDPC       8/0.15/0.15/2       1.0000  1.0000  1.0000  1.0000
Jain        DBSCAN      2.4/6               0.8486  0.9196  0.9350  0.9742
            AP          -778.85/0.75        0.3360  0.7340  0.2284  0.6570
            DPC         37                  1.0000  1.0000  1.0000  1.0000
            McDPC       0.1/2/3.25/2        1.0000  1.0000  1.0000  1.0000
            REDPC       20/7                0.5052  0.8606  0.5146  0.7905
            FHC-LDP     14/2                1.0000  1.0000  1.0000  1.0000
            AmDPC       160/12/12/80        1.0000  1.0000  1.0000  1.0000
Complex8    DBSCAN      15.6/4              0.9985  0.9988  0.9996  0.9997
            AP          -6439.19/0.8        0.6379  0.4606  0.3936  0.4891
            DPC         5                   0.7812  0.7327  0.5113  0.7177
            McDPC       0.02/1/22/19        0.8539  0.7589  0.7368  0.7973
            REDPC       20/8                0.7473  0.6441  0.6263  0.6875
            FHC-LDP     8/8                 0.9501  0.9400  0.9274  0.9379
            AmDPC       3/50/38.6/16        0.8574  0.8009  0.7526  0.7955
Complex9    DBSCAN      15.1/10             0.9942  0.9970  0.9975  0.9979
            AP          -2792.52/0.8        0.7112  0.3583  0.2845  0.4393
            DPC         3                   0.7162  0.5737  0.4462  0.5627
            McDPC       0.02/0.6/22/15      0.8581  0.7667  0.7085  0.7699
            REDPC       1/8                 0.7262  0.6457  0.3389  0.6213
            FHC-LDP     15/9                1.0000  1.0000  1.0000  1.0000
            AmDPC       3/20/21/50          0.9132  0.8746  0.8919  0.9132

Table 3  The performance of real-world datasets

Datasets  Algorithms  Par                NMI     ACC     ARI      FM
Heart     DBSCAN      0.8/6              0.1240  0.2178  0.0517   0.3089
          AP          -56.67/0.9         0.1611  0.7195  0.1899   0.5995
          DPC         4                  0.1519  0.7195  0.1897   0.6198
          McDPC       1.12/5.25/1.13/2   0.1519  0.6964  0.1647   0.6235
          REDPC       2/7                0.1507  0.7063  0.1810   0.6125
          FHC-LDP     15/2               0.1754  0.7294  0.2074   0.6349
          AmDPC       2/0.8/0.9/12       0.2861  0.7360  0.3649   0.6507
Liver     DBSCAN      36/4               0.0269  0.5594  -0.0087  0.6893
          AP          -1827.91/0.8       0.0219  0.5420  0.0046   0.5459
          DPC         5                  0.0105  0.5826  0.0217   0.5463
          McDPC       0.2/0.06/3/0.1     0.0187  0.4928  0.0209   0.5376
          REDPC       3/5                0.0021  0.5710  -0.0035  0.6994
          FHC-LDP     5/2                0.0023  0.5652  0.0076   0.5779
          AmDPC       5/80/12/8          0.0353  0.5913  0.0347   0.6120
Sonar     DBSCAN      1.1/7              0.0396  0.5433  0.0011   0.5260
          AP          -35.53/0.75        0.0033  0.5385  0.0011   0.5260
          DPC         5                  0.0021  0.5385  0.0013   0.6775
          McDPC       0.01/3/1.5/0.2     0.0000  0.5337  0.0000   0.7070
          REDPC       1/2                0.0826  0.6635  0.1026   0.5478
          FHC-LDP     5/2                0.0696  0.6394  0.0733   0.5478
          AmDPC       2/0.78/0.28/0.3    0.3174  0.8077  0.4013   0.7023
Breast    DBSCAN      2.65/4             0.0121  0.7148  0.0261   0.7609
          AP          -98.93/0.75        0.0048  0.5632  0.0100   0.5499
          DPC         5                  0.0375  0.7004  0.1041   0.6677
          McDPC       0.52/1.25/1.13/2   0.0565  0.4946  0.0089   0.4742
          REDPC       3/2                0.0713  0.7040  0.1517   0.6469
          FHC-LDP     10/2               0.0204  0.7112  0.0472   0.7384
          AmDPC       2/2.1/2/4          0.1123  0.7437  0.1965   0.7384
Pima      DBSCAN      0.3/4              0.0154  0.6341  0.0316   0.6897
          AP          -22.44/0.8         0.0639  0.6536  0.0925   0.5686
          DPC         5                  0.0076  0.5990  0.0271   0.5816
          McDPC       0.02/1/0.6/5       0.0171  0.6523  0.0023   0.7380
          REDPC       3/2                0.0066  0.6484  0.0153   0.7104
          FHC-LDP     5/2                0.0182  0.6563  0.0111   0.7341
          AmDPC       3/0.43/0.21/30     0.0837  0.6862  0.1075   0.7083
Glass     DBSCAN      0.5/4              0.3305  0.4159  0.1771   0.4381
          AP          -46.67/0.75        0.2614  0.4439  0.1750   0.5408
          DPC         1                  0.3053  0.4579  0.1707   0.4784
          McDPC       1.12/2.11/2.9/25   0.3391  0.4533  0.1435   0.5428
          REDPC       1/2                0.2323  0.3879  0.0806   0.5002
          FHC-LDP     17/6               0.3019  0.6075  0.2690   0.4960
          AmDPC       4/1.5/3/5          0.4008  0.4860  0.1946   0.5700
Letter    DBSCAN      3/35               0.3279  0.1676  0.0157   0.1479
          AP          -623.57/0.8        0.3541  0.2553  0.1241   0.1596
          DPC         3                  0.3361  0.2420  0.1259   0.1960
          McDPC       0.02/0.5/6/0.02    0.3802  0.2114  0.0598   0.1886
          REDPC       6/1                0.3725  0.2297  0.0394   0.1866
          FHC-LDP     50/26              0.4681  0.2517  0.0429   0.1981
          AmDPC       320/6.5/3.5/0.1    0.4068  0.2587  0.1276   0.1867

Table 4  Time complexity of different algorithms

Algorithm        DBSCAN    AP           DPC         McDPC       REDPC       FHC-LDP   AmDPC
Time complexity  O(nlogn)  \(O(In^2)\)  \(O(n^2)\)  \(O(n^2)\)  \(O(n^2)\)  O(nlogn)  \(O(n^2)\)

The experiments show that the AmDPC algorithm achieves the best clustering results except on the Complex8 and Complex9 datasets. For the Pathbased dataset, the DBSCAN, AP, DPC, REDPC, and FHC-LDP algorithms cannot obtain accurate clustering results (see Fig. 3). However, the AmDPC algorithm obtains the best clustering results by clustering and merging the low-density points. In Fig. 4, the AP, DPC and REDPC algorithms cannot get accurate results, while the DBSCAN, McDPC, FHC-LDP and AmDPC algorithms get the best clustering results. On the non-convex datasets of Figs. 5, 9, 10 and 11, the AmDPC algorithm obtains the best clustering results. In Figs. 6, 7 and 8, the AmDPC algorithm also obtains the best clustering results for high-density data points. In Fig. 8, the DBSCAN, AP and REDPC methods fail to get the correct results, whereas the DPC, McDPC, FHC-LDP and AmDPC algorithms get the best clustering results. This shows that the improved algorithm can accurately handle the data points in low-density regions and automatically obtains the best clustering results without manual intervention.
In Figs. 12 and 13, Complex8 and Complex9 are both complex-structure datasets with non-uniform density. We can see that DBSCAN comes close to the best clustering results for the Complex8 and Complex9 datasets. In Fig. 12, the FHC-LDP algorithm can also cluster data with an arbitrary structure, but its performance is slightly lower than that of the DBSCAN method. In Fig. 13, the FHC-LDP algorithm achieves the best clustering result for the Complex9 dataset and the DBSCAN algorithm achieves the second best clustering result. From Table 2, we can conclude that the AmDPC algorithm is second only to the DBSCAN and FHC-LDP methods in terms of cluster evaluation metric values on the Complex8 and Complex9 datasets. From the above figures, we can see that the AmDPC algorithm is superior to the other state-of-the-art methods for clustering non-convex datasets with low-density points, and it outperforms the AP, DPC, McDPC, and REDPC algorithms on non-uniform density datasets with complex structures.
In general, the AmDPC algorithm has excellent clustering performance for non-convex datasets with low-density points, and also has a certain clustering effect on complex datasets with non-uniform density structures. To further validate the clustering performance of the method, we will confirm the clustering utility of the AmDPC method on high-dimensional data.

Experiment results on real-world datasets

Table 3 shows the clustering evaluation metrics of the seven state-of-the-art methods on the real-world datasets. We give the best results based on the magnitude of the four clustering index values. The parameters in the table all correspond to the optimal experimental results of the different algorithms. From Table 3, we can see that the clustering evaluation metrics of the AmDPC method are significantly higher than those of the other methods on the Heart dataset. On the Liver dataset, the AmDPC method has the best clustering results for the NMI, ACC, and ARI evaluation metrics. Although the FM metric of the REDPC algorithm has the highest value, its ARI value is negative and the overall clustering accuracy is not as high as that of the AmDPC algorithm. On the Sonar dataset, the NMI, ACC, and ARI metrics of the AmDPC method are significantly higher than those of the other state-of-the-art algorithms. Although the McDPC algorithm has the highest FM value, its other index values are 0, and the overall clustering accuracy is not as good as that of the AmDPC method.
On the Breast dataset, the AmDPC algorithm has three clustering evaluation metrics higher than the other algorithms, while its FM value is second only to that of the DBSCAN algorithm. On the Pima dataset, the NMI, ACC and ARI values of the AmDPC algorithm are much higher than those of the other algorithms. Although the FM values of the McDPC, REDPC and FHC-LDP methods are slightly higher than that of the AmDPC algorithm, their other metric values are particularly low and the overall clustering accuracy is not good. Therefore, the clustering performance of the AmDPC algorithm is higher than that of the other algorithms on the Pima dataset. On the Glass dataset, the AmDPC method has two clustering evaluation metrics higher than the other algorithms, and another evaluation metric second only to the FHC-LDP algorithm. The ACC and ARI metrics of the AmDPC method give the best clustering results on the Letter dataset, which has a large number of sample points. Although the NMI and FM values of the FHC-LDP algorithm are the highest, all its other values are low, and its overall clustering accuracy is not as high as that of the AmDPC method. In general, the AmDPC algorithm has better clustering performance on real-world datasets and has certain practicality and effectiveness.

Clustering time complexity analysis

In this subsection, we will further contrast and analyze the operation time complexity of DBSCAN, AP, DPC, McDPC, REDPC, FHC-LDP, and AmDPC algorithms.
In AmDPC, the time complexity of calculating the parameters \(d_{\rho _i}\) and \(\delta \) is \(O(n^2)\). The time complexity of the clustering density region acquisition is O(n). The time complexity of computing the density level threshold partition lines is O(n). The time complexity of the automatic clustering of different density levels is \(O(n^2)\). In summary, the time complexity of AmDPC is \(O(n^2)\), the same as that of the DPC, McDPC and REDPC methods. As shown in Table 4, the time complexity of the DBSCAN and FHC-LDP algorithms is lower, and that of the AP clustering algorithm is relatively high because it requires iterations. In Table 4, n is the number of data points and I is the number of iterations. The time complexity of the DPC, McDPC, REDPC and AmDPC methods is lower than that of the AP method and higher than that of the DBSCAN and FHC-LDP methods.

Clustering performance analysis of facial recognition

In this subsection, we use the ORL database (Olivetti Research Laboratory in Cambridge, UK) to test the practicality of the AmDPC method and the other algorithms. It contains 400 different face images, 10 for each of 40 different subjects. The images were recorded at varying times of day under changing lighting, facial expressions (smiling/not smiling), and facial details (eyes open/closed, with/without glasses). All images show the subject's frontal face with some degree of deflection or tilt up and down, against a similar dark uniform backdrop. The size of each image is 112\(\times \)92 pixels in 8-bit grayscale. For comparison, we used the complex wavelet structural similarity toolbox to calculate the similarity of any two images [33].
We take the first ten people, a total of 100 face images of the same size without overlap, for our experiment. We assign all images of the same person an identical class label, for a total of 10 class labels. In Table 5, we give the clustering evaluation metrics of the seven methods on the Olivetti Face dataset. The parameter types of the different algorithms in Table 5 are the same as those in Tables 2 and 3. The parameters are those giving the best clustering results obtained after several trials. In Table 5, we highlight the best clustering metrics. From Table 5, we can see that the evaluation metrics of the AmDPC algorithm are significantly higher than those of the other algorithms, indicating that our proposed algorithm has higher effectiveness and robustness.
Table 5  Olivetti Face clustering evaluation indexes results

Algorithms  Par             NMI     ACC     ARI     FM
DBSCAN      0.2/1           0.5491  0.3300  0.0431  0.7038
AP          -0.4323/0.4     0.8396  0.7000  0.6667  0.8068
DPC         3               0.9075  0.7900  0.6628  0.7982
McDPC       2/0.12/0.16/8   0.8781  0.7400  0.7093  0.7451
REDPC       7/8             0.8787  0.7200  0.7226  0.7582
FHC-LDP     5/10            0.8716  0.8100  0.6916  0.7323
AmDPC       5/0.42/0.12/3   0.9251  0.8700  0.8067  0.8484

In Fig. 14, we also give the visualization results of the AmDPC algorithm on the Olivetti Face dataset, where different colors indicate different categories. In Fig. 14, we can see that the 10 people are divided into 9 categories; the first and the fourth person from the top are grouped into one category. The facial images of the second, third, sixth, and seventh people are recognized very accurately. For the fifth and tenth people, nine and eight facial images are identified correctly, respectively, and their remaining images are assigned to the categories of the ninth and seventh person. In general, the AmDPC method is effective and practical for face recognition compared with other advanced methods.

Multi-peak points intervals sequencing ablation experiments

Table 6  The performance of multi-peak points intervals ordering

Datasets   Ordering methods         NMI     ACC     ARI     FM
Pathbased  \(\{P_1, P_2, P_3\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_1, P_3, P_2\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_2, P_1, P_3\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_2, P_3, P_1\}\)    0.1301  0.3600  0.0006  0.5375
           \(\{P_3, P_1, P_2\}\)    0.1301  0.3600  0.0006  0.5375
           \(\{P_3, P_2, P_1\}\)    0.1301  0.3600  0.0006  0.5375
2circles   \(\{P_1, P_2, P_3\}\)    0.3111  0.7250  0.2016  0.6584
           \(\{P_1, P_3, P_2\}\)    0.3111  0.7250  0.2016  0.6584
           \(\{P_2, P_1, P_3\}\)    0.3111  0.7250  0.2016  0.6584
           \(\{P_2, P_3, P_1\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_3, P_1, P_2\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_3, P_2, P_1\}\)    1.0000  1.0000  1.0000  1.0000
Flame      \(\{P_1, P_2, P_3\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_1, P_3, P_2\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_2, P_1, P_3\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_2, P_3, P_1\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_3, P_1, P_2\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_3, P_2, P_1\}\)    1.0000  1.0000  1.0000  1.0000
D2         \(\{P_1, P_2, P_3\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_1, P_3, P_2\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_2, P_1, P_3\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_2, P_3, P_1\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_3, P_1, P_2\}\)    1.0000  1.0000  1.0000  1.0000
           \(\{P_3, P_2, P_1\}\)    1.0000  1.0000  1.0000  1.0000
Liver      \(\{P_1, P_2, P_3\}\)    0.0353  0.5913  0.0347  0.6120
           \(\{P_1, P_3, P_2\}\)    0.0353  0.5913  0.0347  0.6120
           \(\{P_2, P_1, P_3\}\)    0.0353  0.5913  0.0347  0.6120
           \(\{P_2, P_3, P_1\}\)    0.0353  0.5913  0.0347  0.6120
           \(\{P_3, P_1, P_2\}\)    0.0353  0.5913  0.0347  0.6120
           \(\{P_3, P_2, P_1\}\)    0.0353  0.5913  0.0347  0.6120
Breast     \(\{P_1, P_2, P_3\}\)    0.1163  0.7509  0.1388  0.7587
           \(\{P_1, P_3, P_2\}\)    0.1163  0.7509  0.1388  0.7587
           \(\{P_2, P_1, P_3\}\)    0.1163  0.7509  0.1388  0.7587
           \(\{P_2, P_3, P_1\}\)    0.0248  0.5451  0.0007  0.5471
           \(\{P_3, P_1, P_2\}\)    0.0248  0.5451  0.0007  0.5471
           \(\{P_3, P_2, P_1\}\)    0.0248  0.5451  0.0007  0.5471
Glass      \(\{P_1, P_2, P_3\}\)    0.4008  0.4860  0.1946  0.5700
           \(\{P_1, P_3, P_2\}\)    0.4008  0.4860  0.1946  0.5700
           \(\{P_2, P_1, P_3\}\)    0.4008  0.4860  0.1946  0.5700
           \(\{P_2, P_3, P_1\}\)    0.3879  0.4766  0.1996  0.5530
           \(\{P_3, P_1, P_2\}\)    0.3879  0.4766  0.1996  0.5530
           \(\{P_3, P_2, P_1\}\)    0.3879  0.4766  0.1996  0.5530
Letter     \(\{P_1, P_2, P_3\}\)    0.4068  0.2587  0.1276  0.1867
           \(\{P_1, P_3, P_2\}\)    0.4068  0.2587  0.1276  0.1867
           \(\{P_2, P_1, P_3\}\)    0.4068  0.2587  0.1276  0.1867
           \(\{P_2, P_3, P_1\}\)    0.4068  0.2587  0.1276  0.1867
           \(\{P_3, P_1, P_2\}\)    0.4068  0.2587  0.1276  0.1867
           \(\{P_3, P_2, P_1\}\)    0.4068  0.2587  0.1276  0.1867

In this subsection, we further discuss the effect of the multi-peak point interval ordering methods on the clustering results. In Eq. (13), we only introduce two ordering methods (\(\left\{ P_1, P_2,P_3 \right\} \) and \(\left\{ P_3, P_2,P_1 \right\} \)) and do not show the effect of the other orderings on the clustering results. There are six possible orderings of \(P_1\), \(P_2\) and \(P_3\). Since we choose only one of the sorting methods for each experiment, we analyze each of the six different sorting methods separately (see Table 6). Through the ablation experiments, we find that the clustering results of the other four sorting methods are the same as the results of the two sorting cases in Eq. (13). The best clustering performance can essentially be achieved using the two cases in Eq. (13).
We select four synthetic datasets and four real-world datasets for testing from Table 2 and 3, respectively. To ensure the validity of the experiments, we keep the four parameters of the AmDPC algorithm in Table 2 and 3 unchanged and observe the effect of different ordering methods on the clustering performance. We have listed the six ordering methods \( \left\{ P_1, P_2,P_3 \right\} \), \( \left\{ P_1, P_3,P_2 \right\} \), \( \left\{ P_2, P_1,P_3 \right\} \), \( \left\{ P_2, P_3,P_1 \right\} \), \( \left\{ P_3, P_1,P_2 \right\} \) and \( \left\{ P_3, P_2,P_1 \right\} \) respectively in Table 6. In the table, we bold the clustering results corresponding to the sorting methods in Eq. (13) and further observe their effects on the clustering results.
In Table 6, we can see that the evaluation metric values of the Flame, D2, Liver and Letter datasets remain unchanged across the six ordering methods of the AmDPC algorithm, indicating that the interval ordering has no effect on the clustering results for these datasets. From Table 6, we find that the AmDPC method produces two groups of clustering metric values across the interval ordering methods on the Pathbased, 2circles, Breast and Glass datasets. On these four datasets, the sorting methods \( \left\{ P_1, P_2,P_3 \right\} \), \( \left\{ P_1, P_3,P_2 \right\} \) and \( \left\{ P_2, P_1,P_3 \right\} \) all have the same clustering indicator values, and the ordering methods \( \left\{ P_2, P_3,P_1 \right\} \), \( \left\{ P_3, P_1,P_2 \right\} \) and \( \left\{ P_3, P_2,P_1 \right\} \) also have the same clustering index values. This indicates that the interval ordering of multi-peak points has some influence on the clustering performance. The clustering results of the \( \left\{ P_1, P_2,P_3 \right\} \), \( \left\{ P_1, P_3,P_2 \right\} \) and \( \left\{ P_2, P_1,P_3 \right\} \) sorting methods on the Pathbased, Breast and Glass datasets are greater than those of the \( \left\{ P_2, P_3,P_1 \right\} \), \( \left\{ P_3, P_1,P_2 \right\} \) and \( \left\{ P_3, P_2,P_1 \right\} \) sorting methods. On the 2circles dataset, the clustering results obtained by the \( \left\{ P_2, P_3,P_1 \right\} \), \( \left\{ P_3, P_1,P_2 \right\} \) and \( \left\{ P_3, P_2,P_1 \right\} \) sorting methods are 1 and greater than those obtained by the \( \left\{ P_1, P_2,P_3 \right\} \), \( \left\{ P_1, P_3,P_2 \right\} \) and \( \left\{ P_2, P_1,P_3 \right\} \) sorting methods.
Since the clustering results of the other four sorting methods are the same as those corresponding to the \( \left\{ P_1, P_2,P_3 \right\} \) and \( \left\{ P_3, P_2,P_1 \right\} \) sorting methods, the best clustering results of the AmDPC algorithm can be fully achieved using only the \( \left\{ P_1, P_2,P_3 \right\} \) and \( \left\{ P_3, P_2,P_1 \right\} \) orderings. This further indicates that the effect of the other sorting methods on the clustering results is consistent with the two sorting cases in Eq. (13). In summary, the AmDPC algorithm can obtain the best clustering results by testing and comparing the two ordering cases in Eq. (13) separately.

Parameters sensitivity analysis of the AmDPC algorithm

The AmDPC algorithm has four parameters: \(N_i\), \(C_1\), \(C_2\), and p. In this subsection, we will test the sensitivity of the clustering results for each of these four parameters to further explore their impact on the clustering performance.
We first start with the parameter p, which directly affects the quality of decision graph generation for the AmDPC, REDPC and DPC algorithms. We set different p values for testing, p = {1–8} (p a positive integer), and test on three synthetic and three real-world datasets; the clustering results are shown in Tables 7, 8, 9 and 10. The values in these tables are the means and standard deviations of the eight experimental results of the AmDPC, REDPC and DPC algorithms for p in the range one to eight, respectively. We bold the mean and standard deviation of the better evaluation metric results. The AmDPC algorithm has better clustering results than the REDPC and DPC methods when p is in the range of one to eight, and the difference in stability between them is not significant. We test the stability of the AmDPC algorithm parameter p while keeping the other parameters constant. In Table 8, we can see that the REDPC algorithm outperforms the DPC and AmDPC algorithms in terms of mean and standard deviation on the Pima dataset; the main reason is that the tested range of the parameter p does not allow the AmDPC algorithm to achieve its best clustering results. From Tables 7, 8, 9 and 10, we can see that the mean evaluation metric values of the AmDPC algorithm on the different datasets are significantly higher than those of the REDPC and DPC methods. The standard deviation of the AmDPC algorithm is higher than that of the REDPC and DPC methods on some datasets, mainly because the AmDPC metric values are higher than those of the other algorithms in the sensitivity test of the parameter p.
Table 7  Sensitivity analysis of NMI indicator when p = {1–8}
Datasets   | AmDPC           | REDPC           | DPC
Pathbased  | 0.6759 ± 0.1940 | 0.5557 ± 0.0186 | 0.5268 ± 0.0225
Halfkernel | 0.8585 ± 0.1640 | 0.7480 ± 0.1206 | 0.7274 ± 0.1117
Zelink1    | 0.9512 ± 0.1103 | 0.7957 ± 0.1631 | 0.1766 ± 0.0487
Breast     | 0.0839 ± 0.0291 | 0.0603 ± 0.0197 | 0.0161 ± 0.0096
Pima       | 0.0239 ± 0.0129 | 0.0206 ± 0.0104 | 0.0172 ± 0.0153
Glass      | 0.3795 ± 0.0171 | 0.1578 ± 0.0719 | 0.2464 ± 0.0388
Table 8  Sensitivity analysis of ACC indicator when p = {1–8}
Datasets   | AmDPC           | REDPC           | DPC
Pathbased  | 0.8229 ± 0.1190 | 0.6958 ± 0.0583 | 0.7100 ± 0.0346
Halfkernel | 0.8585 ± 0.1640 | 0.7480 ± 0.1206 | 0.7274 ± 0.1117
Zelink1    | 0.9569 ± 0.1139 | 0.7659 ± 0.2190 | 0.4816 ± 0.0608
Breast     | 0.6683 ± 0.0836 | 0.6083 ± 0.1105 | 0.5131 ± 0.0881
Pima       | 0.5931 ± 0.1035 | 0.6125 ± 0.0223 | 0.5062 ± 0.1289
Glass      | 0.4737 ± 0.0100 | 0.3785 ± 0.0120 | 0.4014 ± 0.0468
Table 9  Sensitivity analysis of ARI indicator when p = {1–8}
Datasets   | AmDPC           | REDPC           | DPC
Pathbased  | 0.6218 ± 0.2336 | 0.4622 ± 0.0908 | 0.4377 ± 0.0610
Halfkernel | 0.6451 ± 0.3972 | 0.4437 ± 0.0641 | 0.2805 ± 0.1733
Zelink1    | 0.9331 ± 0.1637 | 0.6448 ± 0.2816 | 0.0447 ± 0.0381
Breast     | 0.1431 ± 0.0314 | 0.1162 ± 0.0524 | 0.0025 ± 0.0246
Pima       | 0.0455 ± 0.0172 | 0.0448 ± 0.0190 | 0.0195 ± 0.0134
Glass      | 0.1688 ± 0.0238 | 0.0454 ± 0.0318 | 0.1101 ± 0.0498
Table 10  Sensitivity analysis of FM indicator when p = {1–8}
Datasets   | AmDPC           | REDPC           | DPC
Pathbased  | 0.7615 ± 0.1450 | 0.6670 ± 0.0106 | 0.6549 ± 0.0083
Halfkernel | 0.8440 ± 0.1673 | 0.7027 ± 0.0561 | 0.6648 ± 0.0715
Zelink1    | 0.9594 ± 0.0989 | 0.8077 ± 0.1703 | 0.4875 ± 0.0826
Breast     | 0.6622 ± 0.0938 | 0.6088 ± 0.0917 | 0.5329 ± 0.0873
Pima       | 0.6169 ± 0.1037 | 0.6057 ± 0.0647 | 0.5189 ± 0.1039
Glass      | 0.5614 ± 0.0079 | 0.4968 ± 0.0085 | 0.3672 ± 0.0893
To further analyze the sensitivity of the other parameters of the AmDPC method, Tables 11, 12 and 13 give the ranges of parameter values used for the different datasets. We split each range of \(N_i\), \(C_1\) and \(C_2\) values into eight equally spaced values and report the mean and standard deviation of the eight results. Table 11 shows that the parameter \(N_i\) obtains high clustering metric values on the different datasets; encouragingly, the mean of the clustering results within each range is high and the standard deviation is low, indicating that the clustering results are not very sensitive to \(N_i\). Tables 12 and 13 show that the AmDPC method produces stable results for the parameters \(C_1\) and \(C_2\), with a standard deviation of 0: the evaluation index values do not change within the given ranges, and every dataset achieves its best clustering effect. The main reason is that these parameter values do not affect the selection of clustering centers or the low-density processing, so the clustering results do not change. In summary, the parameters \(N_i\), \(C_1\) and \(C_2\) of the AmDPC algorithm are not sensitive to the clustering results and all achieve good clustering results within a certain range.
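A rough sketch of the range-splitting procedure used for Tables 11, 12 and 13 is shown below; `cluster_fn(X, value)` is again a placeholder for AmDPC with the remaining parameters fixed, and an integer parameter such as \(N_i\) would be rounded before use.

```python
import numpy as np

def range_sensitivity(cluster_fn, X, y_true, low, high, metric, n_points=8):
    """Split [low, high] into n_points equally spaced values and report the
    mean +/- std of a clustering metric over the sweep, as in Tables 11-13.

    cluster_fn(X, value) is a hypothetical wrapper; metric is any
    sklearn-style score taking (labels_true, labels_pred).
    """
    values = np.linspace(low, high, n_points)
    results = [metric(y_true, cluster_fn(X, v)) for v in values]
    return float(np.mean(results)), float(np.std(results))
```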
Table 11  Sensitivity analysis of AmDPC algorithm to parameter \(N_i\)
Datasets [Range of \(N_i\)] | NMI             | ACC             | ARI             | FM
Pathbased [2–9]             | 0.6795 ± 0.3952 | 0.7579 ± 0.2778 | 0.6051 ± 0.4707 | 0.8181 ± 0.2047
Halfkernel [4–11]           | 0.8919 ± 0.3057 | 0.9476 ± 0.1481 | 0.8783 ± 0.3444 | 0.9577 ± 0.1195
Zelink1 [2–9]               | 0.8115 ± 0.1826 | 0.8579 ± 0.1713 | 0.7216 ± 0.3111 | 0.8536 ± 0.1585
Breast [2–9]                | 0.0605 ± 0.0395 | 0.5533 ± 0.1095 | 0.0823 ± 0.0693 | 0.5714 ± 0.0833
Pima [2–9]                  | 0.0284 ± 0.0187 | 0.6584 ± 0.0131 | 0.0343 ± 0.0382 | 0.7232 ± 0.0125
Glass [4–11]                | 0.3893 ± 0.0213 | 0.4802 ± 0.0108 | 0.1912 ± 0.0063 | 0.5697 ± 0.0005
Table 12  Sensitivity analysis of AmDPC algorithm to parameter \(C_1\)
Datasets [Range of \(C_1\)] | NMI             | ACC             | ARI             | FM
Pathbased [2.5–2.6]         | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
Halfkernel [4.0–20.0]       | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
Zelink1 [0.03–0.09]         | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
Breast [2.06–2.14]          | 0.1123 ± 0.0000 | 0.7437 ± 0.0000 | 0.1965 ± 0.0000 | 0.7384 ± 0.0000
Pima [0.429–0.436]          | 0.0744 ± 0.0000 | 0.6901 ± 0.0000 | 0.1284 ± 0.0000 | 0.6928 ± 0.0000
Glass [1.27–1.66]           | 0.4008 ± 0.0000 | 0.4860 ± 0.0000 | 0.1946 ± 0.0000 | 0.5700 ± 0.0000
Table 13  Sensitivity analysis of AmDPC algorithm to parameter \(C_2\)
Datasets [Range of \(C_2\)] | NMI             | ACC             | ARI             | FM
Pathbased [3.0–3.5]         | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
Halfkernel [2.0–5.2]        | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
Zelink1 [0.03–0.09]         | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000
Breast [1.7–2.1]            | 0.1123 ± 0.0000 | 0.7437 ± 0.0000 | 0.1965 ± 0.0000 | 0.7384 ± 0.0000
Pima [0.21–0.2105]          | 0.0744 ± 0.0000 | 0.6901 ± 0.0000 | 0.1284 ± 0.0000 | 0.6928 ± 0.0000
Glass [1.0–2.0]             | 0.4008 ± 0.0000 | 0.4860 ± 0.0000 | 0.1946 ± 0.0000 | 0.5700 ± 0.0000
Next, we further analyze the clustering performance of the AmDPC, REDPC and DPC algorithms with respect to the parameter p on the different datasets. Figures 15, 16, 17, 18, 19 and 20 compare the clustering evaluation metrics on the different datasets for p = {1, ..., 8}. In these figures, the horizontal axis is the value of p (with the remaining parameters held constant) and the vertical axes are the NMI, ACC, ARI and FM indicator values, so that the behavior of the common parameter p of the three algorithms can be examined directly.
Figure 15 shows the performance of the DPC, REDPC and AmDPC methods on the Pathbased dataset for different values of p. The NMI, ACC, ARI and FM values of the AmDPC algorithm are higher than those of the REDPC and DPC methods and gradually stabilize as p varies, indicating that the overall clustering performance of the AmDPC algorithm with respect to p is better than that of the DPC and REDPC methods on this dataset. In Fig. 16, the AmDPC algorithm achieves the best clustering results on the Halfkernel dataset for p in the range of five to eight, whereas neither the REDPC nor the DPC algorithm reaches the best NMI, ACC, ARI and FM values. Figures 15 and 16 therefore show that the clustering performance of the AmDPC algorithm across the different parameter values is higher than that of the DPC and REDPC algorithms. In Fig. 17, the AmDPC algorithm always has higher clustering index values than the DPC and REDPC algorithms over the range of p, achieves the best clustering effect and stabilizes overall, indicating higher stability and effectiveness. In Fig. 18, for the Breast dataset, the AmDPC algorithm is overall better and more stable than the DPC and REDPC algorithms in terms of the NMI, ACC, ARI and FM values. In Fig. 19, the NMI and ARI values for the Pima dataset are low, but the AmDPC algorithm remains overall higher than the REDPC and DPC algorithms across the different parameter values. In Fig. 20, the AmDPC algorithm has better clustering performance than the REDPC and DPC algorithms on all four metrics and tends to stabilize as p ranges from one to eight.
These figures show that the AmDPC algorithm has better clustering performance than the REDPC and DPC algorithms with respect to p on the different datasets, and that its overall trend is more stable. In summary, the clustering performance of the AmDPC algorithm for the parameter p is better than that of the DPC and REDPC methods, with better robustness and effectiveness.

Parameter settings of the AmDPC algorithm

In the previous subsection, we analyzed the sensitivity of the AmDPC algorithm to the parameters p, \(N_i\), \(C_1\) and \(C_2\). Next, we discuss how to set these parameters. Both p and \(N_i\) are used to generate the decision graph, and \(N_i\) additionally controls the partitioning of the low-density levels. Based on the clustering results obtained with different parameter values, we summarize how these four parameters can be set, especially for unknown data.
As mentioned before, p is an important parameter in the AmDPC, REDPC and DPC algorithms. For the AmDPC algorithm, p determines the quality of the generated decision graph and the relationship between points of different density. Figures 15, 16, 17, 18, 19 and 20 show the clustering indicator values on the different datasets for different values of p: the AmDPC algorithm obtains better and more stable clustering results in the range of one to eight. Therefore, for datasets with general structural features we usually set \(1 \leqslant p \leqslant 8\). For datasets where the density is concentrated and the high-density points cannot be determined, we generally set \(0< p < 1\). For datasets with complex structure and an uneven density distribution, we set \(8 < p \leqslant 80\).
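This guideline can be condensed into a small lookup, sketched below; the category names are ours, and deciding which category a dataset belongs to still relies on the user's judgment.

```python
def suggested_p_interval(kind="general"):
    """Suggested search interval for p, following the guideline above.

    kind is a rough, user-judged description of the dataset:
      "general"      - ordinary structural features
      "concentrated" - density is concentrated, high-density points unclear
      "complex"      - complex structure with uneven density distribution
    """
    intervals = {
        "general": (1, 8),       # 1 <= p <= 8
        "concentrated": (0, 1),  # 0 < p < 1 (open interval)
        "complex": (8, 80),      # 8 < p <= 80
    }
    return intervals[kind]
```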
The other parameters \(N_i\), \(C_1\) and \(C_2\) depend more strongly on the dataset, and different values need to be set for different datasets. The parameter \(N_i\) is used in the generation of the decision graph and in the processing of low-density data, and it has an important influence on whether some low-density points are merged or kept as a separate class. To choose an appropriate value, the user should check whether the sub-clustering centroid decision graph contains several consecutive, relatively dense low-density points; if so, a lower value should be used, typically \(2 \leqslant N_i \leqslant 8\). In this case the threshold segmentation line A lies to the right of the low-density points in the decision graph and the threshold segmentation line B lies to the left of the high-density points (see Fig. 2c). If there are no such consecutive low-density points, a larger \(N_i\) can be chosen so that the threshold partition line B is smaller than the low-density points; for a dataset with n instances, we usually set \(8 < N_i \leqslant n\) in this case.
The parameters \(C_1\) and \(C_2\) are used to obtain the sub-clustering centroids, which helps to separate the different density levels while extracting high-quality low-density points. Reasonable values of \(C_1\) and \(C_2\) divide the decision graph into multiple clear sub-clustering centroid regions along the x-axis. In practice, a smaller value of \(C_1\) or \(C_2\) can be set if no clear low-density points exist. Although \(C_1\) and \(C_2\) have an important influence on the generation of sub-clustering centroids and on the clustering results, they are usually easy to choose in practice. Let \(\delta _{max}\) denote the maximum distance in the decision graph. For datasets with low-density points, we usually set \(20\%\delta _{max} \leqslant C_1 \leqslant 70\%\delta _{max}\); since the number of low-density points in the \(C_2\) region is relatively small, we generally set \(10\%\delta _{max} \leqslant C_2 \leqslant 80\%\delta _{max}\). For datasets that do not contain low-density points, we set \(C_1 > \delta _{max}\) or \(C_2 > \delta _{max}\). The specific values of \(C_1\) and \(C_2\) can then be refined by observing the minimum distance of the low-density points to be selected in the decision graph.
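A minimal sketch of turning this guideline into candidate search intervals is given below; `delta` stands for the array of distance values read off the decision graph, and the percentages are simply the bounds quoted above.

```python
def c_parameter_ranges(delta):
    """Candidate search intervals for C_1 and C_2 derived from the largest
    distance delta_max in the decision graph, following the guideline above."""
    d_max = max(delta)
    c1_range = (0.2 * d_max, 0.7 * d_max)  # 20%..70% of delta_max
    c2_range = (0.1 * d_max, 0.8 * d_max)  # 10%..80% of delta_max
    return c1_range, c2_range

# For a dataset without clear low-density points, the guideline is instead to
# choose C_1 > delta_max or C_2 > delta_max, e.g. 1.05 * max(delta).
```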
Compared with the DPC method, the AmDPC algorithm introduces three additional parameters, but reasonable values that yield the best clustering results can be found for all three by observing the decision graph. Extensive experiments show that the performance of the AmDPC algorithm is much better than that of the DPC method. The sensitivity tests show that, within the given parameter ranges, the clustering results of the AmDPC algorithm are more stable and better than those of the DPC algorithm. As Figs. 15, 16, 17, 18, 19 and 20 show, the AmDPC parameter values are insensitive within a reasonable range, and on the datasets used for the sensitivity tests AmDPC performs better than the DPC and REDPC methods.
In general, we compare the evaluation metrics of the state-of-the-art methods based on the values reported in Tables 2 and 3. Figure 14 shows that the AmDPC algorithm attains significantly higher evaluation metric values on these datasets than the other algorithms. On the synthetic datasets, the AmDPC algorithm achieves the best clustering results compared with the other methods. Most of these synthetic datasets are non-convex, and clustering the low-density points separately and then merging them leads to the best clustering effect very efficiently.
Figures 6 and 7 give the clustering results for high-density datasets without low-density regions; in Fig. 8 in particular, the best clustering results are obtained by the AP and AmDPC algorithms, showing that the algorithm also handles other types of datasets very well. On the real-world datasets and the Olivetti Face dataset, the AmDPC method is slightly better than the other methods, indicating that the improved algorithm has higher practical value (see Fig. 14 and Table 3). On the synthetic datasets, the NMI, ACC, ARI and FM values of the AmDPC method are generally better than those of the other advanced methods, and on the real-world datasets its clustering evaluation index values are generally higher than those of the other state-of-the-art algorithms. We can therefore conclude that the AmDPC method has higher clustering performance than the other advanced methods and possesses certain effectiveness and robustness.

Conclusion

In summary, the proposed AmDPC method overcomes the shortcoming of the DPC method that clustering centers must be selected manually, and it is very effective for clustering low-density regions with multiple peaks. We use the density deviation to measure the relationship between data points and the cut-off distance. The AmDPC algorithm has very good clustering performance on non-convex datasets with low-density peak points. By dividing the decision graph into density-deviation levels, the algorithm provides a novel clustering model for handling both low-density data points and data points in high-density regions. The simulation experiments show that the AmDPC method also clusters centroids in high-density areas very effectively, and that on the real-world and Olivetti Face datasets AmDPC achieves better clustering performance than the other methods. Therefore, the improved algorithm is robust and effective compared with other algorithms.
However, the AmDPC method needs to control several parameters simultaneously to achieve better clustering effects, so its parameter selection process has a certain complexity. In the future, we will employ a multi-objective optimization method to optimize these parameters so that the best clustering results can be obtained quickly.

Acknowledgements

The research was supported by the National Science Foundation of China under Grant No. 61572225 and 61472049, the foundation of JiLin Province Education Department under Grant No. JJKH20190724KJ, the Jilin Provincial Science & Technology Department Foundation under Grant No. 20190302071GX and 20200201164JC, the Development and Reform Commission Foundation of Jilin Province under Grant No. 2019C05311.

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.