A distance based clustering method for arbitrary shaped clusters in large datasets

doi:10.1016/j.patcog.2011.04.027

Pattern Recognition

Volume 44, Issue 12, December 2011, Pages 2862-2870

https://doi.org/10.1016/j.patcog.2011.04.027 Get rights and content

Abstract

Clustering has been widely used in different fields of science, technology, social science, etc. Naturally, clusters are in arbitrary (non-convex) shapes in a dataset. One important class of clustering is distance based method. However, distance based clustering methods usually find clusters of convex shapes. Classical single-link is a distance based clustering method, which can find arbitrary shaped clusters. It scans dataset multiple times and has time requirement of $O (n^{2})$ , where n is the size of the dataset. This is potentially a severe problem for a large dataset. In this paper, we propose a distance based clustering method, l-SL to find arbitrary shaped clusters in a large dataset. In this method, first leaders clustering method is applied to a dataset to derive a set of leaders; subsequently single-link method (with distance stopping criteria) is applied to the leaders set to obtain final clustering. The l-SL method produces a flat clustering. It is considerably faster than the single-link method applied to dataset directly. Clustering result of the l-SL may deviate nominally from final clustering of the single-link method (distance stopping criteria) applied to dataset directly. To compensate deviation of the l-SL, an improvement method is also proposed. Experiments are conducted with standard real world and synthetic datasets. Experimental results show the effectiveness of the proposed clustering methods for large datasets.

Highlights

► Two distance based clustering methods are proposed for arbitrary shaped clusters. ► Clustering results of one method are exactly same as the single-link method. ► Clustering results are analyzed experimentally and theoretically. ► Methods are significantly faster than classical single-link method. ► Methods are highly suitable for large datasets as they scan dataset atmost twice.

Introduction

Clustering problem appears in many different fields like data mining, pattern recognition, statistical data analysis, bio-informatics, etc. Clustering problem can be defined as follows. Let $D = {x_{1}, x_{2}, x_{3}, \dots, x_{n}}$ be a set of n patterns, where each x_i is a N-dimensional vector in a given feature space. Clustering activity is to find groups of patterns, called clusters in $D$ , in such a way that patterns in a cluster are more similar to each other than patterns in distinct clusters. There exist many clustering methods in the literature [1], [2], [3], [4].

Clustering methods are mainly divided into two categories based on the way they produce results viz., partitional clustering and hierarchical clustering methods.

Partitional clustering methods create a single clustering (flat clustering) of a dataset. Partitional methods can be further categorized into two classes based on the criteria used viz., distance based and density based. Distance based methods optimize a global criteria based on the distance between patterns. Some of the popular distance based clustering methods are k-means [5], CLARA [6], and CLARANS [7]. Density based partitional clustering methods optimize local criteria based on density distribution of patterns. DBSCAN [8] and DenClue [9] are of this type of clustering methods.

Hierarchical clustering methods create a sequence of nested clusterings of a dataset. These methods can also be categorized into two classes viz., density based and distance based. Clustering methods like single-link [10], complete-link [11], and average-link [12] are distance based hierarchical clustering methods. Clustering methods like OPTICS [13] and Chameleon [14] are density based hierarchical clustering methods.

Single-link clustering method can find arbitrary shaped clusters in many applications such as image segmentation, spatial data mining, geological mapping. It builds a dendogram where each level represents a clustering of the dataset. A suitable clustering is chosen from the dendogram based on requirements. Selection of a clustering from the dendogram can be done by specifying various stopping conditions as follow [15]:

(a)
Distance-h stopping condition: Distance between any pair of clusters in the clustering should be greater than h.
(b)
k-cluster stopping condition: Number of clusters in the clustering should be k.
(c)
Scale- $α$ stopping condition: Let $ρ^{⁎}$ be the maximum pairwise distance between any two patterns in the dataset. Then, the stopping condition is: distance between any pair of clusters should be greater than $α ρ^{⁎},$ where $0 < α < 1 .$

In this paper, we consider single-link clustering method with distance-h stopping condition and it is referred to as SL method in the remaining part of the paper. In the study [16], Krause exploited SL method in computational biology. However, SL method has the following drawbacks: (i) time complexity of

O (n^{2})

and (ii) scanning the dataset many times. Therefore, this method is not suitable for large datasets.

In this paper, we propose a distance based clustering method which is suitable for clustering large datasets by combining leaders clustering method (partitional clustering) and SL method. We call this hybrid method as leader-single-link (l-SL) method, (l stands for leaders). The l-SL method is significantly faster than the SL method. However, the clustering results produced by l-SL method may deviate from the final clustering produced by the SL method. From various experimental studies, it is found that deviations are nominal. To overcome this deviation (if any), a correction scheme is also proposed and we call it as al-SL method. Execution time of the al-SL method is marginally higher than the l-SL method, but faster than the SL method. In this paper, we also prove that clustering results produced by the al-SL method and the SL method are identical. Experimental results with standard real world and synthetic datasets show the effectiveness of the proposed clustering methods in large datasets.

The rest of the paper is organized as follows. Section 2 describes a summary of related works. Section 3 describes a brief background of the proposed clustering methods. Section 4 describes proposed distance based clustering method (l-SL) and a relationship between SL and l-SL methods. Section 5 describes proposed improvement of the l-SL method. Experimental results and conclusions are discussed in Section 6 and Section 7, respectively.

Section snippets

Related work

In this section, recent works related to the proposed schemes are discussed. Dash et al. [17] proposed a fast hierarchical clustering method based on partially overlapping partitions (POP). It has two phases. In first phase, dataset is partitioned into a number of overlapping cells. Closest pair-distance is calculated for each cell and if overall closest pair-distance is less than a threshold $δ$ , then the pair is merged. This process continues until overall closest pair distance is more than $δ$ .

Background of the proposed method

Two widely used clustering methods viz., leaders clustering and single-link clustering method are discussed in this section. These two clustering methods have their own advantages and disadvantages. The proposed clustering methods exploit these two clustering methods.

Proposed clustering method

The proposed l-SL clustering method is a hybrid scheme with a combination of above two techniques (i.e. leaders and SL method). The l-SL method needs only a single parameter h (i.e. distance between a pair of clusters is more than h). Like other clustering methods (k-means, DBSCAN), we assume that the value of the parameter (h) is known before hand. In this scheme, the set of leaders produced by the leaders clustering method is used as a representative set of data. Subsequently, the SL method

Augmented l-SL (al-SL) clustering method

As discussed earlier, a few clusters in the final results of l-SL method may be required to be merged in order to obtain the exact final results as that of the SL method. Let the clustering result of the l-SL be a clustering $π_{l} = {B_{1}, B_{2}, \dots, B_{k}}$ . Let $(B_{i}, B_{j})$ be a pair of clusters in $π_{l}$ . If there exists a pair of patterns $x \in B_{i}$ and $y \in B_{j}$ such that $∥ x - y ∥ \leq h$ , then B_i and B_j must be merged together. In general, one needs to search for all pairs of clusters in $π_{l}$ to see whether they can be merged. The

Experimental evaluation

In this section, we evaluate the proposed methods (l-SL and al-SL) experimentally. From discussion in Section 2, it is clear that scheme proposed in [17] is not directly related to our schemes. The method proposed in [18] keeps dataset as well as distance-matrix for the entire data in primary memory. The scheme proposed in [20] needs many parameters to be adjusted to produce same clustering as that of the single-link method, which are difficult to determine. This method also requires whole

Conclusions

The SL method can find arbitrary (non-convex) shaped clusters in a dataset. However, SL method scans dataset many times. It has time and space complexity of $O (n^{2})$ . So, SL method is not suitable for large datasets. In this paper, we proposed two clustering methods to detect arbitrary shaped clusters for large datasets. The first method, l-SL takes considerably less time compared to that of the SL method and scans the dataset only once. Clustering results produced by l-SL is very close to that of

Acknowledgments

The authors are thankful to Prof. Vijay S. Iyengar, Fellow, IEEE and Dr. Sanasam Ranbir Singh, IIT Guwahti for their valuable suggestions. This work is supported by Council of Scientific & Industrial Research (CSIR), Government of India.

Bidyut Kr. Patra received B.Sc. (Physics), B.Tech. (CSE) and M.Tech. (CSE) from Calcutta University, Kolkata, India in 1996, 1999 and 2001, respectively. He has been working for his Ph.D. at the Indian Institute of Technology-Guwahati, India since July, 2005. He worked as a Lecturer at Haldia Institute of Technology, Haldia, West Bengal, India during August, 2001-June, 2005. Currently, he is an Assistant Professor at Tezpur Central University, Tezpur, Assam, India. His research interests

References (35)

M. Dash et al.
Fast hierarchical clustering and its validation
Data & Knowledge Engineering
(2003)
P.A. Vijaya et al.
Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification
Pattern Recognition
(2006)
M. Liu et al.
A multi-prototype clustering algorithm
Pattern Recognition
(2009)
P. Viswanath et al.
Rough-DBSCAN: a fast hybrid density based clustering method for latge data sets
Pattern Recognition Letters
(2009)
C.F. Olson
Parallel algorithms for hierarchical clustering
Parallel Computing
(1995)
R.O. Duda et al.
Pattern Classification
(2000)
A.K. Jain et al.
Data clustering: a review
ACM Computing Surveys
(1999)
A.K. Jain et al.
Statistical pattern recognition: a review
IEEE Transactions on Pattern Analysis and Machine Intelligence
(2000)
R. Xu et al.
Survey of clustering algorithms
IEEE Transactions on Neural Networks
(2005)
J.B. MacQueen
Some methods for classification and analysis of multivariate observations

L. Kaufman et al.

Finding Groups in Data: An Introduction to Cluster Analysis

(1990)

R.T. Ng et al.

CLARANS: a method for clustering objects for spatial data mining

IEEE Transactions on Knowledge and Data Engineering

(2002)

M. Ester et al.

A density-based algorithm for discovering clusters in large spatial databases with noise

A. Hinneburg et al.

An efficient approach to clustering in large multimedia databases with noise

P.H.A. Sneath et al.

Numerical Taxonomy

(1973)

B. King

Step-wise clustering procedures

Journal of the American Statistical Association

(1967)

M.H. Dunham

Data Mining: Introductory and Advanced Topics

(2003)

Cited by (42)

Clustering and forecasting of day-ahead electricity supply curves using a market-based distance
2024, International Journal of Electrical Power and Energy Systems
Gathering knowledge of supply curves in electricity markets is critical to both energy producers and regulators. Indeed, power producers strategically plan their generation of electricity considering various scenarios to maximize profit, leveraging the characteristics of these curves. For their part, regulators need to forecast the supply curves to monitor the market’s performance and identify market distortions. However, the prevailing approaches in the technical literature for analyzing, clustering, and predicting these curves are based on structural assumptions that electricity supply curves do not satisfy in practice, namely, boundedness and smoothness. Furthermore, any attempt to satisfactorily cluster the supply curves observed in a market must take into account the market’s specific features.
Against this background, this article introduces a hierarchical clustering method based on a novel weighted-distance that is specially tailored to non-bounded and non-smooth supply curves and embeds information on the price distribution of offers, thus overcoming the drawbacks of conventional clustering techniques. Once the clusters have been obtained, a supervised classification procedure is used to characterize them as a function of relevant market variables. Additionally, the proposed distance is used in a learning procedure by which explanatory information is exploited to forecast the supply curves in a day-ahead electricity market. This procedure combines the idea of nearest neighbors with a machine-learning method. The prediction performance of our proposal is extensively evaluated and compared against two nearest-neighbor benchmarks and existing competing methods. To this end, supply curves from the markets of Spain, Pennsylvania-New Jersey-Maryland (PJM), and West Australia are considered.
On the approximation of Euclidean SL via geometric method
2023, Information Sciences
Single-linkage with distance-r stopping condition (SL) is a classical clustering technique, which can discover arbitrary shaped clusters. However, its $O (n^{2})$ worst-case time complexity is still a challenge, where n is the number of points in the dataset. In this paper, we present a relaxed version of the problem in Euclidean space, called the ρ-approximate SL problem. We propose a grid-based algorithm, called ρ-SL, to solve this problem in $O (n + ν)$ expected time, where ν is an integer associated with the maximum and minimum coordinates of each dimension. ρ-SL first partitions the dataset into non-empty grids and constructs a grid tree to organize them for efficient non-empty neighboring grid queries. Then, an initial cluster structure is established to keep track of all pairs of grids that may be mergeable. ρ-SL proceeds to identify the pairs of grids that are mergeable by a fast merging algorithm which takes into account the spatial relationships among points to reduce unnecessary distance calculations. Experiments were conducted on synthetic and real-world datasets to evaluate the performance of ρ-SL. The results show that the proposed algorithm is more efficient than SL and its fastest variants.
Sustainability efficiency assessment of wastewater treatment plants in China: A data envelopment analysis based on cluster benchmarking
2020, Journal of Cleaner Production
Citation Excerpt :
Cluster benchmarking is a method of dividing a set of DMUs into groups (i.e. clusters) according to certain attributes, so DMUs in the same cluster are more similar to each other than to DMUs in other clusters. The major purpose of grouping is to maximize the homogeneity of DMUs in the same cluster and the heterogeneity of DMUs in different clusters (Patra et al., 2011). To achieve the efficiency evaluation of objects relative to the optimal practice under the respective clusters, it is necessary to construct the production frontier separately (Zhu, 2013).
Quantitative evaluation on the efficiency of wastewater treatment plants (WWTPs) is a key issue that needs to be solved. For this purpose, data envelopment analysis (DEA) was employed to establish a comprehensive efficiency evaluation system on WWTPs, including three inputs of operating cost, electricity consumption and labor, three desirable outputs of chemical oxygen demand (COD) removal rate, ammonia nitrogen (NH₃–N) removal rate and reclaimed water yield, and one undesirable output of dry sludge yield. 861 WWTPs in China were assessed by a slacked-based DEA model based on cluster benchmarking. The technology gap ratio (TGR) confirmed that large WWTPs operated more efficiently than small ones. The WWTPs had an average efficiency score of 0.611. Among them, 170 samples were relatively efficient with a score of 1, which means these samples could be a benchmark for other inefficient samples. Different degrees of input excesses or output shortfalls existed in 691 inefficient samples and these samples should be the key objects to improve the operational efficiency. Furthermore, through the Kruskal-Wallis test, the influent COD concentration and capacity load rate showed significant effects on the WWTP performance. These findings, derived from a simple but effective framework, have potential value for managers to make decisions.
A novel complex network community detection approach using discrete particle swarm optimization with particle diversity and mutation
2019, Applied Soft Computing Journal
Citation Excerpt :
Cheng et al. [29] combined the network density and distance construct Density Order Tree (DOT) to represent the original data, and then transform the network community detection problem into the DOT classification problem. Whang et al. [12] proposed an effective overlapping community detection algorithm based on the seed extension method. Montero [30] proposed an algorithm for fuzzy community detection based on N-dimensional clustering and overlapping functions to detect fuzzy community detection output quality.
Community detection in complex network is a hot issue in the field of complex network. This paper proposed a community detection approach in complex network using discrete particle swarm optimization with particle diversity and mutation (DPSO-PDM) strategy, aiming at the modularity optimization. Firstly, this paper redefined particle coding, particle velocity, particle position and evolutionary operation in discretization, which effectively solve the problem of traditional evolutionary algorithm in need of prior knowledge. Secondly, considering the traditional particle swarm optimization algorithm is easy to be trapped in the local optimum; this paper used the hybridizing inertia weight adjustment strategy based on new particle diversity and adaptive mutation strategy to avoid the local convergence of the algorithm. Finally this paper applied the DPSO-PDM algorithm to artificial baseline network data sets and real network data sets. Theoretic analysis and experimental results have shown that DPSO-PDM is effective to detect community structure with stable community division quality and global convergence in complex network.
A novel multi-objective evolutionary algorithm with dynamic decomposition strategy
2019, Swarm and Evolutionary Computation
In this paper, a novel multi-objective evolutionary algorithm (MOEA) is proposed with dynamic decomposition strategy, called MOEA/D-DDS. After the recombination, all the parents and offspring populations both with the size N are combined as a union population in environmental selection, which is then associated to the preset N weight vectors using the constrained decomposition approach. By counting the number of solutions that fall within the feasible region of each subproblem, the number of subproblems that are not associated to any solution can be calculated and recorded by $T_{\max}$ , which somehow shows the distribution of union population and also indicates the number of weight vectors (i.e., subproblems) to be regenerated. Then, the subproblem associated with the largest number of solutions will be found and then further divided into two new subproblems using the proposed dynamic decomposition strategy. This process of dynamic decomposition will be run $T_{\max}$ times in order to have N subproblems associated with at least one solution. At last, a simple convergence indicator is used to select one solution showing the best convergence for each of these N subproblems. Twenty-six well-known test problems are employed to challenge the performance of MOEA/D-DDS and the experiments validate the superiority of MOEA/D-DDS over six recently proposed MOEAs.
Nearest neighbors based density peaks approach to intrusion detection
2018, Chaos, Solitons and Fractals
Citation Excerpt :
The k-nearest neighbor is a kind of typical nonparametric algorithm [35]. The judgement standard of kNN is the distance [36]. Compared with other classification algorithms, it is simpler and more efficient [37].
Intrusion detection systems are very important for network security. However, traditional intrusion detection systems can not identify new type of network intrusion for example zero-day attack. Many machine learning techniques were used in intrusion detection system and they showed better detection performance than other methods. A novel clustering algorithm called Density peaks clustering (DPC) which does not need many parameters and its iterative process is based on density. Because of its simple steps and parameters, it may have many application fields. So we are going to use it in intrusion detection to find a more accurate and efficient classifier. On the basis of some good ideas of DPC, this paper proposes a hybrid learning model based on k-nearest neighbors (kNN) in order to detect attacks more effectively and introduce the density in kNN. In density peaks nearest neighbors (DPNN), KDD-CUP 99 which is the standard dataset in intrusion detection is used to the experiment. Then, we use the dataset to train and calculate some parameters which are used in this algorithm. Finally, the DPNN classifier is used to classify attacks. Experiment results suggest that the DPNN performs better than support vector machine (SVM), k-nearest neighbors (kNN) and many other machine learning methods, and it can effectively detect intrusion attacks and has a good performance in accuracy.

View all citing articles on Scopus

Sukumar Nandi received B.Sc. (Physics), B Tech and M Tech from Calcutta University in 1984, 1987 and 1989, respectively. He received the Ph. D. degree in Computer Science and Engineering from Indian Institute of Technology Kharagpur in 1995. In 1989-90 he was a faculty in Birla Institute of Technology, Mesra, Ranchi, India. During 1991 to 1995, he was a scientific officer in Computer Sc & Engg, Indian Institute of Technology Kharagpur. In 1995 he joined in Indian Institute of Technology Guwahati as an Assistant Professor in Computer Science and Engineering. Subsequently, he became Associate Professor in 1998 and Professor in 2002. Presently, he is the Dean of Academic Affairs of Indian Institute of Technology Guwahati. He was with the School of Computer Engineering, Nanyang Technological University, Singapore as Visiting Senior Fellow for one year (2002-2003). He was member of Board of Governor, Indian Institute of Technology Guwahati for 2005 and 2006. He was General Vice-Chair of 8th International Conference on Distributed Computing and Networking (ICDCN) 2006. He was General Co-Chair of the 15th International Conference on Advance Computing and Communication (ADCOM) 2007. He is also involved in several international conferences as member of advisory board/ Technical Programme Committee. He is reviewer of several international journals and conferences. He is co-author of a book titled Theory and Application of Cellular Automata published by IEEE Computer Society. He has published more than 150 Journals/Conferences papers. His research interests are Computer Networks (Traffic Engineering, Wireless Networks), Computer and Network security, Data mining. He is Senior Member of IEEE, Senior Member of ACM, Fellow of the Institution of Electronics and Telecommunication Engineers and Fellow of the Institution of Engineers (India).

P. Viswanth received his M.Tech (CSE) from the Indian Institute of Technology-Madras, Chennai, India in 1996. From 1996 to 2001, he worked as a faculty member at BITS-Pilani, India and Jawaharlal Nehru Technological University, Hyderabad, India. He received his Ph.D. from the Indian Institute of Science, Bangalore, India in 2005. He received Alumni Medal from IISc-Bangalore as his Ph.D. thesis was selected as the best thesis in the Electrical Sciences Division of IISc-Bangalore in the year 2006. He worked as Assistant Professor in the Department of Computer Science and Engineering, IIT-Guwahati, Guwahati, India from February 2005 to July 2008. At present he is working as a Professor and Dean, R&D (Electrical Sciences) at Rajeev Gandhi Memorial College of Engineering & Technology, Nandyal, Andhra Pradesh, India. His areas of interest include Pattern Recognition, Data Mining, Databases and Algorithms.

View full text

A distance based clustering method for arbitrary shaped clusters in large datasets

Abstract

Highlights

Introduction

Section snippets

Related work

Background of the proposed method

Proposed clustering method

Augmented l-SL (al-SL) clustering method

Experimental evaluation

Conclusions

Acknowledgments

Data & Knowledge Engineering

Pattern Recognition

Pattern Recognition

Pattern Recognition Letters

Parallel Computing

Pattern Classification

Data clustering: a review

ACM Computing Surveys

Statistical pattern recognition: a review

IEEE Transactions on Pattern Analysis and Machine Intelligence

Survey of clustering algorithms

IEEE Transactions on Neural Networks

Some methods for classification and analysis of multivariate observations

Finding Groups in Data: An Introduction to Cluster Analysis

CLARANS: a method for clustering objects for spatial data mining

IEEE Transactions on Knowledge and Data Engineering

A density-based algorithm for discovering clusters in large spatial databases with noise

An efficient approach to clustering in large multimedia databases with noise

Numerical Taxonomy

Step-wise clustering procedures

Journal of the American Statistical Association

Data Mining: Introductory and Advanced Topics