A distance based clustering method for arbitrary shaped clusters in large datasets
Highlights
► Two distance based clustering methods are proposed for arbitrary shaped clusters.
► Clustering results of one method are exactly the same as those of the single-link method.
► Clustering results are analyzed experimentally and theoretically.
► The methods are significantly faster than the classical single-link method.
► The methods are highly suitable for large datasets as they scan the dataset at most twice.
Introduction
The clustering problem appears in many different fields such as data mining, pattern recognition, statistical data analysis, bio-informatics, etc. The clustering problem can be defined as follows. Let D = {x1, x2, …, xn} be a set of n patterns, where each xi is an N-dimensional vector in a given feature space. The clustering task is to find groups of patterns in D, called clusters, such that patterns within a cluster are more similar to each other than patterns in distinct clusters. Many clustering methods exist in the literature [1], [2], [3], [4].
Clustering methods are mainly divided into two categories based on the way they produce results viz., partitional clustering and hierarchical clustering methods.
Partitional clustering methods create a single clustering (flat clustering) of a dataset. Partitional methods can be further categorized into two classes based on the criteria used viz., distance based and density based. Distance based methods optimize a global criterion based on the distance between patterns. Some popular distance based clustering methods are k-means [5], CLARA [6], and CLARANS [7]. Density based partitional clustering methods optimize local criteria based on the density distribution of patterns. DBSCAN [8] and DenClue [9] are clustering methods of this type.
Hierarchical clustering methods create a sequence of nested clusterings of a dataset. These methods can also be categorized into two classes viz., density based and distance based. Clustering methods like single-link [10], complete-link [11], and average-link [12] are distance based hierarchical clustering methods. Clustering methods like OPTICS [13] and Chameleon [14] are density based hierarchical clustering methods.
The single-link clustering method can find arbitrary shaped clusters in many applications such as image segmentation, spatial data mining, and geological mapping. It builds a dendrogram in which each level represents a clustering of the dataset. A suitable clustering is chosen from the dendrogram based on requirements. Selection of a clustering from the dendrogram can be done by specifying various stopping conditions as follows [15]:
- (a)
Distance-h stopping condition: Distance between any pair of clusters in the clustering should be greater than h.
- (b)
k-cluster stopping condition: Number of clusters in the clustering should be k.
- (c)
Scale-s stopping condition: Let dmax be the maximum pairwise distance between any two patterns in the dataset. Then the stopping condition is: the distance between any pair of clusters should be greater than s · dmax, where 0 < s ≤ 1.
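For concreteness, the distance-h stopping condition can be illustrated with a naive agglomerative single-link procedure. The sketch below (Python, Euclidean distance; the function name is ours) is O(n³) and is meant only to show the stopping rule, not the implementation studied in the paper.

```python
import math
from itertools import combinations

def single_link(patterns, h):
    """Naive agglomerative single-link clustering with the distance-h
    stopping condition: repeatedly merge the closest pair of clusters
    until every pairwise single-link distance exceeds h."""
    clusters = [[tuple(p)] for p in patterns]
    while len(clusters) > 1:
        # single-link distance between two clusters = minimum pairwise
        # distance between their members
        (i, j), d = min(
            (((i, j), min(math.dist(a, b)
                          for a in clusters[i] for b in clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        if d > h:  # distance-h stopping condition
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

The k-cluster stopping condition would instead replace the distance test with a check on `len(clusters)`.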
In this paper, we propose a distance based clustering method suitable for clustering large datasets, obtained by combining the leaders clustering method (a partitional method) and the SL method. We call this hybrid method the leader-single-link (l-SL) method (l stands for leaders). The l-SL method is significantly faster than the SL method. However, the clustering results produced by the l-SL method may deviate from the final clustering produced by the SL method. Various experimental studies show that these deviations are nominal. To overcome this deviation (if any), a correction scheme is also proposed, which we call the al-SL method. The execution time of the al-SL method is marginally higher than that of the l-SL method, but it is still faster than the SL method. In this paper, we also prove that the clustering results produced by the al-SL method and the SL method are identical. Experimental results with standard real world and synthetic datasets show the effectiveness of the proposed clustering methods on large datasets.
The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 gives a brief background of the proposed clustering methods. Section 4 describes the proposed distance based clustering method (l-SL) and the relationship between the SL and l-SL methods. Section 5 describes the proposed improvement of the l-SL method. Experimental results and conclusions are discussed in Sections 6 and 7, respectively.
Section snippets
Related work
In this section, recent works related to the proposed schemes are discussed. Dash et al. [17] proposed a fast hierarchical clustering method based on partially overlapping partitions (POP). It has two phases. In the first phase, the dataset is partitioned into a number of overlapping cells. The closest-pair distance is calculated for each cell, and if the overall closest-pair distance is less than a threshold δ, then the pair is merged. This process continues until the overall closest-pair distance exceeds δ.
Background of the proposed method
Two widely used clustering methods viz., the leaders clustering method and the single-link clustering method, are discussed in this section. Each has its own advantages and disadvantages; the proposed clustering methods combine the two so as to exploit their complementary strengths.
Proposed clustering method
The proposed l-SL clustering method is a hybrid scheme combining the two techniques above (i.e., the leaders and SL methods). The l-SL method needs only a single parameter h (i.e., the distance between any pair of final clusters must be greater than h). Like other clustering methods (k-means, DBSCAN), we assume that the value of the parameter h is known beforehand. In this scheme, the set of leaders produced by the leaders clustering method is used as a representative set of the data. Subsequently, the SL method
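A minimal, self-contained sketch of the l-SL idea described above, assuming Euclidean distance; the function names, the leader threshold parameter `tau`, and the naive cubic single-link step over the leaders are our illustrative choices, not the paper's implementation.

```python
import math
from itertools import combinations

def leaders(patterns, tau):
    """One scan of the dataset: each pattern joins the first leader
    within distance tau; otherwise it becomes a new leader itself."""
    followers = {}  # leader -> its followers (including itself)
    for x in map(tuple, patterns):
        for l in followers:
            if math.dist(x, l) <= tau:
                followers[l].append(x)
                break
        else:
            followers[x] = [x]
    return followers

def l_sl(patterns, h, tau):
    """l-SL sketch: run single-link (naive, with the distance-h stopping
    condition) on the leaders only; each final cluster is the union of
    its leaders' follower sets."""
    followers = leaders(patterns, tau)
    groups = [[l] for l in followers]  # each leader starts as a cluster
    while len(groups) > 1:
        (i, j), d = min(
            (((i, j), min(math.dist(a, b)
                          for a in groups[i] for b in groups[j]))
             for i, j in combinations(range(len(groups)), 2)),
            key=lambda t: t[1])
        if d > h:
            break
        groups[i] += groups[j]
        del groups[j]
    return [[p for l in g for p in followers[l]] for g in groups]
```

With `tau` small relative to h, the leaders act as a compressed representative set, so the expensive single-link step runs on far fewer points than the full dataset.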
Augmented l-SL (al-SL) clustering method
As discussed earlier, a few clusters in the final results of the l-SL method may need to be merged in order to obtain exactly the final results of the SL method. Let the clustering result of the l-SL method be a clustering π. Let (Bi, Bj) be a pair of clusters in π. If there exists a pair of patterns x ∈ Bi and y ∈ Bj such that ||x − y|| ≤ h, then Bi and Bj must be merged together. In general, one needs to search all pairs of clusters in π to see whether they can be merged. The
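The merging step described above can be sketched as a brute-force check (our illustrative Python; the paper's al-SL procedure avoids examining all pattern pairs):

```python
import math
from itertools import combinations

def merge_close_clusters(clusters, h):
    """Correction sketch: merge any two clusters that contain a pattern
    pair at distance <= h, repeating until no such pair remains."""
    clusters = [list(map(tuple, c)) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if any(math.dist(x, y) <= h
                   for x in clusters[i] for y in clusters[j]):
                clusters[i] += clusters[j]
                del clusters[j]
                merged = True
                break  # restart scan over the updated clustering
    return clusters
```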
Experimental evaluation
In this section, we evaluate the proposed methods (l-SL and al-SL) experimentally. From the discussion in Section 2, it is clear that the scheme proposed in [17] is not directly related to our schemes. The method proposed in [18] keeps the dataset as well as the distance matrix for the entire data in primary memory. The scheme proposed in [20] needs many parameters to be adjusted to produce the same clustering as the single-link method, and these are difficult to determine. This method also requires the whole
Conclusions
The SL method can find arbitrary (non-convex) shaped clusters in a dataset. However, the SL method scans the dataset many times and has time and space complexity of O(n²). So, the SL method is not suitable for large datasets. In this paper, we proposed two clustering methods to detect arbitrary shaped clusters in large datasets. The first method, l-SL, takes considerably less time than the SL method and scans the dataset only once. Clustering results produced by l-SL are very close to that of
Acknowledgments
The authors are thankful to Prof. Vijay S. Iyengar, Fellow, IEEE and Dr. Sanasam Ranbir Singh, IIT Guwahati, for their valuable suggestions. This work is supported by the Council of Scientific & Industrial Research (CSIR), Government of India.
References (35)
- et al., Fast hierarchical clustering and its validation, Data & Knowledge Engineering (2003)
- et al., Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification, Pattern Recognition (2006)
- et al., A multi-prototype clustering algorithm, Pattern Recognition (2009)
- et al., Rough-DBSCAN: a fast hybrid density based clustering method for large data sets, Pattern Recognition Letters (2009)
- Parallel algorithms for hierarchical clustering, Parallel Computing (1995)
- et al., Pattern Classification (2000)
- et al., Data clustering: a review, ACM Computing Surveys (1999)
- et al., Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
- et al., Survey of clustering algorithms, IEEE Transactions on Neural Networks (2005)
- Some methods for classification and analysis of multivariate observations
- Finding Groups in Data: An Introduction to Cluster Analysis
- CLARANS: a method for clustering objects for spatial data mining, IEEE Transactions on Knowledge and Data Engineering
- A density-based algorithm for discovering clusters in large spatial databases with noise
- An efficient approach to clustering in large multimedia databases with noise
- Numerical Taxonomy
- Step-wise clustering procedures, Journal of the American Statistical Association
- Data Mining: Introductory and Advanced Topics
Bidyut Kr. Patra received B.Sc. (Physics), B.Tech. (CSE) and M.Tech. (CSE) from Calcutta University, Kolkata, India in 1996, 1999 and 2001, respectively. He has been working for his Ph.D. at the Indian Institute of Technology-Guwahati, India since July, 2005. He worked as a Lecturer at Haldia Institute of Technology, Haldia, West Bengal, India during August, 2001-June, 2005. Currently, he is an Assistant Professor at Tezpur Central University, Tezpur, Assam, India. His research interests include Pattern Recognition, Data Mining and Algorithms.
Sukumar Nandi received B.Sc. (Physics), B.Tech. and M.Tech. degrees from Calcutta University in 1984, 1987 and 1989, respectively. He received the Ph.D. degree in Computer Science and Engineering from the Indian Institute of Technology Kharagpur in 1995. In 1989-90 he was a faculty member at Birla Institute of Technology, Mesra, Ranchi, India. From 1991 to 1995, he was a scientific officer in Computer Science & Engineering, Indian Institute of Technology Kharagpur. In 1995 he joined the Indian Institute of Technology Guwahati as an Assistant Professor in Computer Science and Engineering. Subsequently, he became Associate Professor in 1998 and Professor in 2002. Presently, he is the Dean of Academic Affairs of the Indian Institute of Technology Guwahati. He was with the School of Computer Engineering, Nanyang Technological University, Singapore as a Visiting Senior Fellow for one year (2002-2003). He was a member of the Board of Governors, Indian Institute of Technology Guwahati, for 2005 and 2006. He was General Vice-Chair of the 8th International Conference on Distributed Computing and Networking (ICDCN) 2006 and General Co-Chair of the 15th International Conference on Advanced Computing and Communication (ADCOM) 2007. He is also involved in several international conferences as a member of the advisory board/technical programme committee, and is a reviewer for several international journals and conferences. He is co-author of a book titled Theory and Application of Cellular Automata published by the IEEE Computer Society, and has published more than 150 journal/conference papers. His research interests are Computer Networks (Traffic Engineering, Wireless Networks), Computer and Network Security, and Data Mining. He is a Senior Member of IEEE, a Senior Member of ACM, a Fellow of the Institution of Electronics and Telecommunication Engineers and a Fellow of the Institution of Engineers (India).
P. Viswanath received his M.Tech. (CSE) from the Indian Institute of Technology-Madras, Chennai, India in 1996. From 1996 to 2001, he worked as a faculty member at BITS-Pilani, India and Jawaharlal Nehru Technological University, Hyderabad, India. He received his Ph.D. from the Indian Institute of Science, Bangalore, India in 2005. He received the Alumni Medal from IISc-Bangalore, as his Ph.D. thesis was selected as the best thesis in the Electrical Sciences Division of IISc-Bangalore in the year 2006. He worked as an Assistant Professor in the Department of Computer Science and Engineering, IIT-Guwahati, Guwahati, India from February 2005 to July 2008. At present he is working as a Professor and Dean, R&D (Electrical Sciences) at Rajeev Gandhi Memorial College of Engineering & Technology, Nandyal, Andhra Pradesh, India. His areas of interest include Pattern Recognition, Data Mining, Databases and Algorithms.