Elsevier

Pattern Recognition

Volume 44, Issue 12, December 2011, Pages 2862-2870
Pattern Recognition

A distance based clustering method for arbitrary shaped clusters in large datasets

https://doi.org/10.1016/j.patcog.2011.04.027Get rights and content

Abstract

Clustering has been widely used in different fields of science, technology, social science, etc. Naturally, clusters are in arbitrary (non-convex) shapes in a dataset. One important class of clustering is distance based method. However, distance based clustering methods usually find clusters of convex shapes. Classical single-link is a distance based clustering method, which can find arbitrary shaped clusters. It scans dataset multiple times and has time requirement of O(n2), where n is the size of the dataset. This is potentially a severe problem for a large dataset. In this paper, we propose a distance based clustering method, l-SL to find arbitrary shaped clusters in a large dataset. In this method, first leaders clustering method is applied to a dataset to derive a set of leaders; subsequently single-link method (with distance stopping criteria) is applied to the leaders set to obtain final clustering. The l-SL method produces a flat clustering. It is considerably faster than the single-link method applied to dataset directly. Clustering result of the l-SL may deviate nominally from final clustering of the single-link method (distance stopping criteria) applied to dataset directly. To compensate deviation of the l-SL, an improvement method is also proposed. Experiments are conducted with standard real world and synthetic datasets. Experimental results show the effectiveness of the proposed clustering methods for large datasets.

Highlights

► Two distance based clustering methods are proposed for arbitrary shaped clusters. ► Clustering results of one method are exactly same as the single-link method. ► Clustering results are analyzed experimentally and theoretically. ► Methods are significantly faster than classical single-link method. ► Methods are highly suitable for large datasets as they scan dataset atmost twice.

Introduction

Clustering problem appears in many different fields like data mining, pattern recognition, statistical data analysis, bio-informatics, etc. Clustering problem can be defined as follows. Let D={x1,x2,x3,,xn} be a set of n patterns, where each xi is a N-dimensional vector in a given feature space. Clustering activity is to find groups of patterns, called clusters in D, in such a way that patterns in a cluster are more similar to each other than patterns in distinct clusters. There exist many clustering methods in the literature [1], [2], [3], [4].

Clustering methods are mainly divided into two categories based on the way they produce results viz., partitional clustering and hierarchical clustering methods.

Partitional clustering methods create a single clustering (flat clustering) of a dataset. Partitional methods can be further categorized into two classes based on the criteria used viz., distance based and density based. Distance based methods optimize a global criteria based on the distance between patterns. Some of the popular distance based clustering methods are k-means [5], CLARA [6], and CLARANS [7]. Density based partitional clustering methods optimize local criteria based on density distribution of patterns. DBSCAN [8] and DenClue [9] are of this type of clustering methods.

Hierarchical clustering methods create a sequence of nested clusterings of a dataset. These methods can also be categorized into two classes viz., density based and distance based. Clustering methods like single-link [10], complete-link [11], and average-link [12] are distance based hierarchical clustering methods. Clustering methods like OPTICS [13] and Chameleon [14] are density based hierarchical clustering methods.

Single-link clustering method can find arbitrary shaped clusters in many applications such as image segmentation, spatial data mining, geological mapping. It builds a dendogram where each level represents a clustering of the dataset. A suitable clustering is chosen from the dendogram based on requirements. Selection of a clustering from the dendogram can be done by specifying various stopping conditions as follow [15]:

  • (a)

    Distance-h stopping condition: Distance between any pair of clusters in the clustering should be greater than h.

  • (b)

    k-cluster stopping condition: Number of clusters in the clustering should be k.

  • (c)

    Scale-α stopping condition: Let ρ be the maximum pairwise distance between any two patterns in the dataset. Then, the stopping condition is: distance between any pair of clusters should be greater than αρ, where 0<α<1.

In this paper, we consider single-link clustering method with distance-h stopping condition and it is referred to as SL method in the remaining part of the paper. In the study [16], Krause exploited SL method in computational biology. However, SL method has the following drawbacks: (i) time complexity of O(n2) and (ii) scanning the dataset many times. Therefore, this method is not suitable for large datasets.

In this paper, we propose a distance based clustering method which is suitable for clustering large datasets by combining leaders clustering method (partitional clustering) and SL method. We call this hybrid method as leader-single-link (l-SL) method, (l stands for leaders). The l-SL method is significantly faster than the SL method. However, the clustering results produced by l-SL method may deviate from the final clustering produced by the SL method. From various experimental studies, it is found that deviations are nominal. To overcome this deviation (if any), a correction scheme is also proposed and we call it as al-SL method. Execution time of the al-SL method is marginally higher than the l-SL method, but faster than the SL method. In this paper, we also prove that clustering results produced by the al-SL method and the SL method are identical. Experimental results with standard real world and synthetic datasets show the effectiveness of the proposed clustering methods in large datasets.

The rest of the paper is organized as follows. Section 2 describes a summary of related works. Section 3 describes a brief background of the proposed clustering methods. Section 4 describes proposed distance based clustering method (l-SL) and a relationship between SL and l-SL methods. Section 5 describes proposed improvement of the l-SL method. Experimental results and conclusions are discussed in Section 6 and Section 7, respectively.

Section snippets

Related work

In this section, recent works related to the proposed schemes are discussed. Dash et al. [17] proposed a fast hierarchical clustering method based on partially overlapping partitions (POP). It has two phases. In first phase, dataset is partitioned into a number of overlapping cells. Closest pair-distance is calculated for each cell and if overall closest pair-distance is less than a threshold δ, then the pair is merged. This process continues until overall closest pair distance is more than δ.

Background of the proposed method

Two widely used clustering methods viz., leaders clustering and single-link clustering method are discussed in this section. These two clustering methods have their own advantages and disadvantages. The proposed clustering methods exploit these two clustering methods.

Proposed clustering method

The proposed l-SL clustering method is a hybrid scheme with a combination of above two techniques (i.e. leaders and SL method). The l-SL method needs only a single parameter h (i.e. distance between a pair of clusters is more than h). Like other clustering methods (k-means, DBSCAN), we assume that the value of the parameter (h) is known before hand. In this scheme, the set of leaders produced by the leaders clustering method is used as a representative set of data. Subsequently, the SL method

Augmented l-SL (al-SL) clustering method

As discussed earlier, a few clusters in the final results of l-SL method may be required to be merged in order to obtain the exact final results as that of the SL method. Let the clustering result of the l-SL be a clustering πl={B1,B2,,Bk}. Let (Bi,Bj) be a pair of clusters in πl. If there exists a pair of patterns xBi and yBj such that xyh, then Bi and Bj must be merged together. In general, one needs to search for all pairs of clusters in πl to see whether they can be merged. The

Experimental evaluation

In this section, we evaluate the proposed methods (l-SL and al-SL) experimentally. From discussion in Section 2, it is clear that scheme proposed in [17] is not directly related to our schemes. The method proposed in [18] keeps dataset as well as distance-matrix for the entire data in primary memory. The scheme proposed in [20] needs many parameters to be adjusted to produce same clustering as that of the single-link method, which are difficult to determine. This method also requires whole

Conclusions

The SL method can find arbitrary (non-convex) shaped clusters in a dataset. However, SL method scans dataset many times. It has time and space complexity of O(n2). So, SL method is not suitable for large datasets. In this paper, we proposed two clustering methods to detect arbitrary shaped clusters for large datasets. The first method, l-SL takes considerably less time compared to that of the SL method and scans the dataset only once. Clustering results produced by l-SL is very close to that of

Acknowledgments

The authors are thankful to Prof. Vijay S. Iyengar, Fellow, IEEE and Dr. Sanasam Ranbir Singh, IIT Guwahti for their valuable suggestions. This work is supported by Council of Scientific & Industrial Research (CSIR), Government of India.

Bidyut Kr. Patra received B.Sc. (Physics), B.Tech. (CSE) and M.Tech. (CSE) from Calcutta University, Kolkata, India in 1996, 1999 and 2001, respectively. He has been working for his Ph.D. at the Indian Institute of Technology-Guwahati, India since July, 2005. He worked as a Lecturer at Haldia Institute of Technology, Haldia, West Bengal, India during August, 2001-June, 2005. Currently, he is an Assistant Professor at Tezpur Central University, Tezpur, Assam, India. His research interests

References (35)

  • L. Kaufman et al.

    Finding Groups in Data: An Introduction to Cluster Analysis

    (1990)
  • R.T. Ng et al.

    CLARANS: a method for clustering objects for spatial data mining

    IEEE Transactions on Knowledge and Data Engineering

    (2002)
  • M. Ester et al.

    A density-based algorithm for discovering clusters in large spatial databases with noise

  • A. Hinneburg et al.

    An efficient approach to clustering in large multimedia databases with noise

  • P.H.A. Sneath et al.

    Numerical Taxonomy

    (1973)
  • B. King

    Step-wise clustering procedures

    Journal of the American Statistical Association

    (1967)
  • M.H. Dunham

    Data Mining: Introductory and Advanced Topics

    (2003)
  • Cited by (42)

    • Clustering and forecasting of day-ahead electricity supply curves using a market-based distance

      2024, International Journal of Electrical Power and Energy Systems
    • Sustainability efficiency assessment of wastewater treatment plants in China: A data envelopment analysis based on cluster benchmarking

      2020, Journal of Cleaner Production
      Citation Excerpt :

      Cluster benchmarking is a method of dividing a set of DMUs into groups (i.e. clusters) according to certain attributes, so DMUs in the same cluster are more similar to each other than to DMUs in other clusters. The major purpose of grouping is to maximize the homogeneity of DMUs in the same cluster and the heterogeneity of DMUs in different clusters (Patra et al., 2011). To achieve the efficiency evaluation of objects relative to the optimal practice under the respective clusters, it is necessary to construct the production frontier separately (Zhu, 2013).

    • A novel complex network community detection approach using discrete particle swarm optimization with particle diversity and mutation

      2019, Applied Soft Computing Journal
      Citation Excerpt :

      Cheng et al. [29] combined the network density and distance construct Density Order Tree (DOT) to represent the original data, and then transform the network community detection problem into the DOT classification problem. Whang et al. [12] proposed an effective overlapping community detection algorithm based on the seed extension method. Montero [30] proposed an algorithm for fuzzy community detection based on N-dimensional clustering and overlapping functions to detect fuzzy community detection output quality.

    • Nearest neighbors based density peaks approach to intrusion detection

      2018, Chaos, Solitons and Fractals
      Citation Excerpt :

      The k-nearest neighbor is a kind of typical nonparametric algorithm [35]. The judgement standard of kNN is the distance [36]. Compared with other classification algorithms, it is simpler and more efficient [37].

    View all citing articles on Scopus

    Bidyut Kr. Patra received B.Sc. (Physics), B.Tech. (CSE) and M.Tech. (CSE) from Calcutta University, Kolkata, India in 1996, 1999 and 2001, respectively. He has been working for his Ph.D. at the Indian Institute of Technology-Guwahati, India since July, 2005. He worked as a Lecturer at Haldia Institute of Technology, Haldia, West Bengal, India during August, 2001-June, 2005. Currently, he is an Assistant Professor at Tezpur Central University, Tezpur, Assam, India. His research interests include Pattern Recognition, Data Mining and Algorithms.

    Sukumar Nandi received B.Sc. (Physics), B Tech and M Tech from Calcutta University in 1984, 1987 and 1989, respectively. He received the Ph. D. degree in Computer Science and Engineering from Indian Institute of Technology Kharagpur in 1995. In 1989-90 he was a faculty in Birla Institute of Technology, Mesra, Ranchi, India. During 1991 to 1995, he was a scientific officer in Computer Sc & Engg, Indian Institute of Technology Kharagpur. In 1995 he joined in Indian Institute of Technology Guwahati as an Assistant Professor in Computer Science and Engineering. Subsequently, he became Associate Professor in 1998 and Professor in 2002. Presently, he is the Dean of Academic Affairs of Indian Institute of Technology Guwahati. He was with the School of Computer Engineering, Nanyang Technological University, Singapore as Visiting Senior Fellow for one year (2002-2003). He was member of Board of Governor, Indian Institute of Technology Guwahati for 2005 and 2006. He was General Vice-Chair of 8th International Conference on Distributed Computing and Networking (ICDCN) 2006. He was General Co-Chair of the 15th International Conference on Advance Computing and Communication (ADCOM) 2007. He is also involved in several international conferences as member of advisory board/ Technical Programme Committee. He is reviewer of several international journals and conferences. He is co-author of a book titled Theory and Application of Cellular Automata published by IEEE Computer Society. He has published more than 150 Journals/Conferences papers. His research interests are Computer Networks (Traffic Engineering, Wireless Networks), Computer and Network security, Data mining. He is Senior Member of IEEE, Senior Member of ACM, Fellow of the Institution of Electronics and Telecommunication Engineers and Fellow of the Institution of Engineers (India).

    P. Viswanth received his M.Tech (CSE) from the Indian Institute of Technology-Madras, Chennai, India in 1996. From 1996 to 2001, he worked as a faculty member at BITS-Pilani, India and Jawaharlal Nehru Technological University, Hyderabad, India. He received his Ph.D. from the Indian Institute of Science, Bangalore, India in 2005. He received Alumni Medal from IISc-Bangalore as his Ph.D. thesis was selected as the best thesis in the Electrical Sciences Division of IISc-Bangalore in the year 2006. He worked as Assistant Professor in the Department of Computer Science and Engineering, IIT-Guwahati, Guwahati, India from February 2005 to July 2008. At present he is working as a Professor and Dean, R&D (Electrical Sciences) at Rajeev Gandhi Memorial College of Engineering & Technology, Nandyal, Andhra Pradesh, India. His areas of interest include Pattern Recognition, Data Mining, Databases and Algorithms.

    View full text