Top

Knowledge and Information Systems

Published in:

01-01-2015 | Regular Paper

Effective data summarization for hierarchical clustering in large datasets

Authors: Bidyut Kr. Patra, Sukumar Nandi

Published in: Knowledge and Information Systems | Issue 1/2015

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Cluster analysis in a large dataset is an interesting challenge in many fields of Science and Engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. The single-link is a distance-based hierarchical clustering method, which can find non-convex (arbitrary)-shaped clusters in a dataset. However, this method cannot be used for clustering large dataset as this method either keeps entire dataset in main memory or scans dataset multiple times from secondary memory of the machine. Both of them are potentially severe problems for cluster analysis in large datasets. One remedy for both problems is to create a summary of a given dataset efficiently, and the summary is subsequently used to speed up clustering methods in large datasets. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up single-link clustering method in large datasets. The ds utilizes sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds. Modified clustering method is termed as summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and clustering results produced by SSL method are close to the clustering results produced by single-link method. The SSL method outperforms single-link using data bubble (summarization scheme) both in terms of clustering accuracy and computation time. To speed up proposed summarization scheme, a technique is introduced to reduce a large number of distance computations in leaders method. Experimental studies demonstrate effectiveness of the proposed summarization scheme for large datasets.

next article Fast time intervals mining using the transitivity of temporal relations

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Let \(C_1\) and \(C_2\) be two groups (clusters) of patterns, respectively. Then, distance between them: \(\hbox {Distance}(C_1,C_2) =\min \{||x_i-x_j||\mid x_i\in C_1, x_j\in C_2\}\).

Two approaches can be found in [16].

Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60CrossRef

Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recognit 42(9):1719–1731CrossRefMATH

Bradley PS, Fayyad UM, Reina C (1998) Scaling clustering algorithms to large databases. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). New York City, New York, USA, August, pp 9–15

Breunig MM, Kriegel H-P, Krger P, Sander J (2001) Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Mehrotra S, Sellis T (eds) Proceedings of the 2001 ACM SIGMOD international conference on management of data. Santa Barbara, CA, USA, May, pp 79–90

Breunig MM, Kriegel HP, Sander J (2000) Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed D, Komorowski H, Zytkow J (eds) Proceedings of principles of data mining and knowledge discovery, 4th European conference, PKDD 2000. Lyon, France, September, pp 232–242

Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley, Singapore

Ester M, Kriegel H-P, Sander J (1996) Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96). Portland, Oregon, USA, pp 226–231

Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH

Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef

10.

Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef

11.

King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101CrossRef

12.

Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Proceedings of rough sets and current trends in computing (RSCTC 2010), 7th international conference, RSCTC 2010, Warsaw, Poland, June, 2010. Lecture Notes in Computer Science 6086, Springer, pp 60–69

13.

Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137CrossRefMathSciNetMATH

14.

Murtagh F (1984) Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q 1:101–113MATH

15.

Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016CrossRef

16.

Patra BK, Nandi S, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognit 44(12):2862–2870CrossRefMATH

17.

Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336):846–850CrossRef

18.

Ross S (2002) A first course in probability. Pearson Education, New Delhi

19.

Sarma T, Viswanath P, Reddy B (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4(2):107–117

20.

Sneath A, Sokal PH (1973) Numerical taxonomy. Freeman, LondonMATH

21.

Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 4:801–804MathSciNet

22.

Viswanath P, Babu V (2009) Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. Pattern Recognit Lett 30(16):1477–1488CrossRef

23.

Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. Montreal, Quebec, Canada, June, pp 103–114

24.

Zhao Y, Karypis G (2002) Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota

25.

Zhou J, Sander J (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag J, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds) Proceedings of 29th international conference on very large data bases (VLDB 2003), September, 2003. Germany, Berlin, pp 452–463

Title: Effective data summarization for hierarchical clustering in large datasets
Authors: Bidyut Kr. Patra
Sukumar Nandi
Publication date: 01-01-2015
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 1/2015
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-013-0709-8

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 1/2015

Fast PageRank approximation by adaptive sampling

An automatic extraction method of the domains of competence for learning classifiers using data complexity measures

Defining and evaluating network communities based on ground-truth

Fast time intervals mining using the transitivity of temporal relations

A markov prediction model for data-driven semi-structured business processes

A hybrid approach for content extraction with text density and visual importance of DOM nodes

Premium Partner