Skip to main content
Erschienen in: Knowledge and Information Systems 1/2015

01.01.2015 | Regular Paper

Effective data summarization for hierarchical clustering in large datasets

verfasst von: Bidyut Kr. Patra, Sukumar Nandi

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Cluster analysis in a large dataset is an interesting challenge in many fields of Science and Engineering. One important clustering approach is hierarchical clustering, which outputs hierarchical (nested) structures of a given dataset. The single-link is a distance-based hierarchical clustering method, which can find non-convex (arbitrary)-shaped clusters in a dataset. However, this method cannot be used for clustering large dataset as this method either keeps entire dataset in main memory or scans dataset multiple times from secondary memory of the machine. Both of them are potentially severe problems for cluster analysis in large datasets. One remedy for both problems is to create a summary of a given dataset efficiently, and the summary is subsequently used to speed up clustering methods in large datasets. In this paper, we propose a summarization scheme termed data sphere (ds) to speed up single-link clustering method in large datasets. The ds utilizes sequential leaders clustering method to collect important statistics of a given dataset. The single-link method is modified to work with ds. Modified clustering method is termed as summarized single-link (SSL). The SSL method is considerably faster than the single-link method applied directly to the dataset, and clustering results produced by SSL method are close to the clustering results produced by single-link method. The SSL method outperforms single-link using data bubble (summarization scheme) both in terms of clustering accuracy and computation time. To speed up proposed summarization scheme, a technique is introduced to reduce a large number of distance computations in leaders method. Experimental studies demonstrate effectiveness of the proposed summarization scheme for large datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
1
Let \(C_1\) and \(C_2\) be two groups (clusters) of patterns, respectively. Then, distance between them: \(\hbox {Distance}(C_1,C_2) =\min \{||x_i-x_j||\mid x_i\in C_1, x_j\in C_2\}\).
 
2
Two approaches can be found in [16].
 
Literatur
1.
Zurück zum Zitat Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60CrossRef Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60CrossRef
2.
Zurück zum Zitat Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recognit 42(9):1719–1731CrossRefMATH Babu VS, Viswanath P (2009) Rough-fuzzy weighted k-nearest leader classifier for large data sets. Pattern Recognit 42(9):1719–1731CrossRefMATH
3.
Zurück zum Zitat Bradley PS, Fayyad UM, Reina C (1998) Scaling clustering algorithms to large databases. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). New York City, New York, USA, August, pp 9–15 Bradley PS, Fayyad UM, Reina C (1998) Scaling clustering algorithms to large databases. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining (KDD-98). New York City, New York, USA, August, pp 9–15
4.
Zurück zum Zitat Breunig MM, Kriegel H-P, Krger P, Sander J (2001) Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Mehrotra S, Sellis T (eds) Proceedings of the 2001 ACM SIGMOD international conference on management of data. Santa Barbara, CA, USA, May, pp 79–90 Breunig MM, Kriegel H-P, Krger P, Sander J (2001) Data bubbles: quality preserving performance boosting for hierarchical clustering. In: Mehrotra S, Sellis T (eds) Proceedings of the 2001 ACM SIGMOD international conference on management of data. Santa Barbara, CA, USA, May, pp 79–90
5.
Zurück zum Zitat Breunig MM, Kriegel HP, Sander J (2000) Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed D, Komorowski H, Zytkow J (eds) Proceedings of principles of data mining and knowledge discovery, 4th European conference, PKDD 2000. Lyon, France, September, pp 232–242 Breunig MM, Kriegel HP, Sander J (2000) Fast hierarchical clustering based on compressed data and OPTICS. In: Zighed D, Komorowski H, Zytkow J (eds) Proceedings of principles of data mining and knowledge discovery, 4th European conference, PKDD 2000. Lyon, France, September, pp 232–242
6.
Zurück zum Zitat Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley, Singapore Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley, Singapore
7.
Zurück zum Zitat Ester M, Kriegel H-P, Sander J (1996) Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96). Portland, Oregon, USA, pp 226–231 Ester M, Kriegel H-P, Sander J (1996) Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of the second international conference on knowledge discovery and data mining (KDD-96). Portland, Oregon, USA, pp 226–231
8.
Zurück zum Zitat Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH
9.
Zurück zum Zitat Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef
10.
Zurück zum Zitat Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef
11.
Zurück zum Zitat King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101CrossRef King B (1967) Step-wise clustering procedures. J Am Stat Assoc 62(317):86–101CrossRef
12.
Zurück zum Zitat Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Proceedings of rough sets and current trends in computing (RSCTC 2010), 7th international conference, RSCTC 2010, Warsaw, Poland, June, 2010. Lecture Notes in Computer Science 6086, Springer, pp 60–69 Kryszkiewicz M, Lasek P (2010) TI-DBSCAN: clustering with DBSCAN by means of the triangle inequality. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Proceedings of rough sets and current trends in computing (RSCTC 2010), 7th international conference, RSCTC 2010, Warsaw, Poland, June, 2010. Lecture Notes in Computer Science 6086, Springer, pp 60–69
14.
Zurück zum Zitat Murtagh F (1984) Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q 1:101–113MATH Murtagh F (1984) Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q 1:101–113MATH
15.
Zurück zum Zitat Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016CrossRef Ng RT, Han J (2002) CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng 14(5):1003–1016CrossRef
16.
Zurück zum Zitat Patra BK, Nandi S, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognit 44(12):2862–2870CrossRefMATH Patra BK, Nandi S, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large datasets. Pattern Recognit 44(12):2862–2870CrossRefMATH
17.
Zurück zum Zitat Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336):846–850CrossRef Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336):846–850CrossRef
18.
Zurück zum Zitat Ross S (2002) A first course in probability. Pearson Education, New Delhi Ross S (2002) A first course in probability. Pearson Education, New Delhi
19.
Zurück zum Zitat Sarma T, Viswanath P, Reddy B (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4(2):107–117 Sarma T, Viswanath P, Reddy B (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4(2):107–117
20.
Zurück zum Zitat Sneath A, Sokal PH (1973) Numerical taxonomy. Freeman, LondonMATH Sneath A, Sokal PH (1973) Numerical taxonomy. Freeman, LondonMATH
21.
Zurück zum Zitat Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 4:801–804MathSciNet Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Polon Sci Cl III 4:801–804MathSciNet
22.
Zurück zum Zitat Viswanath P, Babu V (2009) Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. Pattern Recognit Lett 30(16):1477–1488CrossRef Viswanath P, Babu V (2009) Rough-DBSCAN: a fast hybrid density based clustering method for large data sets. Pattern Recognit Lett 30(16):1477–1488CrossRef
23.
Zurück zum Zitat Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. Montreal, Quebec, Canada, June, pp 103–114 Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. Montreal, Quebec, Canada, June, pp 103–114
24.
Zurück zum Zitat Zhao Y, Karypis G (2002) Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota Zhao Y, Karypis G (2002) Criterion functions for document clustering: experiments and analysis. Technical report, University of Minnesota
25.
Zurück zum Zitat Zhou J, Sander J (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag J, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds) Proceedings of 29th international conference on very large data bases (VLDB 2003), September, 2003. Germany, Berlin, pp 452–463 Zhou J, Sander J (2003) Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces. In: Freytag J, Lockemann P, Abiteboul S, Carey M, Selinger P, Heuer A (eds) Proceedings of 29th international conference on very large data bases (VLDB 2003), September, 2003. Germany, Berlin, pp 452–463
Metadaten
Titel
Effective data summarization for hierarchical clustering in large datasets
verfasst von
Bidyut Kr. Patra
Sukumar Nandi
Publikationsdatum
01.01.2015
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2015
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-013-0709-8

Weitere Artikel der Ausgabe 1/2015

Knowledge and Information Systems 1/2015 Zur Ausgabe