Skip to main content
Erschienen in: Soft Computing 5/2014

01.05.2014 | Methodologies and Application

GACH: a grid-based algorithm for hierarchical clustering of high-dimensional data

verfasst von: Eghbal G. Mansoori

Erschienen in: Soft Computing | Ausgabe 5/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

This paper proposes a grid-based hierarchical clustering algorithm (GACH) as an efficient and robust method to explore clusters in high-dimensional data with no prior knowledge. It discovers the initial positions of the potential clusters automatically and then combines them hierarchically to obtain the final clusters. In this regard, GACH first projects the data patterns on a two-dimensional space (i.e., on a plane established by two features) to overcome the curse of dimensionality problem in high-dimensional data. To choose these two well-informed features, a simple and fast feature selection algorithm is proposed. Then, through meshing the plane with grid lines, GACH detects the crowded grid points. The nearest data patterns around these grid points are considered as initial members of some potential clusters. By returning the patterns back to their true dimensions, GACH refines these clusters. In the merging phase, GACH combines the closely adjacent clusters in a hierarchical bottom-up manner to construct the final clusters’ members. The main features of GACH are: (1) it automatically discovers the clusters, (2) the obtained clusters are stable, (3) it is efficient for data sets with high dimensions, and (4) its merging process involves a threshold which can be obtained in advance for well-clustered data. To assess our proposed algorithm, it is applied on some benchmark data sets and the validity of obtained clusters is compared with the results of some other clustering algorithms. This comparison shows that GACH is accurate, efficient and feasible to discover clusters in high-dimensional data.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic sub-space clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD International Conference on MOD, pp 94–105 Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic sub-space clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD International Conference on MOD, pp 94–105
Zurück zum Zitat Asuncion A, Newman DJ (2007) UCI machine learning repository. Department of Information and Computer Science, University of California, Irvine Asuncion A, Newman DJ (2007) UCI machine learning repository. Department of Information and Computer Science, University of California, Irvine
Zurück zum Zitat Benson SYL, Hong Y (2007) Assessment of microarray data clustering results based on a new geometrical index for cluster validity. Soft Comput 11(4):341–348 Benson SYL, Hong Y (2007) Assessment of microarray data clustering results based on a new geometrical index for cluster validity. Soft Comput 11(4):341–348
Zurück zum Zitat Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Zurück zum Zitat Chandra B, Gupta M (2013) A novel approach for distance-based semi-supervised clustering using functional link neural network. Soft Comput 17(3):369–379 Chandra B, Gupta M (2013) A novel approach for distance-based semi-supervised clustering using functional link neural network. Soft Comput 17(3):369–379
Zurück zum Zitat Chang CI, Lin NP, Jan NY (2009) An axis shifted clustering algorithm. Tamkang J Sci Eng 12(2):183–192 Chang CI, Lin NP, Jan NY (2009) An axis shifted clustering algorithm. Tamkang J Sci Eng 12(2):183–192
Zurück zum Zitat Everitt B, Landau S, Leese M (2001) Cluster analysis. Arnold, London Everitt B, Landau S, Leese M (2001) Cluster analysis. Arnold, London
Zurück zum Zitat Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco Han J, Kamber M (2001) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Zurück zum Zitat Hinneburg A, Keim D (1999) Optimal grid-clustering: toward breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th VLDB Conference, pp 506–517 Hinneburg A, Keim D (1999) Optimal grid-clustering: toward breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th VLDB Conference, pp 506–517
Zurück zum Zitat Ilango M, Mohan V (2010) A survey of grid based clustering algorithms. Int J Eng Sci Tech 2(8):3441–3446 Ilango M, Mohan V (2010) A survey of grid based clustering algorithms. Int J Eng Sci Tech 2(8):3441–3446
Zurück zum Zitat Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef
Zurück zum Zitat Jarman IH, Etchells TA, Bacciu D, Garibaldi JM, Ellis IO, Lisboa PJG (2011) Clustering of protein expression data: a benchmark of statistical and neural approaches. Soft Comput 15(8):1459–1469CrossRef Jarman IH, Etchells TA, Bacciu D, Garibaldi JM, Ellis IO, Lisboa PJG (2011) Clustering of protein expression data: a benchmark of statistical and neural approaches. Soft Comput 15(8):1459–1469CrossRef
Zurück zum Zitat Kohavi R, Provost F (1998) Glossary of terms. Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process 30(2/3) Kohavi R, Provost F (1998) Glossary of terms. Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process 30(2/3)
Zurück zum Zitat Krishnapuram R, Keller JM (1993) A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1(2):98–110CrossRef Krishnapuram R, Keller JM (1993) A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1(2):98–110CrossRef
Zurück zum Zitat Mansoori EG (2011) FRBC: a fuzzy rule-based clustering algorithm. IEEE Trans Fuzzy Syst 19(5):960–971CrossRef Mansoori EG (2011) FRBC: a fuzzy rule-based clustering algorithm. IEEE Trans Fuzzy Syst 19(5):960–971CrossRef
Zurück zum Zitat Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654CrossRef Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654CrossRef
Zurück zum Zitat McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of Fifth Berkeley Symposium, Math Statistics and Probability, pp 281–297 McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of Fifth Berkeley Symposium, Math Statistics and Probability, pp 281–297
Zurück zum Zitat Monmarché N, Slimane M, Venturini G (1999) AntClass: discovery of clusters in numeric data by an hybridization of an ant colony with the Kmeans algorithm. Internal Report No 213, E3i Monmarché N, Slimane M, Venturini G (1999) AntClass: discovery of clusters in numeric data by an hybridization of an ant colony with the Kmeans algorithm. Internal Report No 213, E3i
Zurück zum Zitat Ng MK, Li MJ, Huang JZ, He Z (2007) On the impact of dissimilarity measure in \(k\)-modes clustering algorithm. IEEE Trans Pattern Anal Mach Intell 29(3):503–507CrossRef Ng MK, Li MJ, Huang JZ, He Z (2007) On the impact of dissimilarity measure in \(k\)-modes clustering algorithm. IEEE Trans Pattern Anal Mach Intell 29(3):503–507CrossRef
Zurück zum Zitat Ordonez C, Omiecinski E (2004) Efficient disk-based K-means clustering for relational databases. IEEE Trans Knowl Data Eng 16(8):909–921CrossRef Ordonez C, Omiecinski E (2004) Efficient disk-based K-means clustering for relational databases. IEEE Trans Knowl Data Eng 16(8):909–921CrossRef
Zurück zum Zitat Schikuta E (1993) Grid-clustering: a hierarchical clustering method for very large data sets. In: Technical Report TR-CRPC No. 93358, Center for Research on Parallel Computation, Rice University, Houston Schikuta E (1993) Grid-clustering: a hierarchical clustering method for very large data sets. In: Technical Report TR-CRPC No. 93358, Center for Research on Parallel Computation, Rice University, Houston
Zurück zum Zitat Sheikholeslami G, Chatterjee S, Zhang A (2000) WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J Very Large Data Bases 8:289–304CrossRef Sheikholeslami G, Chatterjee S, Zhang A (2000) WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J Very Large Data Bases 8:289–304CrossRef
Zurück zum Zitat Schockaert S, De-Cock M, Cornelis C, Kerr EE (2007) Clustering web search results using fuzzy ants. Int J Intell Syst 22(5):455–474CrossRefMATH Schockaert S, De-Cock M, Cornelis C, Kerr EE (2007) Clustering web search results using fuzzy ants. Int J Intell Syst 22(5):455–474CrossRefMATH
Zurück zum Zitat Sledge IJ, Havens TC, Huband JM, Bezdek JC, Keller JM (2009) Finding the number of clusters in ordered dissimilarities. Soft Comput 13(12):1125–1142CrossRef Sledge IJ, Havens TC, Huband JM, Bezdek JC, Keller JM (2009) Finding the number of clusters in ordered dissimilarities. Soft Comput 13(12):1125–1142CrossRef
Zurück zum Zitat Vicente D, Vellido A (2004) A review of hierarchical models for data clustering and visualization. In: Gir’aldez R, Riquelme JC, Aguilar-Ruiz JS (eds) Tendencias de la Minería de Datos en España. Red Española de Minería de Datos Vicente D, Vellido A (2004) A review of hierarchical models for data clustering and visualization. In: Gir’aldez R, Riquelme JC, Aguilar-Ruiz JS (eds) Tendencias de la Minería de Datos en España. Red Española de Minería de Datos
Zurück zum Zitat Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678CrossRef Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678CrossRef
Zurück zum Zitat Yager R, Filev D (1994) Approximate clustering via the mountain method. IEEE Trans Syst Man Cybern 24(8):1279–1284CrossRef Yager R, Filev D (1994) Approximate clustering via the mountain method. IEEE Trans Syst Man Cybern 24(8):1279–1284CrossRef
Zurück zum Zitat Yang W, Muntz R, Wang W, Yang J (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of 23rd International Conference on VLDB, pp 186–195 Yang W, Muntz R, Wang W, Yang J (1997) STING: a statistical information grid approach to spatial data mining. In: Proceedings of 23rd International Conference on VLDB, pp 186–195
Zurück zum Zitat Yue S, Wei M, Wang J, Wang H (2008) A general grid-clustering approach. Pattern Recognit Lett 29:1372–1384CrossRef Yue S, Wei M, Wang J, Wang H (2008) A general grid-clustering approach. Pattern Recognit Lett 29:1372–1384CrossRef
Metadaten
Titel
GACH: a grid-based algorithm for hierarchical clustering of high-dimensional data
verfasst von
Eghbal G. Mansoori
Publikationsdatum
01.05.2014
Verlag
Springer Berlin Heidelberg
Erschienen in
Soft Computing / Ausgabe 5/2014
Print ISSN: 1432-7643
Elektronische ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-013-1105-8

Weitere Artikel der Ausgabe 5/2014

Soft Computing 5/2014 Zur Ausgabe