Published in: The Journal of Supercomputing, Issue 1/2014

July 1, 2014

A parallel clustering method combined information bottleneck theory and centroid-based clustering

Authors: Zhanquan Sun, Geoffrey Fox, Weidong Gu, Zhao Li


Abstract

Clustering is an important research topic in data mining. Clustering based on information bottleneck theory is well suited to complicated clustering problems because its information loss metric can measure arbitrary statistical relationships between samples, and it has been applied widely in many areas. As information technology develops, the scale of electronic data keeps growing. The classical information bottleneck clustering method cannot handle large-scale datasets because of its high computational cost. Parallel clustering based on the MapReduce model is the most efficient way to deal with large-scale, data-intensive clustering problems. This paper develops a parallel clustering method based on the MapReduce model. In this method, a parallel information bottleneck clustering method based on MapReduce is proposed to determine the initial cluster centers. An objective method is proposed to determine the final number of clusters automatically. A parallel centroid-based clustering method is proposed to determine the final clustering result. The clustering results are visualized with the interpolation MDS dimension reduction method. The efficiency of the method is illustrated with a practical DNA clustering example.
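The abstract does not give the formula for the information loss metric. As a rough illustration only, the sketch below uses the merge cost from the standard agglomerative information bottleneck formulation: the combined prior weight of two clusters times the weighted Jensen-Shannon divergence of their conditional distributions. It is an assumption that the paper's metric takes this form. The function names (`kl`, `merge_cost`, `agglomerative_ib`), the uniform sample prior, and the serial greedy loop are all hypothetical and stand in for the MapReduce version described above.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def merge_cost(w_i, p_i, w_j, p_j):
    """Information loss of merging two clusters (standard agglomerative IB form).

    w_i, w_j: prior weights p(c_i), p(c_j) of the two clusters
    p_i, p_j: conditional distributions p(y | c_i), p(y | c_j)
    Returns (loss, merged conditional distribution).
    """
    w = w_i + w_j
    pi_i, pi_j = w_i / w, w_j / w
    merged = [pi_i * a + pi_j * b for a, b in zip(p_i, p_j)]
    # Weighted Jensen-Shannon divergence between the two conditionals.
    js = pi_i * kl(p_i, merged) + pi_j * kl(p_j, merged)
    return w * js, merged


def agglomerative_ib(dists, n_clusters):
    """Greedily merge the pair with the smallest information loss.

    dists: one probability distribution per sample (uniform prior assumed).
    Returns a list of (weight, center distribution, member indices).
    """
    n = len(dists)
    clusters = [(1.0 / n, list(d), [i]) for i, d in enumerate(dists)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost, merged = merge_cost(clusters[a][0], clusters[a][1],
                                          clusters[b][0], clusters[b][1])
                if best is None or cost < best[0]:
                    best = (cost, a, b, merged)
        _, a, b, merged = best
        new = (clusters[a][0] + clusters[b][0], merged,
               clusters[a][2] + clusters[b][2])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [new]
    return clusters


if __name__ == "__main__":
    samples = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    for weight, center, members in agglomerative_ib(samples, 2):
        print(members, [round(x, 3) for x in center])
```

The center distributions returned here play the role of the initial cluster centers that a centroid-based pass would then refine. The pairwise search is quadratic in the number of clusters per merge, which is the kind of cost that motivates distributing the work.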

Metadata
Title
A parallel clustering method combined information bottleneck theory and centroid-based clustering
Authors
Zhanquan Sun
Geoffrey Fox
Weidong Gu
Zhao Li
Publication date
July 1, 2014
Publisher
Springer US
Published in
The Journal of Supercomputing, Issue 1/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1174-1