
The Journal of Supercomputing, published 01-07-2014

A parallel clustering method combined information bottleneck theory and centroid-based clustering

Authors: Zhanquan Sun, Geoffrey Fox, Weidong Gu, Zhao Li

Published in: The Journal of Supercomputing | Issue 1/2014


Abstract

Clustering is an important research topic in data mining. Clustering based on information bottleneck theory is well suited to complicated clustering problems because its information loss metric can measure arbitrary statistical relationships between samples, and it has been applied widely in many areas. As information technology develops, the scale of electronic data keeps growing. The classical information bottleneck clustering method cannot handle large-scale datasets because of its high computational cost. Parallel clustering based on the MapReduce model is the most efficient way to deal with large-scale, data-intensive clustering problems. This paper develops a parallel clustering method based on the MapReduce model. In the method, a parallel information bottleneck clustering method based on MapReduce determines the initial cluster centers, an objective method determines the final number of clusters automatically, and a parallel centroid-based clustering method determines the final clustering result. The clustering results are visualized with the interpolation MDS dimension reduction method. The efficiency of the method is illustrated with a practical DNA clustering example.
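The two-stage idea in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the abstract gives no formulas, so the merge cost below is the standard agglomerative information bottleneck cost (a weighted Jensen-Shannon divergence), the refinement step assumes squared Euclidean distance, and the number of clusters `k` is passed in rather than determined automatically. The function names (`merge_cost`, `ib_initial_centers`, `map_assign`, `reduce_centers`, `refine`) are hypothetical, and the map and reduce steps are plain Python functions rather than calls to any MapReduce runtime.

```python
import math
from collections import defaultdict


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(w_i, p_i, w_j, p_j):
    """Information lost by merging two clusters (agglomerative IB).

    w_* is the cluster prior p(c); p_* is the conditional distribution p(y|c).
    Returns the cost and the merged conditional distribution.
    """
    w = w_i + w_j
    a, b = w_i / w, w_j / w
    m = [a * x + b * y for x, y in zip(p_i, p_j)]
    return w * (a * kl(p_i, m) + b * kl(p_j, m)), m


def ib_initial_centers(samples, k):
    """Greedily merge the cheapest pair of clusters until k remain.

    Assumption: each sample is a probability vector and all samples
    start with equal prior weight.
    """
    n = len(samples)
    clusters = [(1.0 / n, list(s)) for s in samples]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost, m = merge_cost(*clusters[i], *clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j, m)
        _, i, j, m = best
        merged = (clusters[i][0] + clusters[j][0], m)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return [p for _, p in clusters]


def map_assign(sample, centers):
    """Map step: key a sample by the index of its nearest center."""
    def dist(c):
        return sum((x - y) ** 2 for x, y in zip(sample, c))
    return min(range(len(centers)), key=lambda i: dist(centers[i])), sample


def reduce_centers(pairs, centers):
    """Reduce step: average samples sharing a key; keep centers with no samples."""
    groups = defaultdict(list)
    for idx, sample in pairs:
        groups[idx].append(sample)
    return [
        [sum(col) / len(groups[i]) for col in zip(*groups[i])]
        if groups[i] else list(centers[i])
        for i in range(len(centers))
    ]


def refine(samples, centers, iterations=10):
    """Centroid-based refinement: repeat the map and reduce steps."""
    for _ in range(iterations):
        centers = reduce_centers((map_assign(s, centers) for s in samples), centers)
    return centers
```

A call such as `refine(samples, ib_initial_centers(samples, 2))` runs both stages in sequence. In a real MapReduce setting the map step would run over data partitions in parallel and only per-cluster sums and counts would travel to the reducers; the pairwise merge search, which is quadratic here, is the part that makes the classical method expensive on large datasets.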



Title: A parallel clustering method combined information bottleneck theory and centroid-based clustering
Authors: Zhanquan Sun, Geoffrey Fox, Weidong Gu, Zhao Li
Publication date: 01-07-2014
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 1/2014
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1174-1
