2020 | OriginalPaper | Book chapter

A Performance Comparison of Clustering Algorithms for Big Data on DataMPI

Author: Mo Hai

Published in: Data Science

Publisher: Springer Singapore

Abstract

Clustering algorithms for big data have important applications in finance. DataMPI is a communication library based on key-value pairs that extends MPI for Hadoop and Spark. We experimentally study the performance of the K-means, fuzzy K-means, and Canopy clustering algorithms on a DataMPI cluster. First, we observe the influence of the number of nodes on clustering time and scaleup. Second, we observe the influence of the memory size of each node on clustering time and memoryup. Throughout, we compare the performance of the three algorithms on different text data sets. The experimental results show that: (1) when the data set size, the memory size, and the number of nodes are held constant, Canopy is the fastest, followed by K-means, and fuzzy K-means is the slowest; (2) when the memory size of each node is fixed, all three algorithms show good scaleup on all text data sets, which indicates that increasing the number of nodes significantly improves their efficiency; (3) when the number of nodes is fixed and the memory size of each node is increased from 1 GB to 4 GB, clustering time decreases significantly, which indicates that all three algorithms have good memoryup.
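For readers unfamiliar with the three algorithms named in the abstract, the following is a minimal single-machine sketch of their textbook forms in plain Python. It is an illustration only and not the chapter's DataMPI implementation: the function names, the Canopy thresholds `t1` and `t2`, the fuzziness exponent `m`, the iteration count, and the toy points are all assumptions introduced here. In this form Canopy makes a single pass over the data, while K-means and fuzzy K-means iterate, and fuzzy K-means additionally computes a membership weight for every point-center pair.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2):
    """Single-pass Canopy clustering; t1 is the loose and t2 the tight threshold."""
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        # Every remaining point within t1 joins this canopy.
        members = [p for p in remaining if dist(center, p) < t1]
        canopies.append((center, members))
        # Points within t2 (including the center) are dropped from further passes.
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies


def kmeans(points, centers, iters=10):
    """K-means (Lloyd's iteration): hard assignment, then mean update."""
    dim = len(points[0])
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(p[d] for p in g) / len(g) for d in range(dim)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def fuzzy_kmeans(points, centers, m=2.0, iters=10, eps=1e-12):
    """Fuzzy K-means: every point belongs to every cluster with a weight."""
    dim = len(points[0])
    k = len(centers)
    centers = list(centers)
    power = 2.0 / (m - 1.0)
    for _ in range(iters):
        weights = []
        for p in points:
            # Clamp distances so a point sitting on a center does not divide by zero.
            d = [max(dist(p, c), eps) for c in centers]
            u = [1.0 / sum((d[j] / d[i]) ** power for i in range(k)) for j in range(k)]
            weights.append([w ** m for w in u])
        # Each center becomes the membership-weighted mean of all points.
        centers = [
            tuple(
                sum(w[j] * p[d] for w, p in zip(weights, points))
                / sum(w[j] for w in weights)
                for d in range(dim)
            )
            for j in range(k)
        ]
    return centers


# Toy data (hypothetical): two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
canopies = canopy(points, t1=3.0, t2=2.0)
seeds = [center for center, _ in canopies]
hard = kmeans(points, seeds)
soft = fuzzy_kmeans(points, seeds)
print(len(canopies), hard, soft)
```

Seeding K-means and fuzzy K-means with the Canopy centers is a common pairing, used here only to keep the example deterministic; the abstract does not state how the chapter initializes the algorithms.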

Title: A Performance Comparison of Clustering Algorithms for Big Data on DataMPI
Author: Mo Hai
Publisher: Springer Singapore
Book: Data Science
Print ISBN: 978-981-15-2809-5
Electronic ISBN: 978-981-15-2810-1
Copyright year: 2020
DOI: https://doi.org/10.1007/978-981-15-2810-1_33
