2020 | Original Paper | Book Chapter

A Performance Comparison of Clustering Algorithms for Big Data on DataMPI

Written by: Mo Hai

Published in: Data Science

Publisher: Springer Singapore

Abstract

Clustering algorithms for big data have important applications in finance. DataMPI is a communication library based on key-value pairs that extends MPI for Hadoop and Spark. We experimentally study the performance of the K-means, fuzzy K-means, and Canopy clustering algorithms on a DataMPI cluster. First, we observe how the number of nodes affects clustering time and scaleup. Second, we observe how the memory size of each node affects clustering time and memoryup. Throughout, we compare the performance of the three algorithms on different text data sets. The experimental results show that: (1) when the data set size, the memory size, and the number of nodes are held constant, Canopy is the fastest, followed by K-means, and fuzzy K-means is the slowest; (2) when the memory size of each node is fixed, all three algorithms show good scaleup on all text data sets, which indicates that adding nodes significantly improves their efficiency; (3) when the number of nodes is fixed and the memory size is increased from 1 GB to 4 GB, the clustering time decreases significantly, which indicates that all three algorithms have good memoryup.
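For readers unfamiliar with Canopy, the following is a minimal single-machine sketch of the standard two-threshold canopy procedure, which may help explain why it tends to be cheaper than the iterative K-means variants: it makes one pass over the remaining points per canopy and never recomputes centers. This is not the chapter's DataMPI implementation. The function names `canopy` and `euclidean`, the thresholds `t1` and `t2`, and the sample points are all illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold and t2 the tight threshold (t1 > t2).
    Points closer than t1 to a center join its canopy; points closer
    than t2 are also removed from the candidate list, so they can
    never become the center of a later canopy.
    """
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Assumption: take the first remaining point as the center
        # so that the result is deterministic.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Hypothetical sample data: two well-separated groups of 2-D points.
sample_points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
sample_canopies = canopy(sample_points, t1=2.0, t2=1.0)
for center, members in sample_canopies:
    print(center, members)
```

In practice the canopy centers are often used as initial centers for K-means, which is one reason the two algorithms are frequently benchmarked together.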

Metadata
Title
A Performance Comparison of Clustering Algorithms for Big Data on DataMPI
Written by
Mo Hai
Copyright Year
2020
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-15-2810-1_33