Skip to main content
Erschienen in: The Journal of Supercomputing 2/2014

01.08.2014

High performance parallel \(k\)-means clustering for disk-resident datasets on multi-core CPUs

verfasst von: Ali Hadian, Saeed Shahrivari

Erschienen in: The Journal of Supercomputing | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g. \(k\)-means, has to keep the data in memory and iterate over the data many times which is very costly for big datasets, (2) clustering algorithms that run on limited memory sizes, especially the family of stream-clustering algorithms, do not have a parallel implementation to utilize modern multi-core processors and also they lack decent quality of results. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is to make a clustering algorithm that utilizes maximum capabilities of a regular multi-core PC to cluster the dataset as fast as possible while resulting in acceptable quality of clusters. Our idea is to split the data into chunks and cluster each chunk in a separate thread. Then, the clusters extracted from chunks are aggregated at the final stage using re-clustering. Parameters of the algorithm can be adjusted according to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. \(k\)-means++) and stream-based algorithms. Also, the quality of solution is often equal to \(k\)-means++, while it significantly dominates stream-clustering algorithms. Our solution also scales well with extra available cores and hence provides an effective and fast solution to clustering large datasets on multi-core and multi-processor systems.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. J Exp Algorithm 17(1):2–4CrossRef Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. J Exp Algorithm 17(1):2–4CrossRef
3.
Zurück zum Zitat Ana LNF, Jain AK (2003) Robust data clustering. In: Proceedings. 2003 IEEE computer society conference on computer vision and pattern recognition, vol 2, IEEE, 2003, pp II–128 Ana LNF, Jain AK (2003) Robust data clustering. In: Proceedings. 2003 IEEE computer society conference on computer vision and pattern recognition, vol 2, IEEE, 2003, pp II–128
4.
Zurück zum Zitat Apiletti D, Baralis E, Cerquitelli T, Chiusano S, Grimaudo L (2013) Searum: a cloud-based service for association rule mining. In: Trust, security and privacy in computing and communications (TrustCom), 12th IEEE International Conference on, IEEE, 2013, pp 1283–1290 Apiletti D, Baralis E, Cerquitelli T, Chiusano S, Grimaudo L (2013) Searum: a cloud-based service for association rule mining. In: Trust, security and privacy in computing and communications (TrustCom), 12th IEEE International Conference on, IEEE, 2013, pp 1283–1290
5.
Zurück zum Zitat Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1027–1035 Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1027–1035
6.
Zurück zum Zitat Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable k-means++. Proc VLDB Endow 5(7):622–633CrossRef Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable k-means++. Proc VLDB Endow 5(7):622–633CrossRef
7.
Zurück zum Zitat Bekkerman R, Bilenko M, Langford J (2012) Scaling up machine learning: parallel and distributed approaches. Cambridge University Press, Cambridge Bekkerman R, Bilenko M, Langford J (2012) Scaling up machine learning: parallel and distributed approaches. Cambridge University Press, Cambridge
8.
Zurück zum Zitat Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications. Springer, Berlin Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications. Springer, Berlin
9.
Zurück zum Zitat Bottou L, Bengio Y (1995) Convergence properties of the K-means algorithm. Advances in Neural Information Processing Systems 7 (NIPS’94), pp 585–592 Bottou L, Bengio Y (1995) Convergence properties of the K-means algorithm. Advances in Neural Information Processing Systems 7 (NIPS’94), pp 585–592
10.
Zurück zum Zitat Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B (2011) Streaming k-means on well-clusterable data. In: Proceedings of the 22nd annual ACM-SIAM symposium on discrete algorithms, SIAM, pp 26–40 Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B (2011) Streaming k-means on well-clusterable data. In: Proceedings of the 22nd annual ACM-SIAM symposium on discrete algorithms, SIAM, pp 26–40
11.
Zurück zum Zitat Chandra A, Yao X (2006) Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing 69(7):686–700 Chandra A, Yao X (2006) Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing 69(7):686–700
12.
Zurück zum Zitat Chiang MMT, Mirkin B (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads. J Classif 27(1):3–40CrossRefMathSciNet Chiang MMT, Mirkin B (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads. J Classif 27(1):3–40CrossRefMathSciNet
13.
Zurück zum Zitat Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef
14.
Zurück zum Zitat Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Large-scale parallel data mining. Springer, Berlin, pp 245–260 Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Large-scale parallel data mining. Springer, Berlin, pp 245–260
15.
Zurück zum Zitat Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: ICML, pp 106–113 Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: ICML, pp 106–113
16.
Zurück zum Zitat Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 681–689 Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 681–689
17.
Zurück zum Zitat Feldman D, Schmidt M, Sohler C (2013) Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the 24th ACM-SIAM symposium on discrete algorithms (SODA13) Feldman D, Schmidt M, Sohler C (2013) Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the 24th ACM-SIAM symposium on discrete algorithms (SODA13)
18.
Zurück zum Zitat Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications, vol 20. Society for Industrial and Applied Mathematics Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications, vol 20. Society for Industrial and Applied Mathematics
19.
Zurück zum Zitat Hamerly YFG (2007) PG-means: learning the number of clusters in data. In: Advances in neural information processing systems, vol 19. The MIT Press, Cambridge, pp 393–400 Hamerly YFG (2007) PG-means: learning the number of clusters in data. In: Advances in neural information processing systems, vol 19. The MIT Press, Cambridge, pp 393–400
20.
Zurück zum Zitat Hawwash B, Nasraoui O (2012) Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, BigMine ’12ACM, New York, pp 109–117 Hawwash B, Nasraoui O (2012) Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, BigMine ’12ACM, New York, pp 109–117
21.
Zurück zum Zitat Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666CrossRef Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666CrossRef
22.
Zurück zum Zitat Mahafzah BA, Al-Badarneh AF, Zakaria MZ (2009) A new sampling technique for association rule mining. J Inf Sci 35(3):358–376CrossRef Mahafzah BA, Al-Badarneh AF, Zakaria MZ (2009) A new sampling technique for association rule mining. J Inf Sci 35(3):358–376CrossRef
23.
Zurück zum Zitat McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley-Interscience, New York McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley-Interscience, New York
24.
Zurück zum Zitat OCallaghan L (2003) Approximation algorithms for clustering streams and large data sets. PhD thesis, Stanford University OCallaghan L (2003) Approximation algorithms for clustering streams and large data sets. PhD thesis, Stanford University
25.
Zurück zum Zitat Pelleg D, Moore A (2000) Others: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, vol 1. San Francisco, pp 727–734 Pelleg D, Moore A (2000) Others: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, vol 1. San Francisco, pp 727–734
26.
Zurück zum Zitat Rajaraman A, Ullman JD (2012) Mining of massive datasets. Cambridge University Press, Cambridge Rajaraman A, Ullman JD (2012) Mining of massive datasets. Cambridge University Press, Cambridge
27.
Zurück zum Zitat Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web. ACM, pp 1177–1178 Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web. ACM, pp 1177–1178
28.
Zurück zum Zitat Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In. Advances in neural information processing systems, pp 2375–2383 Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In. Advances in neural information processing systems, pp 2375–2383
29.
Zurück zum Zitat Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Royal Stat Soc Series B (Stat Methodol) 63(2):411–423CrossRefMATHMathSciNet Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Royal Stat Soc Series B (Stat Methodol) 63(2):411–423CrossRefMATHMathSciNet
30.
Zurück zum Zitat Wang L, Leckie C, Ramamohanarao K, Bezdek J (2009) Automatically determining the number of clusters in unlabeled data sets. IEEE Trans Knowl Data Eng 21(3):335–350CrossRef Wang L, Leckie C, Ramamohanarao K, Bezdek J (2009) Automatically determining the number of clusters in unlabeled data sets. IEEE Trans Knowl Data Eng 21(3):335–350CrossRef
31.
Zurück zum Zitat Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY (2008) Others: top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY (2008) Others: top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef
32.
Zurück zum Zitat Xu R, Wunsch D (2008) Clustering, vol 10. Wiley-IEEE Press, New York Xu R, Wunsch D (2008) Clustering, vol 10. Wiley-IEEE Press, New York
33.
34.
Zurück zum Zitat Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Cloud computing. Springer, Berlin, pp 674–679 Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Cloud computing. Springer, Berlin, pp 674–679
Metadaten
Titel
High performance parallel -means clustering for disk-resident datasets on multi-core CPUs
verfasst von
Ali Hadian
Saeed Shahrivari
Publikationsdatum
01.08.2014
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 2/2014
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-014-1185-y

Weitere Artikel der Ausgabe 2/2014

The Journal of Supercomputing 2/2014 Zur Ausgabe