nach oben

The Journal of Supercomputing

Erschienen in:

01.08.2014

High performance parallel \(k\)-means clustering for disk-resident datasets on multi-core CPUs

verfasst von: Ali Hadian, Saeed Shahrivari

Erschienen in: The Journal of Supercomputing | Ausgabe 2/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g. \(k\)-means, has to keep the data in memory and iterate over the data many times which is very costly for big datasets, (2) clustering algorithms that run on limited memory sizes, especially the family of stream-clustering algorithms, do not have a parallel implementation to utilize modern multi-core processors and also they lack decent quality of results. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is to make a clustering algorithm that utilizes maximum capabilities of a regular multi-core PC to cluster the dataset as fast as possible while resulting in acceptable quality of clusters. Our idea is to split the data into chunks and cluster each chunk in a separate thread. Then, the clusters extracted from chunks are aggregated at the final stage using re-clustering. Parameters of the algorithm can be adjusted according to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. \(k\)-means++) and stream-based algorithms. Also, the quality of solution is often equal to \(k\)-means++, while it significantly dominates stream-clustering algorithms. Our solution also scales well with extra available cores and hence provides an effective and fast solution to clustering large datasets on multi-core and multi-processor systems.

Vorheriger Artikel Combined performance and availability analysis of distributed resources in grid computing

Nächster Artikel A run-time optimization approach for reducing data movements using locality-aware searching

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

The source code can be found at https://github.com/alihadian/mpkmeans.

http://kdd.ics.uci.edu/databases/kddcup99/.

http://archive.ics.uci.edu/ml/datasets/Covertype.

http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption.

http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990).

http://opencv.org/.

Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. J Exp Algorithm 17(1):2–4CrossRef

Agarwal PK, Har-Peled S, Varadarajan KR (2004) Approximating extent measures of points. J ACM 51(4):606–635CrossRefMATHMathSciNet

Ana LNF, Jain AK (2003) Robust data clustering. In: Proceedings. 2003 IEEE computer society conference on computer vision and pattern recognition, vol 2, IEEE, 2003, pp II–128

Apiletti D, Baralis E, Cerquitelli T, Chiusano S, Grimaudo L (2013) Searum: a cloud-based service for association rule mining. In: Trust, security and privacy in computing and communications (TrustCom), 12th IEEE International Conference on, IEEE, 2013, pp 1283–1290

Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 1027–1035

Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable k-means++. Proc VLDB Endow 5(7):622–633CrossRef

Bekkerman R, Bilenko M, Langford J (2012) Scaling up machine learning: parallel and distributed approaches. Cambridge University Press, Cambridge

Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications. Springer, Berlin

Bottou L, Bengio Y (1995) Convergence properties of the K-means algorithm. Advances in Neural Information Processing Systems 7 (NIPS’94), pp 585–592

10.

Braverman V, Meyerson A, Ostrovsky R, Roytman A, Shindler M, Tagiku B (2011) Streaming k-means on well-clusterable data. In: Proceedings of the 22nd annual ACM-SIAM symposium on discrete algorithms, SIAM, pp 26–40

11.

Chandra A, Yao X (2006) Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing 69(7):686–700

12.

Chiang MMT, Mirkin B (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads. J Classif 27(1):3–40CrossRefMathSciNet

13.

Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113CrossRef

14.

Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Large-scale parallel data mining. Springer, Berlin, pp 245–260

15.

Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: ICML, pp 106–113

16.

Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 681–689

17.

Feldman D, Schmidt M, Sohler C (2013) Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the 24th ACM-SIAM symposium on discrete algorithms (SODA13)

18.

Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications, vol 20. Society for Industrial and Applied Mathematics

19.

Hamerly YFG (2007) PG-means: learning the number of clusters in data. In: Advances in neural information processing systems, vol 19. The MIT Press, Cambridge, pp 393–400

20.

Hawwash B, Nasraoui O (2012) Stream-dashboard: a framework for mining, tracking and validating clusters in a data stream. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications, BigMine ’12ACM, New York, pp 109–117

21.

Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recognit Lett 31(8):651–666CrossRef

22.

Mahafzah BA, Al-Badarneh AF, Zakaria MZ (2009) A new sampling technique for association rule mining. J Inf Sci 35(3):358–376CrossRef

23.

McLachlan GJ, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley-Interscience, New York

24.

OCallaghan L (2003) Approximation algorithms for clustering streams and large data sets. PhD thesis, Stanford University

25.

Pelleg D, Moore A (2000) Others: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the seventeenth international conference on machine learning, vol 1. San Francisco, pp 727–734

26.

Rajaraman A, Ullman JD (2012) Mining of massive datasets. Cambridge University Press, Cambridge

27.

Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web. ACM, pp 1177–1178

28.

Shindler M, Wong A, Meyerson AW (2011) Fast and accurate k-means for large datasets. In. Advances in neural information processing systems, pp 2375–2383

29.

Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J Royal Stat Soc Series B (Stat Methodol) 63(2):411–423CrossRefMATHMathSciNet

30.

Wang L, Leckie C, Ramamohanarao K, Bezdek J (2009) Automatically determining the number of clusters in unlabeled data sets. IEEE Trans Knowl Data Eng 21(3):335–350CrossRef

31.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY (2008) Others: top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef

32.

Xu R, Wunsch D (2008) Clustering, vol 10. Wiley-IEEE Press, New York

33.

Yan M, Ye K (2007) Determining the number of clusters using the weighted gap statistic. Biometrics 63(4):1031–1037CrossRefMATHMathSciNet

34.

Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Cloud computing. Springer, Berlin, pp 674–679

Titel: High performance parallel -means clustering for disk-resident datasets on multi-core CPUs
verfasst von: Ali Hadian
Saeed Shahrivari
Publikationsdatum: 01.08.2014
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 2/2014
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-014-1185-y

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 2/2014

TB-SnW: Trust-based Spray-and-Wait routing for delay-tolerant networks

PGSW-OS: a novel approach for resource management in a semantic web operating system based on a P2P grid architecture

Molecular solutions for minimum and exact cover problems in the tile assembly model

Parallel reconfiguration algorithms for mesh-connected processor arrays

Maximizing network lifetime in wireless sensor networks with regular topologies

A novel real-time scheduling algorithm and performance analysis of a MapReduce-based cloud