Skip to main content
Erschienen in: International Journal of Machine Learning and Cybernetics 2/2013

01.04.2013 | Original Article

A hybrid approach to speed-up the k-means clustering method

verfasst von: T. Hitendra Sarma, P. Viswanath, B. Eswara Reddy

Erschienen in: International Journal of Machine Learning and Cybernetics | Ausgabe 2/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

k-means clustering method is an iterative partition-based method which for finite data-sets converges to a solution in a finite time. The running time of this method grows linearly with respect to the size of the data-set. Many variants have been proposed to speed-up the conventional k-means clustering method. In this paper, we propose a prototype-based hybrid approach to speed-up the k-means clustering method. The proposed method, first partitions the data-set into small clusters (grouplets), which are of varying sizes. Each grouplet is represented by a prototype. Later, the set of prototypes is partitioned into k clusters using the modified k-means method. The modified k-means clustering method is similar to the conventional k-means method but it avoids empty clusters (the clusters to which no pattern is assigned) in the iterative process. In each cluster of prototypes, each prototype is replaced by its corresponding set of patterns (which formed the grouplet) to derive a partition of the data-set. Since this partition of the data-set can deviate from the partition obtained using the conventional k-means method over the entire data-set, a correcting step is proposed. Both theoretically and experimentally, the conventional k-means method and the proposed hybrid method (augmented with the correcting step) are shown to yield the same result (provided, the initial k seed points are same). But, the proposed method is much faster than the conventional one. Experimentally, the proposed method is compared with the conventional method and the other recent methods that are proposed to speed-up the k-means method.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Weitere Produktempfehlungen anzeigen
Fußnoten
1
For the sake of simplicity, we assume that the patterns are from a Euclidean space and Euclidean distance is used, whereas the proposed methods are applicable with any distance metric.
 
Literatur
1.
Zurück zum Zitat Almedia MB, Braga A, Braga JP (2000) Svm-km: speeding svms learning with a priory cluster selection and k-means. In: Proceedings of the sixth Brazilian Symposium on Neural Networks 162–167 Almedia MB, Braga A, Braga JP (2000) Svm-km: speeding svms learning with a priory cluster selection and k-means. In: Proceedings of the sixth Brazilian Symposium on Neural Networks 162–167
2.
Zurück zum Zitat Alsabti K, Ranka S, Singh V (1998) An efficient k-means clustering algorithm. In: Proceedings of First Workshop High Performance Data Mining (March 1998) Alsabti K, Ranka S, Singh V (1998) An efficient k-means clustering algorithm. In: Proceedings of First Workshop High Performance Data Mining (March 1998)
3.
Zurück zum Zitat Ananthanarayana V, Murty M, Subramanian D (2001) An incremental data mining algorithm for compact realization of prototypes. Pattern Recognit 34:2249–2251MATHCrossRef Ananthanarayana V, Murty M, Subramanian D (2001) An incremental data mining algorithm for compact realization of prototypes. Pattern Recognit 34:2249–2251MATHCrossRef
4.
Zurück zum Zitat Babu TR, Murty MN (2001) Comparison of genetic algorithms based prototype selection schemes. Pattern Recognit 34:523–525CrossRef Babu TR, Murty MN (2001) Comparison of genetic algorithms based prototype selection schemes. Pattern Recognit 34:523–525CrossRef
5.
Zurück zum Zitat Berkhin P (2002) Survey of clustering data mining techniques. Technical Report, Accure Software Berkhin P (2002) Survey of clustering data mining techniques. Technical Report, Accure Software
6.
Zurück zum Zitat Bidyut Kr. Patra, Sukumar Nandi, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large data-sets. Pattern Recognit 44:2862–2870 Bidyut Kr. Patra, Sukumar Nandi, Viswanath P (2011) A distance based clustering method for arbitrary shaped clusters in large data-sets. Pattern Recognit 44:2862–2870
7.
Zurück zum Zitat Bottou L, Bengio Y (1995) Convergence properties of the k-means algorithms. In: Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, pp. 585–592 Bottou L, Bengio Y (1995) Convergence properties of the k-means algorithms. In: Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, pp. 585–592
8.
Zurück zum Zitat Bradley PS, Fayyad U, Raina C (1998) Scaling clustering algorithms to large databases. In: Proceedings of Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp 9–15 Bradley PS, Fayyad U, Raina C (1998) Scaling clustering algorithms to large databases. In: Proceedings of Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp 9–15
9.
Zurück zum Zitat Chitta R, Murthy MN (2010) Two-level k-means clustering algorithm for k-\(\tau\) relationship establishment and linear-time classification. Pattern Recognit 43:796–804MATHCrossRef Chitta R, Murthy MN (2010) Two-level k-means clustering algorithm for k-\(\tau\) relationship establishment and linear-time classification. Pattern Recognit 43:796–804MATHCrossRef
10.
Zurück zum Zitat Davidson I, Satyanarayana A (2004) Speeding up k-means clustering by bootstrap averaging. IEEE ICDM Davidson I, Satyanarayana A (2004) Speeding up k-means clustering by bootstrap averaging. IEEE ICDM
11.
Zurück zum Zitat Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. Eighteenth International Conference on Machine Learning Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. Eighteenth International Conference on Machine Learning
12.
Zurück zum Zitat Farnstrom F, Lewis J, Elkan C (2000) Scalability for clustering algorithms revisited. SIGKDD Explor 2(1):51–57CrossRef Farnstrom F, Lewis J, Elkan C (2000) Scalability for clustering algorithms revisited. SIGKDD Explor 2(1):51–57CrossRef
13.
Zurück zum Zitat Gereon Frahling, Christian Sohler (2006) A fast k-means implementation using coresets. In: Proceedings of the twenty-second annual symposium on Computational geometry (SCG ’06), pp. 135–143. ACM Press, New York, USA Gereon Frahling, Christian Sohler (2006) A fast k-means implementation using coresets. In: Proceedings of the twenty-second annual symposium on Computational geometry (SCG ’06), pp. 135–143. ACM Press, New York, USA
14.
Zurück zum Zitat Gongde Guo, Si Chen, Lifei Chen (2011) Soft subspace clustering with an improved feature weight self-adjustment mechanism. Int J Mach Learn Cybern. doi:10.1007/s13042-011-0038 Gongde Guo, Si Chen, Lifei Chen (2011) Soft subspace clustering with an improved feature weight self-adjustment mechanism. Int J Mach Learn Cybern. doi:10.​1007/​s13042-011-0038
15.
16.
Zurück zum Zitat Guha S, Rastogi R, Shim K (1998) Cure:an efficient clustering algorithm for large databases. In: Proceedings of Conference Management of Data (ACM SIGMOD’98), pp 73–84 Guha S, Rastogi R, Shim K (1998) Cure:an efficient clustering algorithm for large databases. In: Proceedings of Conference Management of Data (ACM SIGMOD’98), pp 73–84
17.
Zurück zum Zitat Han J, Kamber M (2000) Data mining: concepts and techniques. 2nd edn. Morgan Kaufmann, USA Han J, Kamber M (2000) Data mining: concepts and techniques. 2nd edn. Morgan Kaufmann, USA
18.
Zurück zum Zitat Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH
19.
Zurück zum Zitat Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, Englewood CliffsMATH Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, Englewood CliffsMATH
20.
Zurück zum Zitat Jain A, Murthy MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef Jain A, Murthy MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef
21.
Zurück zum Zitat Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef
22.
Zurück zum Zitat Jain AK, Duin P, Mao J (2000) Statistical pattern recognition: A review. IEEE Transact Pattern Anal Mach Intell 22(1):4–37CrossRef Jain AK, Duin P, Mao J (2000) Statistical pattern recognition: A review. IEEE Transact Pattern Anal Mach Intell 22(1):4–37CrossRef
24.
Zurück zum Zitat Abdul Nazeer KA, Sebastian MP (2009) Improving the accuracy and efficiency of the k-means clustering algorithm. In: Proceedings of the World Congress on Engineering 2009 vol I. London, U.K. Abdul Nazeer KA, Sebastian MP (2009) Improving the accuracy and efficiency of the k-means clustering algorithm. In: Proceedings of the World Congress on Engineering 2009 vol I. London, U.K.
25.
Zurück zum Zitat Kanungo T, Mount D, Netanyahu N, Piatko C, Silverman R, Wu A The analysis of a simple k-means clustering algorithm. In: Proceedings of 16th Annual ACM Symposium Computational Geometry, pp 100–109 (June 2000) Kanungo T, Mount D, Netanyahu N, Piatko C, Silverman R, Wu A The analysis of a simple k-means clustering algorithm. In: Proceedings of 16th Annual ACM Symposium Computational Geometry, pp 100–109 (June 2000)
26.
Zurück zum Zitat Kanungo T, Mount DM, Netanyahu NS, christine D Piatko, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transact Pattern Anal Mach Intell 24(7):881–892CrossRef Kanungo T, Mount DM, Netanyahu NS, christine D Piatko, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transact Pattern Anal Mach Intell 24(7):881–892CrossRef
27.
Zurück zum Zitat Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New YorkCrossRef Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New YorkCrossRef
28.
Zurück zum Zitat Krishna K, Murty M (1999) Genetic k-means algorithm. IEEE Transact Syst Man Cybern Part B Cybern 29(3):433–439CrossRef Krishna K, Murty M (1999) Genetic k-means algorithm. IEEE Transact Syst Man Cybern Part B Cybern 29(3):433–439CrossRef
30.
Zurück zum Zitat Lu Y, Lu S, Fotouhi F, Deng Y, Brown SJ (2004) Fgka: A fast genetic k-means clustering algorithm. In: Proceedings of ACM Symposium on Applied Computing. pp 622–623 Lu Y, Lu S, Fotouhi F, Deng Y, Brown SJ (2004) Fgka: A fast genetic k-means clustering algorithm. In: Proceedings of ACM Symposium on Applied Computing. pp 622–623
31.
Zurück zum Zitat MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, pp. 281–297 MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, pp. 281–297
32.
Zurück zum Zitat Pakhira MK (2009) A modified k-means algorithm to avoid empty clusters. Int J Recent Trends Eng 1(1):220–226 Pakhira MK (2009) A modified k-means algorithm to avoid empty clusters. Int J Recent Trends Eng 1(1):220–226
33.
Zurück zum Zitat parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1): 90–105CrossRef parsons L, Haque E, Liu H (2004) Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl 6(1): 90–105CrossRef
34.
Zurück zum Zitat Pelleg D, Moore A Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 277–281 (Aug 1999) Pelleg D, Moore A Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 277–281 (Aug 1999)
35.
Zurück zum Zitat Pelleg D, Moore A (2000) x-means: Extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning (July 2000) Pelleg D, Moore A (2000) x-means: Extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning (July 2000)
37.
Zurück zum Zitat Jin R, Goswami A, Agarwal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10(1):17–40CrossRef Jin R, Goswami A, Agarwal G (2006) Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10(1):17–40CrossRef
38.
Zurück zum Zitat Fahim AM, Salem AM, Torkey A, Ramadan MA (2006) An efficient enhanced k-means clustering algorithm. J Zhejiang Univ 10(7):1626–1633CrossRef Fahim AM, Salem AM, Torkey A, Ramadan MA (2006) An efficient enhanced k-means clustering algorithm. J Zhejiang Univ 10(7):1626–1633CrossRef
39.
Zurück zum Zitat Na S, Xumin L, Yong G (2010) Research on k-means clustering algorithm: an improved k-means clustering algorithm. In: Third International Symposium on Intelligent Information Technology and Security Informatics (IITSI), pp 63–67 (April 2010) Na S, Xumin L, Yong G (2010) Research on k-means clustering algorithm: an improved k-means clustering algorithm. In: Third International Symposium on Intelligent Information Technology and Security Informatics (IITSI), pp 63–67 (April 2010)
40.
Zurück zum Zitat Spath H (1980) Cluster analysis algorithms for data reduction and classification. Ellis Horwood, Chichester Spath H (1980) Cluster analysis algorithms for data reduction and classification. Ellis Horwood, Chichester
41.
Zurück zum Zitat Phillips SJ (2002) Acceleration of k-means and related clustering algorithms. In: Proceedings of Algorithms Engineering and Experiments(ALENEX02), pp. 166–177. Springer, Berlin Phillips SJ (2002) Acceleration of k-means and related clustering algorithms. In: Proceedings of Algorithms Engineering and Experiments(ALENEX02), pp. 166–177. Springer, Berlin
42.
Zurück zum Zitat Hitendra Sarma T, Viswanath P (2009) Speeding-up the k-means clustering method: A prototype based approach. In: Proceedings of 3rd International Conference on Pattern Recognition and Machine Intelligence(PReMI)LNCS 5909, pp. 56–61. Springer, Berlin Hitendra Sarma T, Viswanath P (2009) Speeding-up the k-means clustering method: A prototype based approach. In: Proceedings of 3rd International Conference on Pattern Recognition and Machine Intelligence(PReMI)LNCS 5909, pp. 56–61. Springer, Berlin
43.
Zurück zum Zitat Vijaya P, Murty MN, Subramanian DK (2004) Leaders-subleaders: an efficient hierarchical clustering algorithm for large data sets. Pattern Recognit Lett 25:505–513CrossRef Vijaya P, Murty MN, Subramanian DK (2004) Leaders-subleaders: an efficient hierarchical clustering algorithm for large data sets. Pattern Recognit Lett 25:505–513CrossRef
44.
Zurück zum Zitat Viswanath P, Pinkesh R (2006) l-dbscan : a fast hybrid density based clustering method. In: Proceedings of the 18th Intl. Conf. on Pattern Recognition (ICPR-06), vol. 1, pp. 912–915. IEEE Computer Society, Hong Kong Viswanath P, Pinkesh R (2006) l-dbscan : a fast hybrid density based clustering method. In: Proceedings of the 18th Intl. Conf. on Pattern Recognition (ICPR-06), vol. 1, pp. 912–915. IEEE Computer Society, Hong Kong
47.
Zurück zum Zitat Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu P (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu P (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37CrossRef
48.
Zurück zum Zitat Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Transact Neural Netw 16(3):645–678CrossRef Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Transact Neural Netw 16(3):645–678CrossRef
Metadaten
Titel
A hybrid approach to speed-up the k-means clustering method
verfasst von
T. Hitendra Sarma
P. Viswanath
B. Eswara Reddy
Publikationsdatum
01.04.2013
Verlag
Springer-Verlag
Erschienen in
International Journal of Machine Learning and Cybernetics / Ausgabe 2/2013
Print ISSN: 1868-8071
Elektronische ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-012-0079-7

Weitere Artikel der Ausgabe 2/2013

International Journal of Machine Learning and Cybernetics 2/2013 Zur Ausgabe

Neuer Inhalt