Skip to main content

2019 | OriginalPaper | Buchkapitel

A High Performance Modified K-Means Algorithm for Dynamic Data Clustering in Multi-core CPUs Based Environments

verfasst von : Giuliano Laccetti, Marco Lapegna, Valeria Mele, Diego Romano

Erschienen in: Internet and Distributed Computing Systems

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

K-means algorithm is one of the most widely used methods in data mining and statistical data analysis to partition several objects in K distinct groups, called clusters, on the basis of their similarities. The main problel and distributed clustering algorithms start to be designem of this algorithm is that it requires the number of clusters as an input data, but in the real life it is very difficult to fix in advance such value. In this work we propose a parallel modified K-means algorithm where the number of clusters is increased at run time in a iterative procedure until a given cluster quality metric is satisfied. To improve the performance of the procedure, at each iteration two new clusters are created, splitting only the cluster with the worst value of the quality metric. Furthermore, experiments in a multi-core CPUs based environment are presented.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Abubaker, M., Ashour, W.M.: Efficient data clustering algorithms: improvements over K-means. Int. J. Intell. Syst. Appl. 5, 37–49 (2013) Abubaker, M., Ashour, W.M.: Efficient data clustering algorithms: improvements over K-means. Int. J. Intell. Syst. Appl. 5, 37–49 (2013)
2.
Zurück zum Zitat Aggarwal, C.C., Reddy, C.K.: Data Clustering, Algorithms and Applications. Chapman and Hall/CRC, London (2013)CrossRef Aggarwal, C.C., Reddy, C.K.: Data Clustering, Algorithms and Applications. Chapman and Hall/CRC, London (2013)CrossRef
3.
Zurück zum Zitat Andrade, G., Ramos, G., Madeira, D., Sachetto, R., Ferreira, R., Rocha, L.: G-DBSCAN: a GPU accelerated algorithm for density-based clustering. Procedia Comput. Sci. 18, 369–378 (2013)CrossRef Andrade, G., Ramos, G., Madeira, D., Sachetto, R., Ferreira, R., Rocha, L.: G-DBSCAN: a GPU accelerated algorithm for density-based clustering. Procedia Comput. Sci. 18, 369–378 (2013)CrossRef
4.
Zurück zum Zitat Boccia, V., Carracciuolo, L., Laccetti, G., Lapegna, M., Mele, V.: HADAB: enabling fault tolerance in parallel applications running in distributed environments. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2011. LNCS, vol. 7203, pp. 700–709. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31464-3_71CrossRef Boccia, V., Carracciuolo, L., Laccetti, G., Lapegna, M., Mele, V.: HADAB: enabling fault tolerance in parallel applications running in distributed environments. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2011. LNCS, vol. 7203, pp. 700–709. Springer, Heidelberg (2012). https://​doi.​org/​10.​1007/​978-3-642-31464-3_​71CrossRef
5.
Zurück zum Zitat Caruso, P., Laccetti, G., Lapegna, M.: A performance contract system in a grid enabling, component based programming environment. In: Sloot, P.M.A., Hoekstra, A.G., Priol, T., Reinefeld, A., Bubak, M. (eds.) EGC 2005. LNCS, vol. 3470, pp. 982–992. Springer, Heidelberg (2005). https://doi.org/10.1007/11508380_100CrossRef Caruso, P., Laccetti, G., Lapegna, M.: A performance contract system in a grid enabling, component based programming environment. In: Sloot, P.M.A., Hoekstra, A.G., Priol, T., Reinefeld, A., Bubak, M. (eds.) EGC 2005. LNCS, vol. 3470, pp. 982–992. Springer, Heidelberg (2005). https://​doi.​org/​10.​1007/​11508380_​100CrossRef
6.
Zurück zum Zitat D’Ambra, P., Danelutto, M., di Serafino, D., Lapegna, M.: Advanced environments for parallel and distributed applications: a view of the current status. Parallel Comput. 28, 1637–1662 (2002)CrossRef D’Ambra, P., Danelutto, M., di Serafino, D., Lapegna, M.: Advanced environments for parallel and distributed applications: a view of the current status. Parallel Comput. 28, 1637–1662 (2002)CrossRef
7.
Zurück zum Zitat D’Ambra, P., Danelutto, M., di Serafino, D., Lapegna, M.: Integrating MPI-based numerical software into an advanced parallel computing environment. In: Proceedings of the Eleventh Euromicro Conference on Parallel Distributed and Network-based Procesing, Clematis ed., pp. 283–291. IEEE (2003) D’Ambra, P., Danelutto, M., di Serafino, D., Lapegna, M.: Integrating MPI-based numerical software into an advanced parallel computing environment. In: Proceedings of the Eleventh Euromicro Conference on Parallel Distributed and Network-based Procesing, Clematis ed., pp. 283–291. IEEE (2003)
8.
Zurück zum Zitat D’Apuzzo, M., Lapegna, M., Murli, A.: Scalability and load balancing in adaptive algorithms for multidimensional integration. Parallel Comput. 23, 1199–1210 (1997)MathSciNetCrossRef D’Apuzzo, M., Lapegna, M., Murli, A.: Scalability and load balancing in adaptive algorithms for multidimensional integration. Parallel Comput. 23, 1199–1210 (1997)MathSciNetCrossRef
9.
Zurück zum Zitat Di Fatta, G., Blasa, F., Cafiero, S., Fortino, G.: Fault tolerant decentralised K-means clustering for asynchronous large-scale networks. J. Parallel Distrib. Comput. 73(2013), 317–329 (2013)CrossRef Di Fatta, G., Blasa, F., Cafiero, S., Fortino, G.: Fault tolerant decentralised K-means clustering for asynchronous large-scale networks. J. Parallel Distrib. Comput. 73(2013), 317–329 (2013)CrossRef
11.
Zurück zum Zitat Frey, P.W., Slate, D.J.: Letter recognition using Holland-style adaptive classifiers. Mach. Learn. 6, 161–182 (1991) Frey, P.W., Slate, D.J.: Letter recognition using Holland-style adaptive classifiers. Mach. Learn. 6, 161–182 (1991)
12.
Zurück zum Zitat Gan, D.G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. SIAM, Philadelphia. ASA, Alexandria (2007) Gan, D.G., Ma, C., Wu, J.: Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability. SIAM, Philadelphia. ASA, Alexandria (2007)
13.
Zurück zum Zitat Gregoretti, F., Laccetti, G., Murli, A., Oliva, G., Scafuri, U.: MGF: a grid-enabled MPI library. Future Gener. Comput. Syst. 24, 158–165 (2008)CrossRef Gregoretti, F., Laccetti, G., Murli, A., Oliva, G., Scafuri, U.: MGF: a grid-enabled MPI library. Future Gener. Comput. Syst. 24, 158–165 (2008)CrossRef
14.
Zurück zum Zitat He, Y., Tan, H., Luo, W., Feng, S., Fan, J.: MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Front. Comput. Sci. 8, 83–99 (2014)MathSciNetCrossRef He, Y., Tan, H., Luo, W., Feng, S., Fan, J.: MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Front. Comput. Sci. 8, 83–99 (2014)MathSciNetCrossRef
15.
Zurück zum Zitat Huang, Z.X.: Extensions to the K-means algorithm for clustering large datasets with categorical values. Data Min. Knowl. Disc. 2, 283–304 (1998)CrossRef Huang, Z.X.: Extensions to the K-means algorithm for clustering large datasets with categorical values. Data Min. Knowl. Disc. 2, 283–304 (1998)CrossRef
16.
Zurück zum Zitat Joshi, A., Kaur, R.: A review: comparative study of various clustering techniques in data mining. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3, 55–57 (2013) Joshi, A., Kaur, R.: A review: comparative study of various clustering techniques in data mining. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3, 55–57 (2013)
17.
Zurück zum Zitat Karypis, G., Kumar, V.: Parallel multilevel K-way partitioning for irregular graphs. SIAM Rev. 41, 278–300 (1999)MathSciNetCrossRef Karypis, G., Kumar, V.: Parallel multilevel K-way partitioning for irregular graphs. SIAM Rev. 41, 278–300 (1999)MathSciNetCrossRef
19.
Zurück zum Zitat Laccetti, G., Lapegna, M., Mele, V., Montella, R.: An adaptive algorithm for high-dimensional integrals on heterogeneous CPUGPU systems. Concurr. Comput. Pract. Exp. 31, e4945 (2018) Laccetti, G., Lapegna, M., Mele, V., Montella, R.: An adaptive algorithm for high-dimensional integrals on heterogeneous CPUGPU systems. Concurr. Comput. Pract. Exp. 31, e4945 (2018)
20.
Zurück zum Zitat Laccetti, G., Lapegna, M., Mele, V., Romano, D., Murli, A.: A double adaptive algorithm for multidimensional integration on multicore based HPC systems. Int. J. Parallel Program. 40, 397–409 (2012)CrossRef Laccetti, G., Lapegna, M., Mele, V., Romano, D., Murli, A.: A double adaptive algorithm for multidimensional integration on multicore based HPC systems. Int. J. Parallel Program. 40, 397–409 (2012)CrossRef
21.
Zurück zum Zitat Laccetti, G., Lapegna, M., Mele, V., Romano, D.: A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2013. LNCS, vol. 8384, pp. 704–713. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55224-3_66CrossRef Laccetti, G., Lapegna, M., Mele, V., Romano, D.: A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2013. LNCS, vol. 8384, pp. 704–713. Springer, Heidelberg (2014). https://​doi.​org/​10.​1007/​978-3-642-55224-3_​66CrossRef
22.
Zurück zum Zitat Laccetti, G., Lapegna, M., Mele, V.: A loosely coordinated model for heap-based priority queues in multicore environments. Int. J. Parallel Prog. 44, 901–921 (2016)CrossRef Laccetti, G., Lapegna, M., Mele, V.: A loosely coordinated model for heap-based priority queues in multicore environments. Int. J. Parallel Prog. 44, 901–921 (2016)CrossRef
23.
Zurück zum Zitat Lapegna, M.: A global adaptive quadrature for the approximate computation of multidimensional integrals on a distributed memory multiprocessor. Concurr. Pract. Exp. 4, 413–426 (1992)CrossRef Lapegna, M.: A global adaptive quadrature for the approximate computation of multidimensional integrals on a distributed memory multiprocessor. Concurr. Pract. Exp. 4, 413–426 (1992)CrossRef
24.
25.
Zurück zum Zitat Pelleg, D., Moore, A.W.: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning, pp. 727–734. Morgan Kaufmann (2000) Pelleg, D., Moore, A.W.: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning, pp. 727–734. Morgan Kaufmann (2000)
26.
Zurück zum Zitat Pena, J.M., Lozano, J.A., Larranaga, P.: An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recogn. Lett. 20, 1027–1040 (1999)CrossRef Pena, J.M., Lozano, J.A., Larranaga, P.: An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recogn. Lett. 20, 1027–1040 (1999)CrossRef
27.
Zurück zum Zitat Shindler, M., Wong, A., Meyerson, A.: Fast and accurate k-means for large datasets. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (eds.): Proceedings of 25th Annual Conference on Neural Information Processing Systems, pp. 2375–2383 (2011) Shindler, M., Wong, A., Meyerson, A.: Fast and accurate k-means for large datasets. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F.C.N., Weinberger, K.Q. (eds.): Proceedings of 25th Annual Conference on Neural Information Processing Systems, pp. 2375–2383 (2011)
29.
Zurück zum Zitat Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2, 165–193 (2015)CrossRef Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2, 165–193 (2015)CrossRef
30.
Zurück zum Zitat Xu, R., Wunsch, D.: Survey of clustering algorithms. Trans. Neural Netw. 16, 645–678 (2005)CrossRef Xu, R., Wunsch, D.: Survey of clustering algorithms. Trans. Neural Netw. 16, 645–678 (2005)CrossRef
Metadaten
Titel
A High Performance Modified K-Means Algorithm for Dynamic Data Clustering in Multi-core CPUs Based Environments
verfasst von
Giuliano Laccetti
Marco Lapegna
Valeria Mele
Diego Romano
Copyright-Jahr
2019
DOI
https://doi.org/10.1007/978-3-030-34914-1_9

Premium Partner