Published in: International Journal of Machine Learning and Cybernetics 6/2015

1 December 2015 | Original Article

MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

Author: Simone A. Ludwig

Abstract

The management and analysis of big data has been identified as one of the most important emerging needs in recent years, owing to the sheer volume and increasing complexity of the data being created or collected. Current clustering algorithms cannot handle big data, and therefore scalable solutions are necessary. Since fuzzy clustering algorithms have been shown to outperform hard clustering approaches in terms of accuracy, this paper investigates the parallelization and scalability of a common and effective fuzzy clustering algorithm, the fuzzy c-means (FCM) algorithm. The algorithm is parallelized using the MapReduce paradigm, and the paper outlines how the Map and Reduce primitives are implemented. A validity analysis shows that the implementation works correctly, achieving competitive purity results compared to state-of-the-art clustering algorithms. Furthermore, a scalability analysis demonstrates the performance of the parallel FCM implementation as the number of computing nodes increases.
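
To make the idea of expressing FCM through Map and Reduce primitives more concrete, the following is a minimal single-process sketch of one possible decomposition. It is not the paper's implementation, since the abstract does not give those details. It assumes the standard FCM update rules, a fuzzifier `m = 2`, Euclidean distance, and an in-memory dictionary standing in for the shuffle phase of a real MapReduce framework. All function names (`memberships`, `map_point`, `reduce_cluster`, `fcm_iteration`, `fcm`) are hypothetical.

```python
from collections import defaultdict


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in every cluster (standard FCM formula)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    # A point lying exactly on one or more centers is shared among those centers.
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]


def map_point(point, centers, m=2.0):
    """Map step (assumed design): emit (cluster index, (weighted point, weight))."""
    for j, u in enumerate(memberships(point, centers, m)):
        weight = u ** m
        yield j, ([weight * p for p in point], weight)


def reduce_cluster(values):
    """Reduce step (assumed design): sum the partial results for one cluster
    and return the new center, or None if the cluster received no weight."""
    if not values:
        return None
    numerator = [0.0] * len(values[0][0])
    denominator = 0.0
    for weighted_point, weight in values:
        for d, component in enumerate(weighted_point):
            numerator[d] += component
        denominator += weight
    if denominator == 0.0:
        return None
    return [n / denominator for n in numerator]


def fcm_iteration(points, centers, m=2.0):
    """One FCM iteration written as map, shuffle (a dict here), and reduce."""
    grouped = defaultdict(list)
    for point in points:
        for key, value in map_point(point, centers, m):
            grouped[key].append(value)
    new_centers = []
    for j, old_center in enumerate(centers):
        updated = reduce_cluster(grouped[j])
        # Keep the previous center if the cluster received no weight at all.
        new_centers.append(updated if updated is not None else list(old_center))
    return new_centers


def fcm(points, centers, m=2.0, iterations=20):
    """Run a fixed number of iterations; a real job would test for convergence."""
    for _ in range(iterations):
        centers = fcm_iteration(points, centers, m)
    return centers
```

In this sketch each map call depends only on one data point and the current centers, which is what allows the points to be split across computing nodes. Only the per-cluster sums have to be combined in the reduce step. A call such as `fcm(points, initial_centers)` with a list of coordinate tuples returns the updated centers as lists of floats.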

Metadata
Title: MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability
Author: Simone A. Ludwig
Publication date: 1 December 2015
Publisher: Springer Berlin Heidelberg
Published in: International Journal of Machine Learning and Cybernetics, Issue 6/2015
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI: https://doi.org/10.1007/s13042-015-0367-0
