Published in: Neural Computing and Applications 24/2020

4 February 2019 | WSOM 2017

A fuzzy data reduction cluster method based on boundary information for large datasets

Authors: Gustavo R. L. Silva, Paulo C. Neto, Luiz C. B. Torres, Antônio P. Braga

Abstract

The fuzzy c-means (FCM) algorithm computes the membership degree of each data point with respect to the cluster centers. This computation requires the distance matrix between the cluster centers and the data points, and updating the membership matrix for all data points is the main bottleneck of FCM. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means). The algorithm builds on the original FCM proposal and is adapted to detect and remove the boundary regions of clusters. Our implementation efforts target two aspects: processing large datasets in less time, and reducing the data volume while maintaining the quality of the clusters. We applied the method to a large volume of real data (> 10^6 records) and found that the bdrFCM implementation scales well to datasets with millions of data points.
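
To make the cost described in the abstract concrete, the following is a minimal sketch of the standard FCM membership update, which needs one distance per (data point, cluster center) pair. The function names, the fuzzifier default `m=2.0`, and the `eps` guard are illustrative choices. The `drop_boundary_points` helper is purely hypothetical: the abstract does not say how bdrFCM detects boundary regions, so treating points with no dominant membership as "boundary" is an assumption made here for illustration, not the authors' method.

```python
import math


def fcm_memberships(points, centers, m=2.0, eps=1e-12):
    """Standard FCM membership update.

    Returns one row per data point; row k holds the membership of
    point k in each cluster, and every row sums to 1.
    """
    exponent = -2.0 / (m - 1.0)
    memberships = []
    for x in points:
        # One row of the distance matrix: distances from x to every center.
        # eps avoids division by zero when a point coincides with a center.
        dists = [max(math.dist(x, c), eps) for c in centers]
        weights = [d ** exponent for d in dists]
        total = sum(weights)
        memberships.append([w / total for w in weights])
    return memberships


def drop_boundary_points(points, memberships, threshold=0.6):
    """Hypothetical reduction step (an assumption, not taken from the paper):
    keep only points whose largest membership reaches the threshold."""
    return [x for x, u in zip(points, memberships) if max(u) >= threshold]


# Toy data: two tight groups and one point halfway between the centers.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.5, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]

u = fcm_memberships(points, centers)
reduced = drop_boundary_points(points, u)
```

The loop touches every point once per center, so the work grows with the number of points times the number of clusters in every iteration. That is why discarding points before later iterations can pay off on datasets with millions of records.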

Metadata
Title
A fuzzy data reduction cluster method based on boundary information for large datasets
Authors
Gustavo R. L. Silva
Paulo C. Neto
Luiz C. B. Torres
Antônio P. Braga
Publication date
4 February 2019
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 24/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-019-04049-4
