Neural Computing and Applications

4 February 2019 | WSOM 2017

A fuzzy data reduction cluster method based on boundary information for large datasets

Authors: Gustavo R. L. Silva, Paulo C. Neto, Luiz C. B. Torres, Antônio P. Braga

Published in: Neural Computing and Applications | Issue 24/2020

Abstract

The fuzzy c-means (FCM) algorithm computes the membership degree of each data point with respect to each cluster center. This computation requires the matrix of distances between every cluster center and every data point. The main bottleneck of the FCM algorithm is computing the membership matrix for all data points. This work presents a new clustering method, bdrFCM (boundary data reduction fuzzy c-means). Our algorithm is based on the original FCM proposal, adapted to detect and remove the boundary regions of clusters. Our implementation efforts pursue two goals: processing large datasets in less time, and reducing the data volume while maintaining the quality of the clusters. The method was evaluated on a significant volume of real data (> 10⁶ records), and we found that the bdrFCM implementation scales well to datasets with millions of data points.
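
To make the bottleneck concrete, the following is a minimal sketch of the standard FCM membership update, which needs one distance per (data point, cluster center) pair on every iteration. The function names, the fuzzifier default `m=2.0`, and in particular the `drop_boundary` helper with its `threshold` parameter are assumptions for illustration only: the abstract does not state how bdrFCM detects boundary regions, so the rule used here (discard points whose largest membership is low) is a hypothetical stand-in, not the authors' criterion.

```python
import math


def memberships(points, centers, m=2.0):
    """Standard FCM membership update.

    Returns U, where U[i][j] is the membership of points[i] in the
    cluster with center centers[j]. Each row sums to 1.
    """
    exponent = 2.0 / (m - 1.0)
    U = []
    for x in points:
        # One distance per (point, center) pair: n * c distances in total.
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:
            # The point coincides with one or more centers: share full
            # membership among those centers to avoid dividing by zero.
            zeros = d.count(0.0)
            U.append([1.0 / zeros if dj == 0.0 else 0.0 for dj in d])
            continue
        U.append([1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d])
    return U


def drop_boundary(points, U, threshold=0.7):
    """Hypothetical boundary filter (an assumption, not the paper's rule).

    Keeps only the points whose largest membership is at least
    `threshold`, i.e. points that clearly belong to one cluster.
    """
    return [x for x, row in zip(points, U) if max(row) >= threshold]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
    ctr = [(0.0, 0.0), (10.0, 0.0)]
    U = memberships(pts, ctr)
    kept = drop_boundary(pts, U, threshold=0.7)
    print(len(pts), "points before reduction,", len(kept), "after")
```

In this toy example the point midway between the two centers has equal membership in both clusters, so a threshold-based filter of this kind would treat it as a boundary point and drop it.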

Title: A fuzzy data reduction cluster method based on boundary information for large datasets
Authors: Gustavo R. L. Silva, Paulo C. Neto, Luiz C. B. Torres, Antônio P. Braga
Publication date: 4 February 2019
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 24/2020
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-019-04049-4
