Published in: Neural Computing and Applications 17/2021

2 January 2021 | Original Article

A multidisciplinary ensemble algorithm for clustering heterogeneous datasets

Written by: Bryar A. Hassan, Tarik A. Rashid

Abstract

Clustering is a commonly used method for exploring and analysing data, where the primary objective is to group similar observations into clusters. In recent decades, several algorithms and methods have been developed for analysing clustered data. Most of these techniques deterministically define a cluster based on the attribute values, distance, and density of homogeneous, single-featured datasets. However, these definitions do not succeed in adding clear semantic meaning to the clusters produced. Evolutionary operators and statistical and multidisciplinary techniques may help in generating meaningful clusters. Based on this premise, we propose a new evolutionary clustering algorithm (ECA*) based on social class ranking and meta-heuristic algorithms for stochastically analysing heterogeneous and multifeatured datasets. The ECA* is integrated with recombinational evolutionary operators, Levy flight optimisation, and some statistical techniques, such as quartiles and percentiles, as well as the Euclidean distance of the K-means algorithm. Experiments are conducted to evaluate the ECA* against five conventional approaches: K-means (KM), K-means++ (KM++), expectation maximisation (EM), learning vector quantisation (LVQ), and the genetic algorithm for clustering++ (GENCLUST++). To that end, 32 heterogeneous and multifeatured datasets are used to examine the algorithms' performance using internal, external, and basic statistical clustering performance measures, and to measure how sensitive their performance is to five features of these datasets (cluster overlap, the number of clusters, cluster dimensionality, the cluster structure, and the cluster shape) in the form of an operational framework. The results indicate that the ECA* surpasses its counterpart techniques in its ability to find the right clusters. Significantly, compared to its counterpart techniques, the ECA* is less sensitive to the five properties of the datasets mentioned above. The order of overall performance of these algorithms, from best to worst, is the ECA*, EM, KM++, KM, LVQ, and GENCLUST++. Meanwhile, the overall performance rank of the ECA* is 1.1 (where a rank of 1 represents the best performing algorithm and a rank of 6 the worst) for the 32 datasets, based on the five dataset features mentioned above.
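Two of the building blocks named in the abstract, the Euclidean distance used by K-means and Levy flight optimisation, can be illustrated with a minimal sketch. This is not the authors' ECA* implementation: the function names, the `beta` and `scale` values, and the choice of Mantegna's algorithm for drawing Levy-distributed steps are all assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(points, centroids):
    """Label each point with the index of its closest centroid (K-means style)."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
        for p in points
    ]


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step with Mantegna's algorithm.

    beta is the stability index (0 < beta <= 2); 1.5 is an assumed value,
    not one taken from the paper.
    """
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # guard against a division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def perturb_centroid(centroid, scale=0.01, beta=1.5, rng=random):
    """Move a centroid by a small Levy-distributed offset in every dimension.

    scale is a hypothetical step-size factor chosen for this sketch.
    """
    return [c + scale * levy_step(beta, rng) for c in centroid]


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    centroids = [(0.0, 0.0), (9.0, 9.0)]
    print(assign_to_nearest(points, centroids))
    print(perturb_centroid(list(centroids[1]), rng=random.Random(0)))
```

Most Levy steps are small, with occasional long jumps. That occasional long jump is the usual reason for using such steps in meta-heuristic search, since it can help a candidate solution escape a poor local optimum.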


Metadata
Title
A multidisciplinary ensemble algorithm for clustering heterogeneous datasets
Written by
Bryar A. Hassan
Tarik A. Rashid
Publication date
2 January 2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 17/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-020-05649-1
