Published in: International Journal of Machine Learning and Cybernetics 4/2024

21.09.2023 | Original Article

A hybrid similarity measure-based clustering approach for mixed attribute data

Authors: Kexin Chu, Min Zhang, Yaling Xun, Jifu Zhang

Abstract

In mixed attribute clustering, superposing per-attribute similarity measures produces a skewed overall similarity, because different attribute types are measured in different ways. This paper proposes a new clustering approach for mixed attribute data based on a hybrid similarity measure. First, a hybrid similarity measure is defined using information entropy, which reduces the similarity differences among attribute types and alleviates the skew of the superposed similarity measure. Second, a formula for the similarity mean of mixed attributes is defined; it describes the central tendency of the data distribution and can be used to merge clusters, so that similarity threshold parameters need not be set manually. Third, a novel clustering algorithm for mixed attributes is proposed that combines the hybrid similarity measure with an allocation strategy for boundary data objects. Finally, experiments on UCI data sets, artificial data sets, and stellar spectral data sets validate that the algorithm performs well in clustering quality, scalability, and noise resistance, and confirm the stability and effectiveness of the similarity mean.
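
The abstract does not give the paper's formulas, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple scheme in which each attribute is weighted by its Shannon entropy (numeric attributes are binned first), numeric similarity is one minus the range-normalized distance, and categorical similarity is a plain match. The mean of all pairwise similarities then stands in for a data-derived merge threshold. All function names, the binning step, and the weighting rule are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bin_numeric(column, bins=4):
    """Equal-width binning so entropy can be computed for a numeric column."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in column]


def entropy_weights(rows, numeric_idx, bins=4):
    """One weight per attribute, proportional to its entropy (assumed scheme)."""
    n_attrs = len(rows[0])
    entropies = []
    for j in range(n_attrs):
        column = [r[j] for r in rows]
        if j in numeric_idx:
            column = bin_numeric(column, bins)
        entropies.append(shannon_entropy(column))
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / n_attrs] * n_attrs
    return [e / total for e in entropies]


def hybrid_similarity(a, b, weights, numeric_idx, ranges):
    """Weighted sum of per-attribute similarities, each in [0, 1]."""
    sim = 0.0
    for j, w in enumerate(weights):
        if j in numeric_idx:
            span = ranges[j]
            s = 1.0 if span == 0 else 1.0 - abs(a[j] - b[j]) / span
        else:
            s = 1.0 if a[j] == b[j] else 0.0
        sim += w * s
    return sim


def similarity_mean(rows, weights, numeric_idx, ranges):
    """Mean pairwise similarity, used here as a data-derived threshold."""
    pairs = list(combinations(rows, 2))
    total = sum(hybrid_similarity(a, b, weights, numeric_idx, ranges)
                for a, b in pairs)
    return total / len(pairs)


# Hypothetical toy data: one numeric attribute, two categorical ones.
rows = [(1.0, "red", "a"), (1.2, "red", "a"),
        (5.0, "blue", "b"), (5.3, "blue", "a")]
numeric_idx = {0}
ranges = {0: max(r[0] for r in rows) - min(r[0] for r in rows)}

weights = entropy_weights(rows, numeric_idx)
threshold = similarity_mean(rows, weights, numeric_idx, ranges)
for a, b in combinations(rows, 2):
    s = hybrid_similarity(a, b, weights, numeric_idx, ranges)
    print(a, b, round(s, 3), "merge" if s >= threshold else "keep apart")
```

Because every per-attribute similarity lies in [0, 1] and the weights sum to one, no single attribute type can dominate the combined score through its scale alone, which is the kind of skew the abstract describes. Comparing each pair against the mean avoids a hand-picked threshold in this sketch; the paper's own definitions may differ.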

Metadata
Title
A hybrid similarity measure-based clustering approach for mixed attribute data
Authors
Kexin Chu
Min Zhang
Yaling Xun
Jifu Zhang
Publication date
21.09.2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 4/2024
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01968-6
