Top

International Journal of Machine Learning and Cybernetics

Published in:

05-07-2022 | Original Article

Clustering mixed type data: a space structure-based approach

Authors: Feijiang Li, Yuhua Qian, Jieting Wang, Furong Peng, Jiye Liang

Published in: International Journal of Machine Learning and Cybernetics | Issue 9/2022

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Clustering mixed type data is important for the areas such as knowledge discovery and machine learning. Although many clustering algorithms have been developed for mixed type data, clustering mixed type data is still a challenging task. The challenges mainly come from the fact that the numerical attributes and categorical attributes of mixed type data are not in the same space. Most of the mixed data clustering methods handle the two types of attributes separately. The gap between the numerical attributes and categorical attributes is not handled very well. To handle the above issues, we expand the space structure representation scheme for categorical data to mixed type data. In the new scheme, all the attributes of the mixed type data are expressed as the numerical type, which is in a Euclidean space. In addition, we propose an accelerated approximate space structure based on the Nyström method, which reduces the time cost and memory cost of constructing a space structure. We then propose general frameworks based on the space structure data (SBM) and accelerated approximate space structure (Ap-SBM) for mixed type data clustering. Experimental analyses reflect the ability of the space structure to express the original mixed type data and the ability of the accelerated approximate space structure to express the space structure. The experimental results on thirteen mixed type data sets from UCI show superiority of the proposed frameworks compared with the other six representative mixed type data clustering algorithms.

previous article RD-NMSVM: neural mapping support vector machine based on parameter regularization and knowledge distillation

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

ATZelectronics worldwide

ATZlectronics worldwide is up-to-speed on new trends and developments in automotive electronics on a scientific level with a high depth of information.

Order your 30-days-trial for free and without any commitment.

inform now

ATZelektronik

Die Fachzeitschrift ATZelektronik bietet für Entwickler und Entscheider in der Automobil- und Zulieferindustrie qualitativ hochwertige und fundierte Informationen aus dem gesamten Spektrum der Pkw- und Nutzfahrzeug-Elektronik.

Lassen Sie sich jetzt unverbindlich 2 kostenlose Ausgabe zusenden.

inform now

Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323CrossRef

Vegapons S, Ruizshulcloper J (2011) A survey of clustering ensemble algorithms. Int J Pattern Recognit Artif Intell 25(03):337–372MathSciNetCrossRef

Li F, Qian Y, Wang J, Dang C, Liu B (2018) Cluster’s quality evaluation and selective clustering ensemble. ACM Trans Knowl Discov Data 12(5):60CrossRef

Li F, Qian Y, Wang J, Liang J (2017) Multigranulation information fusion: a dempster-shafer evidence theory-based clustering ensemble method. Inf Sci 378:389–409MATHCrossRef

Macqueen J (1965) Some methods for classification and analysis of multivariate observations. In: Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297

Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976MathSciNetMATHCrossRef

Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):1492–1496CrossRef

Aggarwal CC, Procopiuc CM, Yu PS (2002) Finding localized associations in market basket data. IEEE Trans Knowl Data Eng 14(1):51–62CrossRef

Chen H, Chuang K, Chen M (2008) On data labeling for clustering categorical data. IEEE Trans Knowl Data Eng 20(11):1458–1472CrossRef

10.

Huang Z (1997) A fast clustering algorithm to cluster very large categorical data sets in data mining. Research Issues on Data Mining and Knowledge Discovery 1–8

11.

Qian Y, Liang J, Pedrycz W, Dang C (2010) Positive approximation: an accelerator for attribute reduction in rough set theory. Artif Intell 174(9):597–618MathSciNetMATHCrossRef

12.

Bai L, Liang J, Dang C, Cao F (2013) The impact of cluster representatives on the convergence of the k-modes type clustering. IEEE Trans Pattern Anal Mach Intell 35(6):1509–1522CrossRef

13.

Hunt L, Jorgensen M (2011) Clustering mixed data. Wiley Interdisciplinary Rev Data Mining Knowledge Discovery 1(4):352–361CrossRef

14.

Blomstedt P, Tang J, Xiong J, Granlund C, Corander J (2015) A bayesian predictive model for clustering data of mixed discrete and continuous type. IEEE Trans Pattern Anal Mach Intell 37(3):489–498CrossRef

15.

Lam D, Wei M, Wunsch D (2017) Clustering data of mixed categorical and numerical type with unsupervised feature learning. IEEE Access 3(2):1605–1613

16.

Ni X, Quadrianto N, Wang Y, Chen C (2017) Composing tree graphical models with persistent homology features for clustering mixed-type data. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 2622–2631

17.

Jeris C, Jeris C, Jeris C, Jeris C, Jeris C (2001) A robust and scalable clustering algorithm for mixed type attributes in large database environment. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 263–268

18.

Hsu C, Chen C, Su Y (2007) Hierarchical clustering of mixed data based on distance hierarchy. Inf Sci 177(20):4474–4492CrossRef

19.

Li C, Biswas G (2002) Unsupervised learning with mixed numeric and nominal data. IEEE Trans Knowl Data Eng 4:673–690CrossRef

20.

Hsu C, Huang W (2016) Integrated dimensionality reduction technique for mixed-type data involving categorical values. Appl Soft Comput 43:199–209CrossRef

21.

Manuela H, Dominic E, Annette K-S (2017) Clustering of samples and variables with mixed-type data. PLoS ONE 12(11):0188274

22.

Chen J, He H (2016) A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data. Inf Sci 345(C):271–293CrossRef

23.

Huttenlocher DP, Klanderman GA, Rucklidge WJ (1993) Comparing images using the hausdorff distance. IEEE Trans Pattern Anal Mach Intell 15(9):850–863CrossRef

24.

Mao J, Jain AK (1996) A self-organizing network for hyperellipsoidal clustering. IEEE Trans Neural Networks 7(1):16CrossRef

25.

Jarvis RA, Patrick EA (2006) Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput C–22(11):1025–1034CrossRef

26.

Michalski RS, Stepp RE (1983) Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans Pattern Anal Mach Intell 5(4):396–410CrossRef

27.

Wang P, Yao Y (2018) Ce3: a three-way clustering method based on mathematical morphology. Knowl-Based Syst 155:54–65CrossRef

28.

Wang P, Shi H, Yang X, Mi J (2019) Three-way k-means: integrating k-means and three-way decision. Int J Mach Learn Cybern 10:2767–2777CrossRef

29.

Qian Y, Li F, Liang J, Liu B, Dang C (2016) Space structure and clustering of categorical data. IEEE Transact Neural Networks Learn Syst 27(10):2047–2059MathSciNetCrossRef

30.

Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Disc 2(3):283–304CrossRef

31.

Ji J, Pang W, Zhou C, Han X, Wang Z (2012) A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowl-Based Syst 30:129–135CrossRef

32.

Zhao W, Dai W, Tang C (2007) K-centers algorithm for clustering mixed type data. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 1140–1147. Springer

33.

Huang Z (1997) Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the 1st Pacific-asia Conference on Knowledge Discovery and Data Mining, pp. 21–34. Singapore

34.

Goodall DW (1966) A new similarity index based on probability. Biometrics 22(4):882CrossRef

35.

Hsu C, Wang S (2006) An integrated framework for visualized and exploratory pattern discovery in mixed data. IEEE Trans Knowl Data Eng 18(2):161–173CrossRef

36.

Hsu C, Chen Y (2007) Mining of mixed data with application to catalog marketing. Expert Syst Appl 32(1):12–23CrossRef

37.

Liang J, Zhao X, Li D, Cao F, Dang C (2012) Determining the number of clusters using information entropy for mixed data. Pattern Recogn 45(6):2251–2265MATHCrossRef

38.

Rnyi A (1961) On measures of entropy and information. Proc.fourth Berkeley Symp.on Math.statist. & Prob.univ.of Calif 1(5073):547–561

39.

Liang J, Chin K, Dang C, Yam RC (2002) A new method for measuring uncertainty and fuzziness in rough set theory. Int J Gen Syst 31(4):331–342MathSciNetMATHCrossRef

40.

Cheung Y, Jia H (2013) Categorical-and-numerical-attribute data clustering based on a unified similarity metric without knowing cluster number. Pattern Recogn 46(8):2228–2238MATHCrossRef

41.

Gower JC (1971) A general coefficient of similarity and some of its properties. Biometrics 27(4):857–871CrossRef

42.

Wangchamhan T, Chiewchanwattana S, Sunat K (2017) Efficient algorithms based on the k-means and chaotic league championship algorithm for numeric, categorical, and mixed-type data clustering. Expert Syst Appl 90:146–167CrossRef

43.

Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Intell Res 11(1):1–34MathSciNetMATHCrossRef

44.

Stanfill C, Waltz D (1986) Toward memory-based reasoning. Comm Acm 29(12):1213–1228CrossRef

45.

Yuan K, Xu W, Li W, Weiping D (2022) An incremental learning mechanism for object classification based on progressive fuzzy three-way concept. Inf Sci 584:127–147CrossRef

46.

Li M, Chen M, Xu W (2019) Double-quantitative multigranulation decision-theoretic rough fuzzy set model. Int J Mach Learn Cybern 10:3225–3244CrossRef

47.

Xu W, Yu J (2017) A novel approach to information fusion in multi-source datasets: A granular computing viewpoint. Inf Sci 378:410–423MATHCrossRef

48.

Chatzis SP (2011) A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert Syst Appl 38(7):8684–8689CrossRef

49.

Zheng Z, Gong M, Ma J, Jiao L, Wu Q (2010) Unsupervised evolutionary clustering algorithm for mixed type data. In: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 1–8

50.

Ji J, Bai T, Zhou C, Ma C, Wang Z (2013) An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing 120:590–596CrossRef

51.

Foss A, Markatou M, Ray BK, Heching AR (2016) A semiparametric method for clustering mixed data. Mach Learn 105(3):419–458MathSciNetMATHCrossRef

52.

Williams C, Seeger M (2001) Using the nyström method to speed up kernel machines. In: Proceedings of the Advances in Neural Information Processing Systems, pp. 682–688

53.

Charless F, Serge B, Fan C, Jitendra M (2019) Spectral grouping using the nystrm method. IEEE Trans Pattern Anal Mach Intell 26(2):214–25

54.

Chen W-Y, Song Y, Bai H, Lin C-J, Chang EY (2011) Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell 33(3):568–586CrossRef

55.

Cai D, Chen X (2015) Large scale spectral clustering via landmark-based sparse representation. IEEE Transact Cybern 45(8):1669–1680CrossRef

56.

Liang J, Zhao X, Li D, Cao F, Dang C (2012) Determining the number of clusters using information entropy for mixed data. Pattern Recogn 45:2251–2265MATHCrossRef

57.

Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

58.

Reshef DN, Reshef YA, Finucane HK, Grossman SR, Mcvean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524MATHCrossRef

59.

Yang Y (1999) An evaluation of statistical approaches to yext categorization. Inf Retrieval 1:69–90CrossRef

60.

Hubert LJ, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218MATHCrossRef

Title: Clustering mixed type data: a space structure-based approach
Authors: Feijiang Li
Yuhua Qian
Jieting Wang
Furong Peng
Jiye Liang
Publication date: 05-07-2022
Publisher: Springer Berlin Heidelberg
Published in: International Journal of Machine Learning and Cybernetics / Issue 9/2022
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI: https://doi.org/10.1007/s13042-022-01602-x

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

ATZelectronics worldwide

ATZelektronik

Other articles of this Issue 9/2022

A Gaussian RBM with binary auxiliary units

Unsupervised image clustering algorithm based on contrastive learning and K-nearest neighbors

A dynamic multi-swarm cooperation particle swarm optimization with dimension mutation for complex optimization problem

CIRAN: extracting crowd interaction with residual attention network for pedestrian trajectory prediction

Restricted subgradient descend method for sparse signal learning

An improved multi-population whale optimization algorithm