Published in: International Journal of Machine Learning and Cybernetics 9/2022

5 July 2022 | Original Article

Clustering mixed type data: a space structure-based approach

Authors: Feijiang Li, Yuhua Qian, Jieting Wang, Furong Peng, Jiye Liang

Abstract

Clustering mixed type data is important in areas such as knowledge discovery and machine learning. Although many clustering algorithms have been developed for mixed type data, the task remains challenging. The main difficulty is that the numerical and categorical attributes of mixed type data do not lie in the same space. Most mixed data clustering methods handle the two types of attributes separately, and the gap between numerical and categorical attributes is not handled well. To address these issues, we extend the space structure representation scheme for categorical data to mixed type data. In the new scheme, all attributes of the mixed type data are expressed as numerical attributes in a Euclidean space. In addition, we propose an accelerated approximate space structure based on the Nyström method, which reduces the time and memory cost of constructing a space structure. We then propose general frameworks for mixed type data clustering based on the space structure (SBM) and the accelerated approximate space structure (Ap-SBM). Experimental analyses demonstrate the ability of the space structure to express the original mixed type data and the ability of the accelerated approximate space structure to express the space structure. Experimental results on thirteen mixed type data sets from UCI show the superiority of the proposed frameworks over six other representative mixed type data clustering algorithms.
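The abstract does not give the construction itself, so the following is only a rough sketch of the general idea, not the authors' SBM or Ap-SBM. It assumes that each object is represented by its similarities to a set of reference objects, uses a Gower-style similarity as a stand-in measure, and applies the standard Nyström approximation K ≈ C W⁺ Cᵀ with m landmark objects. The function names `mixed_similarity` and `nystrom_embedding`, the landmark choice, and the `eps` threshold are all hypothetical.

```python
import numpy as np

def mixed_similarity(num_a, cat_a, num_b, cat_b):
    """Gower-style similarity between two sets of mixed-type objects (assumed measure).

    num_*: float arrays scaled to [0, 1], shape (n, p).
    cat_*: integer-coded categorical arrays, shape (n, q).
    Returns an (n_a, n_b) matrix with entries in [0, 1].
    """
    # Numerical attributes: 1 minus the absolute difference, per attribute.
    num_sim = 1.0 - np.abs(num_a[:, None, :] - num_b[None, :, :])
    # Categorical attributes: 1 if the categories match, else 0.
    cat_sim = (cat_a[:, None, :] == cat_b[None, :, :]).astype(float)
    # Average over all attributes of both types.
    return np.concatenate([num_sim, cat_sim], axis=2).mean(axis=2)

def nystrom_embedding(num, cat, landmark_idx, eps=1e-10):
    """Map every object to a Euclidean vector using m landmark objects."""
    # C holds similarities of all n objects to the m landmarks: shape (n, m).
    C = mixed_similarity(num, cat, num[landmark_idx], cat[landmark_idx])
    # W holds similarities among the landmarks themselves: shape (m, m).
    W = C[landmark_idx]
    # Pseudo-inverse square root of W from its eigendecomposition,
    # keeping only eigenvalues above eps.
    vals, vecs = np.linalg.eigh((W + W.T) / 2.0)
    keep = vals > eps
    inv_sqrt = vecs[:, keep] / np.sqrt(vals[keep])
    # Z @ Z.T equals C @ pinv(W) @ C.T, the Nystrom approximation of the
    # full n-by-n similarity matrix, without ever forming that matrix.
    return C @ inv_sqrt
```

The rows of the returned matrix are ordinary numerical vectors, so any Euclidean clustering algorithm (k-means, for instance) could be run on them. Only an n-by-m block of similarities is computed, which is where the time and memory savings of a Nyström-type approximation come from.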

Metadata
Title
Clustering mixed type data: a space structure-based approach
Authors
Feijiang Li
Yuhua Qian
Jieting Wang
Furong Peng
Jiye Liang
Publication date
5 July 2022
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 9/2022
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-022-01602-x
