07-10-2024 | Original Research

Mixed-Type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning

Authors: Jesse S. Ghashti, John R. J. Thompson

Published in: Journal of Classification

Abstract

Distance-based clustering is widely used to group data containing both numeric and categorical variables (mixed-type data), where a predefined metric quantifies the dissimilarity between data points. However, many existing metrics for mixed-type data either convert continuous attributes to categorical attributes (or vice versa) and treat all variables as a single type, or compute a distance for each variable separately and combine them. We propose a flexible kernel metric learning approach that balances numeric and categorical data types while determining which variables are relevant to the dissimilarities within a dataset. The distance using kernel product similarity (DKPS) function measures similarity with kernel functions, using a maximum similarity cross-validated (MSCV) bandwidth selection technique that automatically scales variables and selects those relevant to the underlying dissimilarities between data points. We prove that the DKPS function is a metric and show that it acts as a shrinkage method, interpolating between maximum dissimilarity between all data points and uniform dissimilarity across all data points. We demonstrate that using the DKPS metric in various distance-based clustering algorithms improves clustering accuracy on simulated and real-world mixed-type datasets. In the context of clustering, we show that the DKPS metric with MSCV bandwidths smooths out irrelevant variables and balances the variables important to dissimilarity within mixed-type datasets.
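The abstract describes a distance built from a product of per-variable kernel similarities. The following is a minimal sketch of that idea, not the paper's actual method: the exact kernel choices, normalisation, and the MSCV bandwidth objective are not reproduced here, and the function names and kernel forms (a Gaussian kernel per numeric variable, an Aitchison-Aitken-style kernel per categorical variable) are illustrative assumptions.

```python
import math

def kernel_product_similarity(x, y, num_idx, cat_idx, bw_num, bw_cat):
    """Similarity as a product of per-variable kernels (illustrative sketch):
    a Gaussian kernel for each numeric variable and an Aitchison-Aitken-style
    kernel for each categorical variable, each with its own bandwidth."""
    s = 1.0
    for j, h in zip(num_idx, bw_num):
        # Gaussian kernel on the numeric variable at index j, bandwidth h
        s *= math.exp(-0.5 * ((x[j] - y[j]) / h) ** 2)
    for j, lam in zip(cat_idx, bw_cat):
        # weight 1 on a category match, lam in [0, 1) on a mismatch
        s *= 1.0 if x[j] == y[j] else lam
    return s

def dkps_distance(x, y, num_idx, cat_idx, bw_num, bw_cat):
    # Every point has self-similarity 1 under these kernels, so a
    # kernel-induced distance sqrt(k(x,x) + k(y,y) - 2 k(x,y))
    # simplifies to sqrt(2 - 2 s).
    s = kernel_product_similarity(x, y, num_idx, cat_idx, bw_num, bw_cat)
    return math.sqrt(2.0 - 2.0 * s)

# one numeric variable (index 0) and one categorical variable (index 1)
x, y = [0.0, "a"], [1.0, "b"]
print(dkps_distance(x, y, [0], [1], [1.0], [0.5]))     # moderate bandwidths
print(dkps_distance(x, y, [0], [1], [100.0], [0.99]))  # smoother bandwidths shrink the distance
```

Under this sketch the shrinkage behaviour mentioned in the abstract is visible in the bandwidths: pushing them toward their smooth limits (h large, lam near 1) makes all pairs nearly equally similar, shrinking every distance toward zero (uniform dissimilarity), while sharp bandwidths (h near 0, lam near 0) push every distinct pair toward the maximum distance sqrt(2).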


Metadata
Publisher: Springer US
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI: https://doi.org/10.1007/s00357-024-09493-z