Skip to main content
Top
Published in: Data Mining and Knowledge Discovery 2/2023

18-01-2023

Generalized density attractor clustering for incomplete data

Authors: Richard Leibrandt, Stephan Günnemann

Published in: Data Mining and Knowledge Discovery | Issue 2/2023

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Mean shift is a popular and powerful clustering method for implementing density attractor clustering (DAC). However, DAC is underdeveloped in terms of modeling definitions and methods for incomplete data. Due to DAC’s importance, solving this common issue is crucial. This work makes DAC more versatile by making it applicable to incomplete data: First, using formal modeling definitions, we propose a unifying framework for DAC. Second, we propose new methods that implement the definitions and perform DAC for incomplete data more efficiently and stably than others. We discuss and compare our methods and the closest competitor using theoretical analyses. We quantify the performance of our methods using synthetic datasets with known structures and real-life business data for three missing value types. Finally, we analyze Stack Overflow’s 2021 survey to extract clusters of programmers from India and the USA. The experiments verify our methods’ superiority to six alternatives. Code, Data: https://​bit.​ly/​genDAC

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Appendix
Available only for authorised users
Footnotes
1
Additionally to this acronym, warp kernels get their name in analogy to celestial bodies in astronomy: they describe how their objects’ (information) mass “warp” the target space (see Fig. 1) causing https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figcr_HTML.gif to “gravitate” to the modes, which are mass accumulations.
 
2
An affine space projection https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figea_HTML.gif (a vector-valued function) of a point https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq93_HTML.gif onto the affine space https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figeb_HTML.gif is the unique point https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq94_HTML.gif , such that https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq95_HTML.gif is an element of the complementary subspace of https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figec_HTML.gif in https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figed_HTML.gif . In other words, every vector of https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figee_HTML.gif can be decomposed uniquely as the sum of an element of https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figef_HTML.gif and an element of the complementary subspace. Here, the implementation is https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq96_HTML.gif .
 
3
These formula are verbatim from the original publication. Their symbols do not adhere to our notation; so, to avoid confusion, we use the typewriter font for their symbols and notation. Symbols denoting their mathematical objects might overlap with ours regarding letters, but are distinguishable from ours via font.
 
4
see footnote on page 25.
 
5
see footnote on page 27
 
Literature
go back to reference Abdallah L, Shimshoni I (2014) Mean shift clustering algorithm for data with missing values. In: International Conference on Data Warehousing and Knowledge Discovery, vol 8646. Springer, pp 426–438 Abdallah L, Shimshoni I (2014) Mean shift clustering algorithm for data with missing values. In: International Conference on Data Warehousing and Knowledge Discovery, vol 8646. Springer, pp 426–438
go back to reference Agamennoni G (2013) Bayesian clustering with outliers and missing values. Report ACFR-TR-2013-001, Australian Centre for Field Robotics Agamennoni G (2013) Bayesian clustering with outliers and missing values. Report ACFR-TR-2013-001, Australian Centre for Field Robotics
go back to reference Bacher J, Pöge A, Wenzig K (2000) Clusteranalyse 3.A.: anwendungsorientierte einführung in klassifikationsverfahren. Oldenbourg Wissenschaftsverlag Bacher J, Pöge A, Wenzig K (2000) Clusteranalyse 3.A.: anwendungsorientierte einführung in klassifikationsverfahren. Oldenbourg Wissenschaftsverlag
go back to reference Banerjee A, Dhillon I, Ghosh J et al (2007) A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986MathSciNetMATH Banerjee A, Dhillon I, Ghosh J et al (2007) A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986MathSciNetMATH
go back to reference Biessmann F, Rukat T, Schmidt P et al (2019) Datawig: missing value imputation for tables. J Mach Learn Res 20(175):1–6MATH Biessmann F, Rukat T, Schmidt P et al (2019) Datawig: missing value imputation for tables. J Mach Learn Res 20(175):1–6MATH
go back to reference van Buuren S, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18(6):681–94CrossRef van Buuren S, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18(6):681–94CrossRef
go back to reference Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining. Springer, pp 160–172 Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining. Springer, pp 160–172
go back to reference Carreira-Perpiñán MÁ (2015) A review of mean-shift algorithms for clustering. In: CRC Handbook of cluster analysis. CRC Press, Boca Raton, Florida Carreira-Perpiñán MÁ (2015) A review of mean-shift algorithms for clustering. In: CRC Handbook of cluster analysis. CRC Press, Boca Raton, Florida
go back to reference Chacón JE, Duong T (2020) Multivariate kernel smoothing and its applications, Monogr. Stat. Appl. Probab., vol 160. Chapman and Hall/CRC Chacón JE, Duong T (2020) Multivariate kernel smoothing and its applications, Monogr. Stat. Appl. Probab., vol 160. Chapman and Hall/CRC
go back to reference Chau VTN, Loc PH, Tran VTN (2015) A robust mean shift-based approach to effectively clustering incomplete educational data. In: International conference on advanced computing and applications (ACOMP), pp 12–19 Chau VTN, Loc PH, Tran VTN (2015) A robust mean shift-based approach to effectively clustering incomplete educational data. In: International conference on advanced computing and applications (ACOMP), pp 12–19
go back to reference Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619CrossRef Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619CrossRef
go back to reference Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923CrossRef Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923CrossRef
go back to reference Fashing M, Tomasi C (2005) Mean shift is a bound optimization. IEEE Trans Pattern Anal Mach Intell 27(3):471–474CrossRef Fashing M, Tomasi C (2005) Mean shift is a bound optimization. IEEE Trans Pattern Anal Mach Intell 27(3):471–474CrossRef
go back to reference Fukunaga K, Hostetler LD (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 21(1):32–40MathSciNetCrossRefMATH Fukunaga K, Hostetler LD (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 21(1):32–40MathSciNetCrossRefMATH
go back to reference Günnemann S, Müller E, Raubach S, et al. (2011) Flexible fault tolerant subspace clustering for data with missing values. In: 11th IEEE International Conference on Data Mining, pp 231–240 Günnemann S, Müller E, Raubach S, et al. (2011) Flexible fault tolerant subspace clustering for data with missing values. In: 11th IEEE International Conference on Data Mining, pp 231–240
go back to reference Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Cybern 31(5):735–744 Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Cybern 31(5):735–744
go back to reference Helm MS, Dankovich TM, Mandad S et al (2021) A large-scale nanoscopy and biochemistry analysis of postsynaptic dendritic spines. Nat Neurosci 24:1151–1162CrossRef Helm MS, Dankovich TM, Mandad S et al (2021) A large-scale nanoscopy and biochemistry analysis of postsynaptic dendritic spines. Nat Neurosci 24:1151–1162CrossRef
go back to reference Himmelspach L, Conrad S (2010) Clustering approaches for data with missing values: comparison and evaluation. In: 5th International conference on digital information management (ICDIM) Himmelspach L, Conrad S (2010) Clustering approaches for data with missing values: comparison and evaluation. In: 5th International conference on digital information management (ICDIM)
go back to reference Jadhav A, Pramod D, Ramanathan K (2019) Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell 33(1):913–933CrossRef Jadhav A, Pramod D, Ramanathan K (2019) Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell 33(1):913–933CrossRef
go back to reference Jäger S, Allhorn A, Bießmann F (2021) A benchmark for data imputation methods. Frontiers in Big Data 4 Jäger S, Allhorn A, Bießmann F (2021) A benchmark for data imputation methods. Frontiers in Big Data 4
go back to reference Leibrandt K, Lorenz T, Nierhoff T, et al. (2013) Modelling human gameplay at pool and countering it with an anthropomorphic robot. In: Social robotics. Springer, pp 30–39 Leibrandt K, Lorenz T, Nierhoff T, et al. (2013) Modelling human gameplay at pool and countering it with an anthropomorphic robot. In: Social robotics. Springer, pp 30–39
go back to reference Leibrandt R, Günnemann S (2018) Making kernel density estimation robust towards missing values in highly incomplete multivariate data without imputation. In: SIAM International Conference on Data Mining Leibrandt R, Günnemann S (2018) Making kernel density estimation robust towards missing values in highly incomplete multivariate data without imputation. In: SIAM International Conference on Data Mining
go back to reference Leibrandt R, Günnemann S (2020) Gauss shift: Density attractor clustering faster than mean shift. In: Eur. Conf. Princ. Pract. Knowl. Discov. Databases Leibrandt R, Günnemann S (2020) Gauss shift: Density attractor clustering faster than mean shift. In: Eur. Conf. Princ. Pract. Knowl. Discov. Databases
go back to reference Liao L, Li K, Li K, et al. (2018) A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics. BMC Syst Biol 12(111) Liao L, Li K, Li K, et al. (2018) A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics. BMC Syst Biol 12(111)
go back to reference Muzellec B, Josse J, Boyer C et al. (2020) Missing data imputation using optimal transport. International Conference on Machine Learning PMLR, pp 7130–7140 Muzellec B, Josse J, Boyer C et al. (2020) Missing data imputation using optimal transport. International Conference on Machine Learning PMLR, pp 7130–7140
go back to reference Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830MathSciNetMATH Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830MathSciNetMATH
go back to reference Poulos J, Valle R (2018) Missing data imputation for supervised learning. Appl Artif Intell 32(2):186–196CrossRef Poulos J, Valle R (2018) Missing data imputation for supervised learning. Appl Artif Intell 32(2):186–196CrossRef
go back to reference Romano S, Bailey J, Nguyen V, et al. (2014) Standardized mutual information for clustering comparisons: one step further in adjustment for chance. In: International Conference on Machine Learning, pp 1143–1151 Romano S, Bailey J, Nguyen V, et al. (2014) Standardized mutual information for clustering comparisons: one step further in adjustment for chance. In: International Conference on Machine Learning, pp 1143–1151
go back to reference Romano S, Vinh NX, Bailey J et al (2016) Adjusting for chance clustering comparison measures. J Mach Learn Res 17(134):1–32MathSciNetMATH Romano S, Vinh NX, Bailey J et al (2016) Adjusting for chance clustering comparison measures. J Mach Learn Res 17(134):1–32MathSciNetMATH
go back to reference Schelter S, Rukat T, Biessmann F (2020) Learning to validate the predictions of black box classifiers on unseen data. In: ACM SIGMOD International Conference on Management of Data, p 1289-1299 Schelter S, Rukat T, Biessmann F (2020) Learning to validate the predictions of black box classifiers on unseen data. In: ACM SIGMOD International Conference on Management of Data, p 1289-1299
go back to reference Schnupp P, Leibrandt U (1988) Expertensysteme: Nicht nur für Informatiker. Springer, Springer CompassCrossRef Schnupp P, Leibrandt U (1988) Expertensysteme: Nicht nur für Informatiker. Springer, Springer CompassCrossRef
go back to reference Steinley D, Brusco MJ, Hubert L (2016) The variance of the adjusted rand index. Psychol Methods 21(2):261–72CrossRef Steinley D, Brusco MJ, Hubert L (2016) The variance of the adjusted rand index. Psychol Methods 21(2):261–72CrossRef
go back to reference Timm H, Döring C, Kruse R (2002) Fuzzy cluster analysis of partially missing datasets. In: 2nd Int. W. on Hybr. Meth. for Adap. Sys. I, pp 426–431 Timm H, Döring C, Kruse R (2002) Fuzzy cluster analysis of partially missing datasets. In: 2nd Int. W. on Hybr. Meth. for Adap. Sys. I, pp 426–431
go back to reference Wagstaff KL (2004) Clustering with missing values: no imputation required. In: Meet. Int. Fed. Classif. Soc., pp 649–658 Wagstaff KL (2004) Clustering with missing values: no imputation required. In: Meet. Int. Fed. Classif. Soc., pp 649–658
go back to reference Wand M, Jones MC (1995) Kernel Smoothing. Chapman and Hall/CRC Wand M, Jones MC (1995) Kernel Smoothing. Chapman and Hall/CRC
go back to reference Xue Z, Wang H (2021) Effective density-based clustering algorithms for incomplete data. Big Data Min Anal 4(3):183–194CrossRef Xue Z, Wang H (2021) Effective density-based clustering algorithms for incomplete data. Big Data Min Anal 4(3):183–194CrossRef
go back to reference Yang L, Hou K (2018) A method of incomplete data three-way clustering based on density peaks. In: International conference on computer-aided design, manufacturing, Modeling and Simulation, p 020008 Yang L, Hou K (2018) A method of incomplete data three-way clustering based on density peaks. In: International conference on computer-aided design, manufacturing, Modeling and Simulation, p 020008
Metadata
Title
Generalized density attractor clustering for incomplete data
Authors
Richard Leibrandt
Stephan Günnemann
Publication date
18-01-2023
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 2/2023
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-022-00904-6

Other articles of this Issue 2/2023

Data Mining and Knowledge Discovery 2/2023 Go to the issue

Premium Partner