Top

Data Mining and Knowledge Discovery

Published in:

18-01-2023

Generalized density attractor clustering for incomplete data

Authors: Richard Leibrandt, Stephan Günnemann

Published in: Data Mining and Knowledge Discovery | Issue 2/2023

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Mean shift is a popular and powerful clustering method for implementing density attractor clustering (DAC). However, DAC is underdeveloped in terms of modeling definitions and methods for incomplete data. Due to DAC’s importance, solving this common issue is crucial. This work makes DAC more versatile by making it applicable to incomplete data: First, using formal modeling definitions, we propose a unifying framework for DAC. Second, we propose new methods that implement the definitions and perform DAC for incomplete data more efficiently and stably than others. We discuss and compare our methods and the closest competitor using theoretical analyses. We quantify the performance of our methods using synthetic datasets with known structures and real-life business data for three missing value types. Finally, we analyze Stack Overflow’s 2021 survey to extract clusters of programmers from India and the USA. The experiments verify our methods’ superiority to six alternatives. Code, Data: https://bit.ly/genDAC

previous article Difference embedding for recommender systems

next article Scalable classifier-agnostic channel selection for multivariate time series classification

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Available only for authorised users

Additionally to this acronym, warp kernels get their name in analogy to celestial bodies in astronomy: they describe how their objects’ (information) mass “warp” the target space (see Fig. 1) causing

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figcr_HTML.gif

to “gravitate” to the modes, which are mass accumulations.

An affine space projection

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figea_HTML.gif

(a vector-valued function) of a point

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq93_HTML.gif

onto the affine space

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figeb_HTML.gif

is the unique point

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq94_HTML.gif

, such that

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq95_HTML.gif

is an element of the complementary subspace of

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figec_HTML.gif

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figed_HTML.gif

. In other words, every vector of

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figee_HTML.gif

can be decomposed uniquely as the sum of an element of

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_Figef_HTML.gif

and an element of the complementary subspace. Here, the implementation is

https://static-content.springer.com/image/art%3A10.1007%2Fs10618-022-00904-6/MediaObjects/10618_2022_904_IEq96_HTML.gif

These formula are verbatim from the original publication. Their symbols do not adhere to our notation; so, to avoid confusion, we use the typewriter font for their symbols and notation. Symbols denoting their mathematical objects might overlap with ours regarding letters, but are distinguishable from ours via font.

see footnote on page 25.

see footnote on page 27

Abdallah L, Shimshoni I (2014) Mean shift clustering algorithm for data with missing values. In: International Conference on Data Warehousing and Knowledge Discovery, vol 8646. Springer, pp 426–438

Agamennoni G (2013) Bayesian clustering with outliers and missing values. Report ACFR-TR-2013-001, Australian Centre for Field Robotics

Bacher J, Pöge A, Wenzig K (2000) Clusteranalyse 3.A.: anwendungsorientierte einführung in klassifikationsverfahren. Oldenbourg Wissenschaftsverlag

Banerjee A, Dhillon I, Ghosh J et al (2007) A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986MathSciNetMATH

Biessmann F, Rukat T, Schmidt P et al (2019) Datawig: missing value imputation for tables. J Mach Learn Res 20(175):1–6MATH

van Buuren S, Boshuizen HC, Knook DL (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18(6):681–94CrossRef

Campello RJGB, Moulavi D, Sander J (2013) Density-based clustering based on hierarchical density estimates. In: Advances in knowledge discovery and data mining. Springer, pp 160–172

Carreira-Perpiñán MÁ (2015) A review of mean-shift algorithms for clustering. In: CRC Handbook of cluster analysis. CRC Press, Boca Raton, Florida

Chacón JE, Duong T (2020) Multivariate kernel smoothing and its applications, Monogr. Stat. Appl. Probab., vol 160. Chapman and Hall/CRC

Chau VTN, Loc PH, Tran VTN (2015) A robust mean shift-based approach to effectively clustering incomplete educational data. In: International conference on advanced computing and applications (ACOMP), pp 12–19

Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619CrossRef

Dietterich TG (1998) Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput 10(7):1895–1923CrossRef

Fashing M, Tomasi C (2005) Mean shift is a bound optimization. IEEE Trans Pattern Anal Mach Intell 27(3):471–474CrossRef

Fukunaga K, Hostetler LD (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 21(1):32–40MathSciNetCrossRefMATH

Günnemann S, Müller E, Raubach S, et al. (2011) Flexible fault tolerant subspace clustering for data with missing values. In: 11th IEEE International Conference on Data Mining, pp 231–240

Hathaway RJ, Bezdek JC (2001) Fuzzy c-means clustering of incomplete data. IEEE Cybern 31(5):735–744

Helm MS, Dankovich TM, Mandad S et al (2021) A large-scale nanoscopy and biochemistry analysis of postsynaptic dendritic spines. Nat Neurosci 24:1151–1162CrossRef

Himmelspach L, Conrad S (2010) Clustering approaches for data with missing values: comparison and evaluation. In: 5th International conference on digital information management (ICDIM)

Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218CrossRefMATH

Jadhav A, Pramod D, Ramanathan K (2019) Comparison of performance of data imputation methods for numeric dataset. Appl Artif Intell 33(1):913–933CrossRef

Jäger S, Allhorn A, Bießmann F (2021) A benchmark for data imputation methods. Frontiers in Big Data 4

Leibrandt K, Lorenz T, Nierhoff T, et al. (2013) Modelling human gameplay at pool and countering it with an anthropomorphic robot. In: Social robotics. Springer, pp 30–39

Leibrandt R, Günnemann S (2018) Making kernel density estimation robust towards missing values in highly incomplete multivariate data without imputation. In: SIAM International Conference on Data Mining

Leibrandt R, Günnemann S (2020) Gauss shift: Density attractor clustering faster than mean shift. In: Eur. Conf. Princ. Pract. Knowl. Discov. Databases

Liao L, Li K, Li K, et al. (2018) A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics. BMC Syst Biol 12(111)

Loader CR (1999) Bandwidth selection: classical or plug-in? Ann Stat 27(2):415–438MathSciNetCrossRefMATH

Muzellec B, Josse J, Boyer C et al. (2020) Missing data imputation using optimal transport. International Conference on Machine Learning PMLR, pp 7130–7140

Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830MathSciNetMATH

Poulos J, Valle R (2018) Missing data imputation for supervised learning. Appl Artif Intell 32(2):186–196CrossRef

Romano S, Bailey J, Nguyen V, et al. (2014) Standardized mutual information for clustering comparisons: one step further in adjustment for chance. In: International Conference on Machine Learning, pp 1143–1151

Romano S, Vinh NX, Bailey J et al (2016) Adjusting for chance clustering comparison measures. J Mach Learn Res 17(134):1–32MathSciNetMATH

Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592MathSciNetCrossRefMATH

Schelter S, Rukat T, Biessmann F (2020) Learning to validate the predictions of black box classifiers on unseen data. In: ACM SIGMOD International Conference on Management of Data, p 1289-1299

Schnupp P, Leibrandt U (1988) Expertensysteme: Nicht nur für Informatiker. Springer, Springer CompassCrossRef

Shortliffe EH, Buchanan BG (1975) A model of inexact reasoning in medicine. Math Biosci 23(3–4):351–379MathSciNetCrossRef

Stack Overflow (2021) Stack Overflow developer survey 2021. https://insights.stackoverflow.com/survey

Steinley D, Brusco MJ, Hubert L (2016) The variance of the adjusted rand index. Psychol Methods 21(2):261–72CrossRef

Timm H, Döring C, Kruse R (2002) Fuzzy cluster analysis of partially missing datasets. In: 2nd Int. W. on Hybr. Meth. for Adap. Sys. I, pp 426–431

Wagstaff KL (2004) Clustering with missing values: no imputation required. In: Meet. Int. Fed. Classif. Soc., pp 649–658

Wand M, Jones MC (1995) Kernel Smoothing. Chapman and Hall/CRC

Xue Z, Wang H (2021) Effective density-based clustering algorithms for incomplete data. Big Data Min Anal 4(3):183–194CrossRef

Yang L, Hou K (2018) A method of incomplete data three-way clustering based on density peaks. In: International conference on computer-aided design, manufacturing, Modeling and Simulation, p 020008

Title: Generalized density attractor clustering for incomplete data
Authors: Richard Leibrandt
Stephan Günnemann
Publication date: 18-01-2023
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 2/2023
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-022-00904-6

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Other articles of this Issue 2/2023

On computing exact means of time series using the move-split-merge metric

Difference embedding for recommender systems

Functional classwise principal component analysis: a classification framework for functional data analysis

MERLIN++: parameter-free discovery of time series anomalies

HARPA: hierarchical attention with relation paths for knowledge graph embedding adversarial learning

Forecast evaluation for data scientists: common pitfalls and best practices

Premium Partner