Advances in Data Analysis and Classification

25-08-2022 | Regular Article

Determinantal consensus clustering

Authors: Serge Vicente, Alejandro Murua-Sazo

Published in: Advances in Data Analysis and Classification | Issue 4/2023

Abstract

Random restart of a given algorithm produces many partitions that can be aggregated to yield a consensus clustering. Ensemble methods have been recognized as more robust approaches to data clustering than single clustering algorithms. We propose the use of determinantal point processes (DPPs) for the random restart of clustering algorithms that rely on initial sets of center points, such as k-medoids or k-means. The relation between DPPs and kernel-based methods makes DPPs suitable for describing and quantifying similarity between objects. DPPs favor diversity among the center points in initial sets, so that sets with similar points have less chance of being generated than sets with very distinct points. Currently, most initial sets are generated by sampling center points uniformly at random. We show through extensive simulations that, unlike DPPs, this technique fails both to ensure diversity and to obtain good coverage of all data facets. These two properties are key to the good performance of DPPs. Simulations with artificial datasets and applications to real datasets show that determinantal consensus clustering outperforms consensus clusterings based on uniform random sampling of center points.
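The diversity property described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the Gaussian kernel, its bandwidth, the toy data, and all function names below are assumptions chosen for illustration. In a DPP defined by a kernel matrix L, a subset S has probability proportional to det(L_S), and this determinant shrinks when the points in S are similar. The brute-force sampler enumerates every size-k subset, so it is only feasible for very small datasets.

```python
import itertools
import numpy as np


def gaussian_kernel(X, bandwidth=1.0):
    """Similarity matrix L[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).

    The kernel and bandwidth are illustrative assumptions.
    """
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def subset_weight(L, subset):
    """Unnormalized DPP weight of a subset: determinant of the kernel submatrix."""
    idx = np.asarray(subset)
    return np.linalg.det(L[np.ix_(idx, idx)])


def sample_k_dpp_bruteforce(L, k, rng):
    """Draw one size-k subset with probability proportional to det(L_S).

    Enumerates all size-k subsets, so use it on tiny datasets only.
    """
    subsets = list(itertools.combinations(range(L.shape[0]), k))
    # Clip tiny negative determinants caused by floating-point error.
    weights = np.array([max(subset_weight(L, s), 0.0) for s in subsets])
    probs = weights / weights.sum()
    return subsets[rng.choice(len(subsets), p=probs)]


# Toy data (hypothetical): two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
L = gaussian_kernel(X)

similar = subset_weight(L, (0, 1))  # two centers from the same group
diverse = subset_weight(L, (0, 3))  # one center from each group

rng = np.random.default_rng(0)
centers = sample_k_dpp_bruteforce(L, k=2, rng=rng)
print("weight of similar pair:", similar)
print("weight of diverse pair:", diverse)
print("sampled initial centers:", centers)
```

In this toy setting the weight of the cross-group pair is far larger than that of the within-group pair, so a DPP draw is much more likely to return one center per group than a uniform draw over pairs would be. The sampled indices could then serve as the initial centers for one restart of k-medoids or k-means.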

Title: Determinantal consensus clustering
Authors: Serge Vicente, Alejandro Murua-Sazo
Publication date: 25-08-2022
Publisher: Springer Berlin Heidelberg
Published in: Advances in Data Analysis and Classification / Issue 4/2023
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-022-00514-6
