Published in: Journal of Classification 1/2022

22.11.2021

High-Dimensional Clustering via Random Projections

Authors: Laura Anderlucci, Francesca Fortunato, Angela Montanari



Abstract

This work addresses the unsupervised classification problem for high-dimensional data by exploiting the general idea of a Random Projection Ensemble. Specifically, we propose to generate a set of independent low-dimensional random projections and to perform model-based clustering on each of them. The top B projections, i.e., the projections that show the best grouping structure, are then retained. The final partition is obtained by aggregating the clusters found in the projections via consensus. The performance of the method is assessed on both real and simulated datasets. The results obtained suggest that the proposal is a promising tool for high-dimensional clustering.
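The procedure described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes Gaussian random projections with orthonormalised columns, scores each projection's Gaussian mixture fit by BIC as a stand-in for "best grouping structure", and aggregates the top B partitions through a co-association (consensus) matrix cut with average-linkage clustering. All function and parameter names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def rp_ensemble_cluster(X, n_clusters=3, d=2, n_proj=20, top_b=5, seed=0):
    """Cluster X via model-based clustering on random projections:
    fit a Gaussian mixture on each projection, keep the top_b
    projections by BIC, and aggregate the partitions by consensus."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    results = []
    for _ in range(n_proj):
        # Gaussian random projection matrix, columns orthonormalised
        R = rng.standard_normal((p, d))
        Q, _ = np.linalg.qr(R)
        Z = X @ Q
        gm = GaussianMixture(n_components=n_clusters, random_state=0).fit(Z)
        results.append((gm.bic(Z), gm.predict(Z)))
    # retain the top_b projections with the lowest BIC
    results.sort(key=lambda t: t[0])
    partitions = [lab for _, lab in results[:top_b]]
    # co-association matrix: fraction of retained projections that
    # place each pair of observations in the same cluster
    co = np.zeros((n, n))
    for lab in partitions:
        co += (lab[:, None] == lab[None, :])
    co /= top_b
    # consensus partition: cut an average-linkage tree on 1 - co
    dist = squareform(1.0 - co, checks=False)
    return fcluster(linkage(dist, method="average"),
                    n_clusters, criterion="maxclust")
```

On well-separated simulated groups, the consensus partition recovers the generating structure even though each individual mixture model only ever sees a d-dimensional shadow of the data.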


Footnotes
1
See the R Documentation for the sort function with default settings.
 
Metadata
Title
High-Dimensional Clustering via Random Projections
Authors
Laura Anderlucci
Francesca Fortunato
Angela Montanari
Publication date
22.11.2021
Publisher
Springer US
Published in
Journal of Classification / Issue 1/2022
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-021-09403-7
