1 May 2016 | Regular Paper

On strategies for building effective ensembles of relative clustering validity criteria

Authors: Pablo A. Jaskowiak, Davoud Moulavi, Antonio C. S. Furtado, Ricardo J. G. B. Campello, Arthur Zimek, Jörg Sander

Published in: Knowledge and Information Systems | Issue 2/2016

Abstract

Evaluation and validation are essential tasks for achieving meaningful clustering results. Relative validity criteria are measures usually employed in practice to select and validate clustering solutions, as they enable the evaluation of single partitions and the comparison of partition pairs in relative terms based only on the data under analysis. There is a plethora of relative validity measures described in the clustering literature, thus making it difficult to choose an appropriate measure for a given application. One reason for such a variety is that no single measure can capture all different aspects of the clustering problem and, as such, each of them is prone to fail in particular application scenarios. In the present work, we take advantage of the diversity in relative validity measures from the clustering literature. Previous work showed that when randomly selecting different relative validity criteria for an ensemble (from an initial set of 28 different measures), one can expect with great certainty to only improve results over the worst criterion included in the ensemble. In this paper, we propose a method for selecting measures with minimum effectiveness and some degree of complementarity (from the same set of 28 measures) into ensembles, which show superior performance when compared to any single ensemble member (and not just the worst one) over a variety of different datasets. One can also expect greater stability in terms of evaluation over different datasets, even when considering different ensemble strategies. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
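
The abstract does not say how ensemble members are combined. As a minimal sketch, assuming a Borda-style mean-rank aggregation (one of several possible ensemble strategies), the idea of letting several relative validity criteria vote on candidate partitions could look as follows. The criterion names, the scores, and the helper functions `ranks` and `mean_rank_ensemble` are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from statistics import mean


def ranks(scores, higher_is_better=True):
    """Return 1-based ranks (1 = best); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of tied scores.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = avg_rank
        pos = end + 1
    return result


def mean_rank_ensemble(criteria_scores, higher_is_better):
    """Average the per-criterion ranks of each candidate partition."""
    per_criterion = [
        ranks(scores, higher_is_better[name])
        for name, scores in criteria_scores.items()
    ]
    n_partitions = len(per_criterion[0])
    return [mean(r[i] for r in per_criterion) for i in range(n_partitions)]


# Hypothetical scores of three criteria on four candidate partitions.
criteria_scores = {
    "silhouette":        [0.41, 0.62, 0.55, 0.30],
    "calinski_harabasz": [120.0, 210.0, 230.0, 90.0],
    "davies_bouldin":    [1.10, 0.70, 0.85, 1.40],
}
# Assumed orientation of each criterion (Davies-Bouldin is minimized).
higher_is_better = {
    "silhouette": True,
    "calinski_harabasz": True,
    "davies_bouldin": False,
}

combined = mean_rank_ensemble(criteria_scores, higher_is_better)
# The ensemble prefers the partition with the lowest mean rank.
best = min(range(len(combined)), key=combined.__getitem__)
```

In this toy input the criteria disagree on the top candidate (two favor the second partition, one the third), and the mean rank resolves the disagreement. Selecting which criteria to include in the first place, which is the paper's actual contribution, is not covered by this sketch.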

They are used only in very particular applications, such as evaluation of clustering stability via resampling [9] or assessment of diversity in clustering ensembles [44].

1.

Albalate A, Suendermann D (2009) A combination approach to cluster validation based on statistical quantiles. In: International joint conference on bioinformatics, systems biology and intelligent computing—IJCBS, pp 549–555

2.

Baya AE, Granitto PM (2013) How many clusters: a validation index for arbitrary-shaped clusters. IEEE/ACM Trans Comput Biol Bioinf 10(2):401–414

3.

Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern B 28(3):301–315

4.

Bolshakova N, Azuaje F (2003) Cluster validation techniques for genome expression data. Sig Process 83(4):825–833

5.

Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27

6.

Cormack GV, Clarke CLA, Buettcher S (2009) Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’09, pp 758–759

7.

Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1:224–227

8.

de Borda JC (1781) Mémoire sur les élections au scrutin. Histoire de l’Academie Royale des Sciences, pp 657–665

9.

Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7):0036.1–0036.21

10.

Dunn JC (1974) Well separated clusters and optimal fuzzy partitions. J Cybern 4:95–104

11.

Dwork C, Kumar R, Naor M, Sivakumar D (2001) Rank aggregation methods for the web. In: Proceedings of the 10th international conference on World Wide Web, pp 613–622

12.

Estivill-Castro V (2002) Why so many clustering algorithms: a position paper. ACM SIGKDD Explor 4(1):65–75

13.

Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

14.

Färber I, Günnemann S, Kriegel HP, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings held in conjunction with KDD 2010, Washington, DC

15.

Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications. ASA-SIAM

16.

Geusebroek JM, Burghouts GJ, Smeulders AWM (2005) The Amsterdam library of object images. Int J Comput Vision 61(1):103–112

17.

Ghosh J, Acharya A (2011) Cluster ensembles. Wiley Interdiscip Rev Data Mining Knowl Discov 1(4):305–315

18.

Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17:107–145

19.

Hartigan JA (1975) Clustering algorithms. Wiley, New York

20.

Hill RS (1980) A stopping rule for partitioning dendrograms. Bot Gaz 141:321–324

21.

Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recogn 45(12):4370–4388

22.

Hruschka ER, Campello RJGB, Castro LN (2004) Improving the efficiency of a clustering genetic algorithm. In: Ibero-American conference on artificial intelligence—IBERAMIA, vol 3315, pp 861–870

23.

Hruschka ER, Campello RJGB, Castro LN (2006) Evolving clusters in gene-expression data. Inf Sci 176:1898–1927

24.

Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

25.

Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychol Bull 10:1072–1080

26.

Jaccard P (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull Soc Vaudoise Sci Nat 37:241–272

27.

Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31:651–666

28.

Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, Englewood Cliffs

29.

Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323

30.

Kaufman L, Rousseeuw P (1990) Finding groups in data. Wiley, New York

31.

Klementiev A, Roth D, Small K (2007) An unsupervised learning algorithm for rank aggregation. In: Proceedings of the 18th European conference on machine learning (ECML), Warsaw, Poland, pp 616–623

32.

Kolde R, Laur S, Adler P, Vilo J (2012) Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics 28(4):573–580

33.

Kriegel HP, Kröger P, Sander J, Zimek A (2011a) Density-based clustering. Wiley Interdiscip Rev Data Mining Knowl Discov 1(3):231–240

34.

Kriegel HP, Kröger P, Schubert E, Zimek A (2011b) Interpreting and unifying outlier scores. In: Proceedings of the 11th SIAM international conference on data mining (SDM), Mesa, AZ, pp 13–24

35.

Kuncheva L, Whitaker C (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 51(2):181–207

36.

Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of the 11th ACM International conference on knowledge discovery and data mining (SIGKDD), Chicago, IL, pp 157–166

37.

Machado JB, Campello RJGB, Amaral WC (2007) Design of OBF-TS fuzzy models based on multiple clustering validity criteria. In: International conference on tools with artificial intelligence—ICTAI, pp 336–339

38.

Marquis de Condorcet MJANC (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. L’Imprimerie Royale, Paris

39.

Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654

40.

MacQueen JB (1967) Some methods of classification and analysis of multivariate observations. 5th Berkeley symposium on mathematical statistics and probability, pp 281–297

41.

Milligan GW (1981) A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46(2):187–199

42.

Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

43.

Moulavi D, Jaskowiak PA, Campello RJGB, Zimek A, Sander J (2014) Density-based clustering validation. In: Proceedings of the 14th SIAM International conference on data mining (SDM), Philadelphia, PA, pp 839–847

44.

Naldi M, Carvalho ACPLF, Campello RJGB (2013) Cluster ensemble selection based on relative validity indexes. Data Min Knowl Disc 27(2):259–289

45.

Nemenyi PB (1963) Distribution-free multiple comparisons. PhD thesis, Princeton University

46.

Pakhira MK, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recogn 37:487–501

47.

Pihur V, Datta S, Datta S (2007) Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics 23(13):1607–1615

48.

Pihur V, Datta S, Datta S (2009) Rankaggreg, an R package for weighted rank aggregation. BMC Bioinf 10(1):62

49.

Polikar R (2012) Ensemble learning. In: Ma Y, Zhang C (eds) Ensemble machine learning. Springer, Berlin, pp 1–34

50.

Rabbany R, Takaffoli M, Fagnan J, Zaiane OR, Campello RJGB (2012) Relative validity criteria for community mining algorithms. IEEE/ACM international conference on advances in social networks analysis and mining—ASONAM, pp 258–265

51.

Ratkowsky DA, Lance GN (1978) A criterion for determining the number of groups in a classification. Aust Comput J 10:115–117

52.

Rokach L (2010) Ensemble-based classifiers. Artif Intell Rev 33:1–39

53.

Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

54.

Schalekamp F, van Zuylen A (2009) Rank aggregation: together we’re strong. In: Proceedings of the workshop on algorithm engineering and experiments (ALENEX) SIAM, New York, NY, pp 38–51

55.

Schubert E, Wojdanowski R, Zimek A, Kriegel HP (2012) On evaluation of outlier rankings and outlier scores. In: Proceedings of the 12th SIAM international conference on data mining (SDM), Anaheim, CA, pp 1047–1058

56.

Sheng W, Swift S, Zhang L, Liu X (2005) A weighted sum validity function for clustering with a hybrid niching genetic algorithm. IEEE Trans Syst Man Cybern B 35(6):1156–1167

57.

Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 100(3/4):441–471

58.

Vendramin L, Campello RJGB, Hruschka ER (2009) On the comparison of relative clustering validity criteria. In: Proceedings of the 9th SIAM international conference on data mining (SDM). Sparks, NV, pp 733–744

59.

Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Mining 3(4):209–235

60.

Vendramin L, Jaskowiak PA, Campello RJGB (2013) On the combination of relative clustering validity criteria. In: Proceedings of the 25th international conference on scientific and statistical database management (SSDBM), Baltimore, MD, pp 4:1–4:12

61.

Xu R, Wunsch DC II (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678

62.

Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987

63.

Zimek A, Campello RJGB, Sander J (2013) Ensembles for unsupervised outlier detection: challenges and research questions. ACM SIGKDD Explor 15(1):11–22

64.

Zimek A, Campello RJGB, Sander J (2014) Data perturbation for outlier detection ensembles. In: Proceedings of the 26th international conference on scientific and statistical database management (SSDBM), Aalborg, Denmark, pp 13:1–13:12

Title: On strategies for building effective ensembles of relative clustering validity criteria
Authors: Pablo A. Jaskowiak, Davoud Moulavi, Antonio C. S. Furtado, Ricardo J. G. B. Campello, Arthur Zimek, Jörg Sander
Publication date: 1 May 2016
Publisher: Springer London
Published in: Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-015-0851-6
