Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 3/2013

01.11.2013

A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

verfasst von: R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 3/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem and a linear complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as in unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
These methods should be distinguished from those that use hierarchical clustering and partially labeled data to categorize unlabeled data into predefined categories (semi-supervised categorization; e.g., see Skarmeta et al. 2000; Benkhalifa et al. 2001).
 
2
Note that such a reduction may be even more noticeable for higher values of \(m_{ clSize }\).
 
3
In the example of Fig. 3b–d, these nodes would be virtual children of \(\mathbf{C}_1\) and \(\mathbf{C}_2\), which have been omitted for the sake of clarity.
 
4
An alternative setting, in which both objectives in Eqs. (1) and (3) are combined into a single, mixed one, will be discussed further, in Sect. 3.2.
 
5
Notice that \(\varGamma \) has the same properties as \(S\) discussed in Sect. 2.3: it can be computed locally for each node and it is additive w.r.t. the objects in the node, so it is compatible with the aggregation operator (sum) used in the objective function in Eqs. (1) and (5).
 
Literatur
Zurück zum Zitat Ahmed EB, Nabli A, Gargouri F (2012) Shacun: semi-supervised hierarchical active clustering based on ranking constraints. In: Proceedings of the 12th industrial conference on data mining. Springer, Berlin, pp 194–208 Ahmed EB, Nabli A, Gargouri F (2012) Shacun: semi-supervised hierarchical active clustering based on ranking constraints. In: Proceedings of the 12th industrial conference on data mining. Springer, Berlin, pp 194–208
Zurück zum Zitat Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. SIGMOD Rec 28:49–60CrossRef Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. SIGMOD Rec 28:49–60CrossRef
Zurück zum Zitat Bade K, Nürnberger A (2006) Personalized hierarchical clustering. In: IEEE/WIC/ACM international conference on web intelligence (WI) Bade K, Nürnberger A (2006) Personalized hierarchical clustering. In: IEEE/WIC/ACM international conference on web intelligence (WI)
Zurück zum Zitat Bade K, Nürnberger A (2008) Creating a cluster hierarchy under constraints of a partially known hierarchy. In: SIAM international conference on data mining (SDM), Atlanta Bade K, Nürnberger A (2008) Creating a cluster hierarchy under constraints of a partially known hierarchy. In: SIAM international conference on data mining (SDM), Atlanta
Zurück zum Zitat Bade K, Hermkes M, Nürnberger A (2007) User oriented hierarchical information organization and retrieval. In: European conference on machine Learning (ECML), Corvallis, pp 518–526 Bade K, Hermkes M, Nürnberger A (2007) User oriented hierarchical information organization and retrieval. In: European conference on machine Learning (ECML), Corvallis, pp 518–526
Zurück zum Zitat Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained clustering: advances in algorithms applications and theory. CRC Press, Boca Raton Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained clustering: advances in algorithms applications and theory. CRC Press, Boca Raton
Zurück zum Zitat Benkhalifa M, Mouradi A, Bouyakhf H (2001) Integrating wordnet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization. Int J Intell Syst 16:929–947MATHCrossRef Benkhalifa M, Mouradi A, Bouyakhf H (2001) Integrating wordnet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization. Int J Intell Syst 16:929–947MATHCrossRef
Zurück zum Zitat Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: International conference on machine learning (ICML), pp 55–63 Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: International conference on machine learning (ICML), pp 55–63
Zurück zum Zitat Böhm C, Plant C (2008) Hissclu: a hierarchical density-based method for semi-supervised clustering. In: International conference on extending database technology (EDBT) Böhm C, Plant C (2008) Hissclu: a hierarchical density-based method for semi-supervised clustering. In: International conference on extending database technology (EDBT)
Zurück zum Zitat Boudaillier E, Hébrail G (1997) Interactive interpretation of hierarchical clustering. In: Principles of data mining and knowledge discovery, LNCS, vol 1263, Springer, Heidelberg, pp 288–298 Boudaillier E, Hébrail G (1997) Interactive interpretation of hierarchical clustering. In: Principles of data mining and knowledge discovery, LNCS, vol 1263, Springer, Heidelberg, pp 288–298
Zurück zum Zitat Boudaillier E, Hébrail G (1998) Interactive interpretation of hierarchical clustering. Intell Data Anal 2:229–244CrossRef Boudaillier E, Hébrail G (1998) Interactive interpretation of hierarchical clustering. Intell Data Anal 2:229–244CrossRef
Zurück zum Zitat Brecheisen S, Kriegel HP, Kröger P, Pfeifle M (2004) Visually mining through cluster hierarchies. In: SIAM international conference on data mining (SDM) Brecheisen S, Kriegel HP, Kröger P, Pfeifle M (2004) Visually mining through cluster hierarchies. In: SIAM international conference on data mining (SDM)
Zurück zum Zitat Davidson I, Ravi S (2005) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: European conference on principles and practice of knowledge discovery in databases (PKDD) Davidson I, Ravi S (2005) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: European conference on principles and practice of knowledge discovery in databases (PKDD)
Zurück zum Zitat Davidson I, Ravi S (2009) Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18:257–282MathSciNetCrossRef Davidson I, Ravi S (2009) Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18:257–282MathSciNetCrossRef
Zurück zum Zitat Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: European conference on principles and practice of knowledge in databases (PKDD) Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: European conference on principles and practice of knowledge in databases (PKDD)
Zurück zum Zitat Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining (KDD) Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining (KDD)
Zurück zum Zitat Everitt BS, Landau S, Leese M (2001) Cluster analysis, 4th edn. Arnold, London Everitt BS, Landau S, Leese M (2001) Cluster analysis, 4th edn. Arnold, London
Zurück zum Zitat Ferraretti D, Gamberoni G, Lamma E (2009) Automatic cluster selection using index driven search strategy. In: International conference of the Italian association for artificial intelligence (AI*IA) Ferraretti D, Gamberoni G, Lamma E (2009) Automatic cluster selection using index driven search strategy. In: International conference of the Italian association for artificial intelligence (AI*IA)
Zurück zum Zitat Geusebroek JM, Burghouts G, Smeulders A (2005) The Amsterdam library of object images. Int J Comput Vis 61:103–112 Geusebroek JM, Burghouts G, Smeulders A (2005) The Amsterdam library of object images. Int J Comput Vis 61:103–112
Zurück zum Zitat Gilpin S, Davidson I (2011) Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD) Gilpin S, Davidson I (2011) Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Zurück zum Zitat Gupta G, Liu A, Ghosh J (2006) Hierarchical density shaving: a clustering and visualization framework for large biological datasets. In: IEEE ICDM workshop on data mining in bioinformatics (DMB) Gupta G, Liu A, Ghosh J (2006) Hierarchical density shaving: a clustering and visualization framework for large biological datasets. In: IEEE ICDM workshop on data mining in bioinformatics (DMB)
Zurück zum Zitat Gupta G, Liu A, Ghosh J (2010) Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Trans Comput Biol Bioinformatics 7(2):223–237CrossRef Gupta G, Liu A, Ghosh J (2010) Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Trans Comput Biol Bioinformatics 7(2):223–237CrossRef
Zurück zum Zitat Hamasuna Y, Endo Y, Miyamoto S (2012) On agglomerative hierarchical clustering using clusterwise tolerance based pairwise constraints. J Adv Comput Intell Intell Inform 16(1):174–179 Hamasuna Y, Endo Y, Miyamoto S (2012) On agglomerative hierarchical clustering using clusterwise tolerance based pairwise constraints. J Adv Comput Intell Intell Inform 16(1):174–179
Zurück zum Zitat Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATH
Zurück zum Zitat Herbin M, Bonnet N, Vautrot P (2001) Estimation of the number of clusters and influence zones. Pattern Recognit Lett 22(14):1557–1568MATHCrossRef Herbin M, Bonnet N, Vautrot P (2001) Estimation of the number of clusters and influence zones. Pattern Recognit Lett 22(14):1557–1568MATHCrossRef
Zurück zum Zitat Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: International conference on knowledge discovery and data mining (KDD) Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: International conference on knowledge discovery and data mining (KDD)
Zurück zum Zitat Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recognit 45:4370–4388MATHCrossRef Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recognit 45:4370–4388MATHCrossRef
Zurück zum Zitat Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218CrossRef Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218CrossRef
Zurück zum Zitat Jain AK, Dubes RC (1988) Algorithms for clustering data. Englewood Cliffs, Prentice Hall Jain AK, Dubes RC (1988) Algorithms for clustering data. Englewood Cliffs, Prentice Hall
Zurück zum Zitat Kestler H, Kraus J, Palm G, Schwenker F (2006) On the effects of constraints in semi-supervised hierarchical clustering. In: IAPR workshop on artificial neural networks in pattern recognition (ANNPR) Kestler H, Kraus J, Palm G, Schwenker F (2006) On the effects of constraints in semi-supervised hierarchical clustering. In: IAPR workshop on artificial neural networks in pattern recognition (ANNPR)
Zurück zum Zitat Kim HJ, Lee SG (2002) An effective document clustering method using user-adaptable distance metrics. In: ACM symposium on applied computing (SAC) Kim HJ, Lee SG (2002) An effective document clustering method using user-adaptable distance metrics. In: ACM symposium on applied computing (SAC)
Zurück zum Zitat Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference on machine learning (ICML) Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference on machine learning (ICML)
Zurück zum Zitat Kraus JM, Palm G, Kestler HA (2007) On the robustness of semi-supervised hierarchical graph clustering in functional genomics. In: 5th International workshop on mining and learning with graphs (MLG), pp 1–4 Kraus JM, Palm G, Kestler HA (2007) On the robustness of semi-supervised hierarchical graph clustering in functional genomics. In: 5th International workshop on mining and learning with graphs (MLG), pp 1–4
Zurück zum Zitat Kriegel HP, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240CrossRef Kriegel HP, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240CrossRef
Zurück zum Zitat Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: International conference on knowledge discovery and data mining (KDD) Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: International conference on knowledge discovery and data mining (KDD)
Zurück zum Zitat Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: International conference on data mining (ICDM) Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: International conference on data mining (ICDM)
Zurück zum Zitat Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179CrossRef Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179CrossRef
Zurück zum Zitat Miyamoto S, Terami A (2010) Semi-supervised agglomerative hierarchical clustering algorithms with pairwise constraints. In: IEEE international conference on fuzzy systems (FUZZ-IEEE), pp 1–6 Miyamoto S, Terami A (2010) Semi-supervised agglomerative hierarchical clustering algorithms with pairwise constraints. In: IEEE international conference on fuzzy systems (FUZZ-IEEE), pp 1–6
Zurück zum Zitat Naldi M, Campello R, Hruschka E, Carvalho A (2011) Efficiency issues of evolutionary k-means. Appl Soft Comput 11(2):1938–1952CrossRef Naldi M, Campello R, Hruschka E, Carvalho A (2011) Efficiency issues of evolutionary k-means. Appl Soft Comput 11(2):1938–1952CrossRef
Zurück zum Zitat Paulovich F, Nonato L, Minghim R, Levkowitz H (2008) Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph 14(3):564–575CrossRef Paulovich F, Nonato L, Minghim R, Levkowitz H (2008) Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph 14(3):564–575CrossRef
Zurück zum Zitat Sander J, Qin X, Lu Z, Niu N, Kovarsky A (2003) Automatic extraction of clusters from hierarchical clustering representations. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD) Sander J, Qin X, Lu Z, Niu N, Kovarsky A (2003) Automatic extraction of clusters from hierarchical clustering representations. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)
Zurück zum Zitat Skarmeta AG, Bensaid A, Tazi N (2000) Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. Int J Intell Syst 15(7):633–646MATHCrossRef Skarmeta AG, Bensaid A, Tazi N (2000) Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. Int J Intell Syst 15(7):633–646MATHCrossRef
Zurück zum Zitat Struyf J, Džeroski S (2007) Clustering trees with instance level constraints. In: European conference on machine learning (ECML), pp 359–370 Struyf J, Džeroski S (2007) Clustering trees with instance level constraints. In: European conference on machine learning (ECML), pp 359–370
Zurück zum Zitat Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20:25–47MathSciNetMATHCrossRef Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20:25–47MathSciNetMATHCrossRef
Zurück zum Zitat Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19(2):397–418MathSciNetCrossRef Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19(2):397–418MathSciNetCrossRef
Zurück zum Zitat Sun H, Huang J, Han J, Deng H, Zhao P, Feng B (2010) gSkeletonClu: density-based network clustering via structure-connected tree division or agglomeration. In: IEEE international conference on data mining (ICDM) Sun H, Huang J, Han J, Deng H, Zhao P, Feng B (2010) gSkeletonClu: density-based network clustering via structure-connected tree division or agglomeration. In: IEEE international conference on data mining (ICDM)
Zurück zum Zitat Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston
Zurück zum Zitat Wagstaff KL (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Department of Computer Science, Cornell University Wagstaff KL (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Department of Computer Science, Cornell University
Zurück zum Zitat Xiong T, Wang S, Mayers A, Monga E (2011) Semi-supervised parameter-free divisive hierarchical clustering of categorical data. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 265–276 Xiong T, Wang S, Mayers A, Monga E (2011) Semi-supervised parameter-free divisive hierarchical clustering of categorical data. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 265–276
Zurück zum Zitat Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987CrossRef Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987CrossRef
Zurück zum Zitat Yeung KY, Medvedovic M, Bumgarner R (2003) Clustering gene-expression data with repeated measurements. Genome Biol 4(5):R34 Yeung KY, Medvedovic M, Bumgarner R (2003) Clustering gene-expression data with repeated measurements. Genome Biol 4(5):R34
Zurück zum Zitat Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: International conference on knowledge discovery and data mining (WKDD) Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: International conference on knowledge discovery and data mining (WKDD)
Zurück zum Zitat Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10:141–168MathSciNetCrossRef Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10:141–168MathSciNetCrossRef
Zurück zum Zitat Zheng L, Li T (2011) Semi-supervised hierarchical clustering. In: IEEE international conference on data mining (ICDM) Zheng L, Li T (2011) Semi-supervised hierarchical clustering. In: IEEE international conference on data mining (ICDM)
Metadaten
Titel
A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies
verfasst von
R. J. G. B. Campello
D. Moulavi
A. Zimek
J. Sander
Publikationsdatum
01.11.2013
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 3/2013
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-013-0311-4

Weitere Artikel der Ausgabe 3/2013

Data Mining and Knowledge Discovery 3/2013 Zur Ausgabe