Published in: Knowledge and Information Systems 2/2016

01.02.2016 | Regular Paper

HICC: an entropy splitting-based framework for hierarchical co-clustering

Authors: Wei Cheng, Xiang Zhang, Feng Pan, Wei Wang

Abstract

Two-dimensional contingency tables, or co-occurrence matrices, arise frequently in important applications such as text analysis and web-log mining. As a fundamental research topic, co-clustering aims to generate a meaningful partition of the contingency table that reveals hidden relationships between rows and columns. Traditional co-clustering algorithms usually produce a predefined number of flat partitions of rows and columns, which does not reveal the relationships among clusters. To address this limitation, hierarchical co-clustering algorithms have recently attracted considerable research interest. Although successful in various applications, existing hierarchical co-clustering algorithms are usually based on heuristics and lack a solid theoretical foundation. In this paper, we present a new co-clustering algorithm, HICC, with a solid theoretical foundation. It simultaneously constructs a hierarchical structure of both row and column clusters that retains sufficient mutual information between the rows and columns of the contingency table. An efficient and effective greedy algorithm is developed, which grows the co-cluster hierarchy by successively performing the row-wise or column-wise split that yields the maximal mutual information gain. Extensive experiments on both synthetic and real datasets demonstrate that our algorithm reveals essential relationships among row (and column) clusters and achieves better clustering precision than existing algorithms. Moreover, experiments on real data show that HICC can effectively reveal hidden relationships between rows and columns in the contingency table.
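The greedy split step described in the abstract can be illustrated with a short sketch. The Python code below is a minimal, hypothetical illustration rather than the authors' implementation: it aggregates the joint distribution over the current row and column clusters, measures the mutual information that co-clustering retains, and keeps the candidate row split with the largest gain. The 2-means bisection used to propose candidate splits is a placeholder heuristic (the paper defines its own split procedure), and the column-wise step is symmetric.

import numpy as np
from sklearn.cluster import KMeans

def cluster_joint(P, row_labels, col_labels):
    # Aggregate the joint distribution P(x, y) into P(row cluster, column cluster).
    R = np.eye(row_labels.max() + 1)[row_labels]   # n x r one-hot indicator
    C = np.eye(col_labels.max() + 1)[col_labels]   # m x c one-hot indicator
    return R.T @ P @ C

def mutual_information(Q):
    # I(R; C) in bits for a (possibly unnormalized) joint table Q over cluster pairs.
    Q = Q / Q.sum()
    outer = Q.sum(axis=1, keepdims=True) @ Q.sum(axis=0, keepdims=True)
    nz = Q > 0
    return float((Q[nz] * np.log2(Q[nz] / outer[nz])).sum())

def greedy_row_split(P, row_labels, col_labels):
    # Try bisecting each row cluster; return the relabeling with the largest
    # mutual-information gain (a column-wise split step would mirror this).
    base = mutual_information(cluster_joint(P, row_labels, col_labels))
    best_gain, best_labels = 0.0, row_labels
    for k in np.unique(row_labels):
        members = np.where(row_labels == k)[0]
        if len(members) < 2:
            continue
        # Placeholder split heuristic: 2-means on the rows' conditional distributions.
        cond = P[members] / np.maximum(P[members].sum(axis=1, keepdims=True), 1e-12)
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(cond)
        if halves.min() == halves.max():
            continue  # degenerate split, skip
        cand = row_labels.copy()
        cand[members[halves == 1]] = row_labels.max() + 1  # open a new cluster
        gain = mutual_information(cluster_joint(P, cand, col_labels)) - base
        if gain > best_gain:
            best_gain, best_labels = gain, cand
    return best_labels, best_gain

In an outer loop, one would repeatedly apply such row-wise and column-wise split steps (choosing whichever yields the larger gain) until the desired hierarchy depth or an information-retention criterion is reached.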

Metadata
Title
HICC: an entropy splitting-based framework for hierarchical co-clustering
Authors
Wei Cheng
Xiang Zhang
Feng Pan
Wei Wang
Publication date
01.02.2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2016
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-015-0823-x
