Skip to main content
Erschienen in: Advances in Data Analysis and Classification 3/2018

16.11.2016 | Regular Article

Mutual information, phi-squared and model-based co-clustering for contingency tables

verfasst von: Gérard Govaert, Mohamed Nadif

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 3/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
Zurück zum Zitat Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173CrossRef Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173CrossRef
Zurück zum Zitat Arabie P, Hubert LJ (1990) The bond energy algorithm revisited. IEEE Trans Syst Man Cybern 20:268–274CrossRef Arabie P, Hubert LJ (1990) The bond energy algorithm revisited. IEEE Trans Syst Man Cybern 20:268–274CrossRef
Zurück zum Zitat Arabie P, Schleutermann S, Daws J, Hubert L (1988) Marketing applications of sequencing and partitioning of nonsymmetric and/or two-mode matrices. In: Data, expert knowledge and decisions. Springer, pp 215–224 Arabie P, Schleutermann S, Daws J, Hubert L (1988) Marketing applications of sequencing and partitioning of nonsymmetric and/or two-mode matrices. In: Data, expert knowledge and decisions. Springer, pp 215–224
Zurück zum Zitat Baier D, Gaul W, Schader M (1997) Two-mode overlapping clustering with applications to simultaneous benefit segmentation and market structuring. In: Classification and knowledge organization. Springer, pp 557–566 Baier D, Gaul W, Schader M (1997) Two-mode overlapping clustering with applications to simultaneous benefit segmentation and market structuring. In: Classification and knowledge organization. Springer, pp 557–566
Zurück zum Zitat Benzecri JP (1973) L’analyse des données, tome 2: l’analyse des correspondances. Dunod, ParisMATH Benzecri JP (1973) L’analyse des données, tome 2: l’analyse des correspondances. Dunod, ParisMATH
Zurück zum Zitat Bock HH (1979) Simultaneous clustering of objects and variables. In: Tomassone R (ed) Analyse des Données et Informatique. INRIA, Le Chesnay, pp 187–203 Bock HH (1979) Simultaneous clustering of objects and variables. In: Tomassone R (ed) Analyse des Données et Informatique. INRIA, Le Chesnay, pp 187–203
Zurück zum Zitat Bock HH (1992) A clustering technique for maximizing \(\varphi \)-divergence, noncentrality and discriminating power. In: Analyzing and modeling data and knowledge. Springer, pp 19–36 Bock HH (1992) A clustering technique for maximizing \(\varphi \)-divergence, noncentrality and discriminating power. In: Analyzing and modeling data and knowledge. Springer, pp 19–36
Zurück zum Zitat Bock HH (1994) Information and entropy in cluster analysis. In: Bozdogan H (ed) First US/Japan conference on the frontiers of statistical modeling: an informational approach. Kluwer Academic Publishers, Dordrecht, pp 115–147CrossRef Bock HH (1994) Information and entropy in cluster analysis. In: Bozdogan H (ed) First US/Japan conference on the frontiers of statistical modeling: an informational approach. Kluwer Academic Publishers, Dordrecht, pp 115–147CrossRef
Zurück zum Zitat Bock HH (2004) Convexity-based clustering criteria: theory, algorithms, and applications in statistics. Stat Methods Appl 12(3):293–317MathSciNetMATH Bock HH (2004) Convexity-based clustering criteria: theory, algorithms, and applications in statistics. Stat Methods Appl 12(3):293–317MathSciNetMATH
Zurück zum Zitat Bryant PG (1988) On characterizing optimization-based clustering criteria. J Classif 5:81–84CrossRef Bryant PG (1988) On characterizing optimization-based clustering criteria. J Classif 5:81–84CrossRef
Zurück zum Zitat Castillo W, Trejos J (2002) Two-mode partitioning: review of methods and application of tabu search. In: Bock HH (ed) Classification, clustering, and data analysis. Springer, Heidelberg, pp 43–51CrossRef Castillo W, Trejos J (2002) Two-mode partitioning: review of methods and application of tabu search. In: Bock HH (ed) Classification, clustering, and data analysis. Springer, Heidelberg, pp 43–51CrossRef
Zurück zum Zitat Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332MathSciNetCrossRef Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332MathSciNetCrossRef
Zurück zum Zitat Cheng Y, Church GM (2000) Biclustering of expression data. In: ISMB2000, 8th international conference on intelligent systems for molecular biology, vol 8, pp 93–103 Cheng Y, Church GM (2000) Biclustering of expression data. In: ISMB2000, 8th international conference on intelligent systems for molecular biology, vol 8, pp 93–103
Zurück zum Zitat Cho H, Dhillon I (2008) Coclustering of human cancer microarrays using minimum sum-squared residue coclustering. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 5(3):385–400CrossRef Cho H, Dhillon I (2008) Coclustering of human cancer microarrays using minimum sum-squared residue coclustering. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 5(3):385–400CrossRef
Zurück zum Zitat Cramer H (1946) Mathematical methods of statistics. Princeton University Press, Princeton Cramer H (1946) Mathematical methods of statistics. Princeton University Press, Princeton
Zurück zum Zitat Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407CrossRef Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407CrossRef
Zurück zum Zitat Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B 39(1):1–38MathSciNetMATH Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B 39(1):1–38MathSciNetMATH
Zurück zum Zitat Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 269–274CrossRef Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 269–274CrossRef
Zurück zum Zitat Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1–2):143–175CrossRef Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1–2):143–175CrossRef
Zurück zum Zitat Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2003), pp 89–98 Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2003), pp 89–98
Zurück zum Zitat Ding C, He X, Simon H (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: SIAM data mining conference Ding C, He X, Simon H (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: SIAM data mining conference
Zurück zum Zitat Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 135 Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 135
Zurück zum Zitat Govaert G (1977) Algorithme de classification d’un tableau de contingence. First international symposium on data analysis and informatics. INRIA, Versailles, pp 487–500 Govaert G (1977) Algorithme de classification d’un tableau de contingence. First international symposium on data analysis and informatics. INRIA, Versailles, pp 487–500
Zurück zum Zitat Govaert G (1983) Classification croisée. Thèse d’état, Université Paris 6, France Govaert G (1983) Classification croisée. Thèse d’état, Université Paris 6, France
Zurück zum Zitat Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24(4):437–458MATH Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24(4):437–458MATH
Zurück zum Zitat Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recognit 36:463–473CrossRef Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recognit 36:463–473CrossRef
Zurück zum Zitat Govaert G, Nadif M (2005) An EM algorithm for the block mixture model. IEEE Trans Pattern Anal Mach Intell 27(4):643–647CrossRef Govaert G, Nadif M (2005) An EM algorithm for the block mixture model. IEEE Trans Pattern Anal Mach Intell 27(4):643–647CrossRef
Zurück zum Zitat Govaert G, Nadif M (2007) Clustering of contingency table and mixture model. Eur J Oper Res 183(3):1055–1066MathSciNetCrossRef Govaert G, Nadif M (2007) Clustering of contingency table and mixture model. Eur J Oper Res 183(3):1055–1066MathSciNetCrossRef
Zurück zum Zitat Govaert G, Nadif M (2008) Block clustering with Bernoulli mixture models: comparison of different approaches. Comput Stat Data Anal 52(6):3233–3245MathSciNetCrossRef Govaert G, Nadif M (2008) Block clustering with Bernoulli mixture models: comparison of different approaches. Comput Stat Data Anal 52(6):3233–3245MathSciNetCrossRef
Zurück zum Zitat Govaert G, Nadif M (2010) Latent block model for contingency table. Commun Stat Theory Methods 39(3):416–425MathSciNetCrossRef Govaert G, Nadif M (2010) Latent block model for contingency table. Commun Stat Theory Methods 39(3):416–425MathSciNetCrossRef
Zurück zum Zitat Gupta N, Aggarwal S (2010) Mib: using mutual information for biclustering gene expression data. Pattern Recognit 43(8):2692–2697CrossRef Gupta N, Aggarwal S (2010) Mib: using mutual information for biclustering gene expression data. Pattern Recognit 43(8):2692–2697CrossRef
Zurück zum Zitat Hanczar B, Nadif M (2011) Using the bagging approach for biclustering of gene expression data. Neurocomputing 74(10):1595–1605CrossRef Hanczar B, Nadif M (2011) Using the bagging approach for biclustering of gene expression data. Neurocomputing 74(10):1595–1605CrossRef
Zurück zum Zitat Hanczar B, Nadif M (2012) Ensemble methods for biclustering tasks. Pattern Recognit 45(11):3938–3949CrossRef Hanczar B, Nadif M (2012) Ensemble methods for biclustering tasks. Pattern Recognit 45(11):3938–3949CrossRef
Zurück zum Zitat Hanczar B, Nadif M (2013) Precision-recall space to correct external indices for biclustering. In: Proceedings of the 30th international conference on machine learning (ICML-13), pp 136–144 Hanczar B, Nadif M (2013) Precision-recall space to correct external indices for biclustering. In: Proceedings of the 30th international conference on machine learning (ICML-13), pp 136–144
Zurück zum Zitat Harris RR, Kanji GK (1983) On the use of minimum chi-square estimation. The Statistician, pp 379–394 Harris RR, Kanji GK (1983) On the use of minimum chi-square estimation. The Statistician, pp 379–394
Zurück zum Zitat Hartigan JA (1972) Direct clustering of a data matrix. JASA 67(337):123–129CrossRef Hartigan JA (1972) Direct clustering of a data matrix. JASA 67(337):123–129CrossRef
Zurück zum Zitat Hathaway RJ (1986) Another interpretation of the em algorithm for mixture distributions. Stat Probab Lett 4(2):53–56MathSciNetCrossRef Hathaway RJ (1986) Another interpretation of the em algorithm for mixture distributions. Stat Probab Lett 4(2):53–56MathSciNetCrossRef
Zurück zum Zitat Hofmann T (1999) Probabilistic latent semantic indexing. SIGIR ’99: proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 50–57CrossRef Hofmann T (1999) Probabilistic latent semantic indexing. SIGIR ’99: proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 50–57CrossRef
Zurück zum Zitat Labiod L, Nadif M (2011a) Co-clustering for binary and categorical data with maximum modularity. In: 2011 IEEE 11th international conference on data mining, pp 1140–1145 Labiod L, Nadif M (2011a) Co-clustering for binary and categorical data with maximum modularity. In: 2011 IEEE 11th international conference on data mining, pp 1140–1145
Zurück zum Zitat Labiod L, Nadif M (2011b) Co-clustering under nonnegative matrix tri-factorization. In: Neural information processing—18th international conference. ICONIP, pp 709–717 Labiod L, Nadif M (2011b) Co-clustering under nonnegative matrix tri-factorization. In: Neural information processing—18th international conference. ICONIP, pp 709–717
Zurück zum Zitat Labiod L, Nadif M (2015) A unified framework for data visualization and coclustering. IEEE Trans Neural Netw Learn Syst 26(9):2194–2199MathSciNetCrossRef Labiod L, Nadif M (2015) A unified framework for data visualization and coclustering. IEEE Trans Neural Netw Learn Syst 26(9):2194–2199MathSciNetCrossRef
Zurück zum Zitat Li L, Guo Y, Wu W, Shi Y, Cheng J, Tao S (2012) A comparison and evaluation of five biclustering algorithms by quantifying goodness of biclusters for gene expression data. BioData Min 5(1):1CrossRef Li L, Guo Y, Wu W, Shi Y, Cheng J, Tao S (2012) A comparison and evaluation of five biclustering algorithms by quantifying goodness of biclusters for gene expression data. BioData Min 5(1):1CrossRef
Zurück zum Zitat Long B, Zhang Z, Yu P (2005) Co-clustering by block value decomposition. KDD ’05: proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, New York, pp 635–640CrossRef Long B, Zhang Z, Yu P (2005) Co-clustering by block value decomposition. KDD ’05: proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, New York, pp 635–640CrossRef
Zurück zum Zitat Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 1(1):24–45CrossRef Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 1(1):24–45CrossRef
Zurück zum Zitat Marcotorchino F (1987) Block seriation problems: a unified approach. Appl Stoch Models Data Anal 3:73–91CrossRef Marcotorchino F (1987) Block seriation problems: a unified approach. Appl Stoch Models Data Anal 3:73–91CrossRef
Zurück zum Zitat Neal RM, Hinton GE (1998) A view of the em algorithm that justifies incremental, sparse, and other variants. In: Learning in graphical models. Springer, pp 355–368 Neal RM, Hinton GE (1998) A view of the em algorithm that justifies incremental, sparse, and other variants. In: Learning in graphical models. Springer, pp 355–368
Zurück zum Zitat Neyman J (1949) Contribution to the theory of Chi-square test. Proceedings of the Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 239–273 Neyman J (1949) Contribution to the theory of Chi-square test. Proceedings of the Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 239–273
Zurück zum Zitat Pearson K (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinb Dublin Philos Mag J Sci 50(302):157–175CrossRef Pearson K (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinb Dublin Philos Mag J Sci 50(302):157–175CrossRef
Zurück zum Zitat Pötzelberger K, Strasser H (1997) Data compression by unsupervised classification Pötzelberger K, Strasser H (1997) Data compression by unsupervised classification
Zurück zum Zitat Pötzelberger K, Strasser H (2001) Clustering and quantization by MSP-partitions. Stat Decis Int J Stoch Methods Models 19(4):331–372MathSciNetMATH Pötzelberger K, Strasser H (2001) Clustering and quantization by MSP-partitions. Stat Decis Int J Stoch Methods Models 19(4):331–372MathSciNetMATH
Zurück zum Zitat Santamaría R, Quintales L, Therón R (2007) Methods to bicluster validation and comparison in microarray data. In: Intelligent data engineering and automated learning-IDEAL 2007. Springer, pp 780–789 Santamaría R, Quintales L, Therón R (2007) Methods to bicluster validation and comparison in microarray data. In: Intelligent data engineering and automated learning-IDEAL 2007. Springer, pp 780–789
Zurück zum Zitat Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617MathSciNetMATH Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617MathSciNetMATH
Zurück zum Zitat Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. Handb Comput Mol Biol 9(1–20):122–124 Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. Handb Comput Mol Biol 9(1–20):122–124
Zurück zum Zitat Trejos J, Castillo W (2000) Simulated annealing optimization for two-mode partitioning. In: Decker R, Gaul W (eds) Classification and information processing at the turn of the millennium. Springer, Heidelberg, pp 135–142CrossRef Trejos J, Castillo W (2000) Simulated annealing optimization for two-mode partitioning. In: Decker R, Gaul W (eds) Classification and information processing at the turn of the millennium. Springer, Heidelberg, pp 135–142CrossRef
Zurück zum Zitat Van Mechelen I, Schepers J (2006) A unifying model for biclustering. In: Compstat 2006-proceedings in computational statistics. Springer, pp 81–88 Van Mechelen I, Schepers J (2006) A unifying model for biclustering. In: Compstat 2006-proceedings in computational statistics. Springer, pp 81–88
Zurück zum Zitat Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394MathSciNetCrossRef Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394MathSciNetCrossRef
Zurück zum Zitat Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. Advances in classification and data analysis. Springer, Heidelberg, pp 43–52 Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. Advances in classification and data analysis. Springer, Heidelberg, pp 43–52
Metadaten
Titel
Mutual information, phi-squared and model-based co-clustering for contingency tables
verfasst von
Gérard Govaert
Mohamed Nadif
Publikationsdatum
16.11.2016
Verlag
Springer Berlin Heidelberg
Erschienen in
Advances in Data Analysis and Classification / Ausgabe 3/2018
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-016-0274-6

Weitere Artikel der Ausgabe 3/2018

Advances in Data Analysis and Classification 3/2018 Zur Ausgabe

Premium Partner