nach oben

Advances in Data Analysis and Classification

Erschienen in:

16.11.2016 | Regular Article

Mutual information, phi-squared and model-based co-clustering for contingency tables

verfasst von: Gérard Govaert, Mohamed Nadif

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 3/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.

Vorheriger Artikel Editorial for issue 3/2018

Nächster Artikel Model selection for Gaussian latent block clustering with the integrated classification likelihood

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nur mit Berechtigung zugänglich

Ailem M, Role F, Nadif M (2016) Graph modularity maximization as an effective method for co-clustering text data. Knowl Based Syst 109:160–173CrossRef

Arabie P, Hubert LJ (1990) The bond energy algorithm revisited. IEEE Trans Syst Man Cybern 20:268–274CrossRef

Arabie P, Schleutermann S, Daws J, Hubert L (1988) Marketing applications of sequencing and partitioning of nonsymmetric and/or two-mode matrices. In: Data, expert knowledge and decisions. Springer, pp 215–224

Baier D, Gaul W, Schader M (1997) Two-mode overlapping clustering with applications to simultaneous benefit segmentation and market structuring. In: Classification and knowledge organization. Springer, pp 557–566

Benzecri JP (1973) L’analyse des données, tome 2: l’analyse des correspondances. Dunod, ParisMATH

Bock HH (1979) Simultaneous clustering of objects and variables. In: Tomassone R (ed) Analyse des Données et Informatique. INRIA, Le Chesnay, pp 187–203

Bock HH (1992) A clustering technique for maximizing \(\varphi \)-divergence, noncentrality and discriminating power. In: Analyzing and modeling data and knowledge. Springer, pp 19–36

Bock HH (1994) Information and entropy in cluster analysis. In: Bozdogan H (ed) First US/Japan conference on the frontiers of statistical modeling: an informational approach. Kluwer Academic Publishers, Dordrecht, pp 115–147CrossRef

Bock HH (2004) Convexity-based clustering criteria: theory, algorithms, and applications in statistics. Stat Methods Appl 12(3):293–317MathSciNetMATH

Bryant PG (1988) On characterizing optimization-based clustering criteria. J Classif 5:81–84CrossRef

Castillo W, Trejos J (2002) Two-mode partitioning: review of methods and application of tabu search. In: Bock HH (ed) Classification, clustering, and data analysis. Springer, Heidelberg, pp 43–51CrossRef

Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332MathSciNetCrossRef

Cheng Y, Church GM (2000) Biclustering of expression data. In: ISMB2000, 8th international conference on intelligent systems for molecular biology, vol 8, pp 93–103

Cho H, Dhillon I (2008) Coclustering of human cancer microarrays using minimum sum-squared residue coclustering. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 5(3):385–400CrossRef

Cramer H (1946) Mathematical methods of statistics. Princeton University Press, Princeton

Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407CrossRef

Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B 39(1):1–38MathSciNetMATH

Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 269–274CrossRef

Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1–2):143–175CrossRef

Dhillon IS, Mallela S, Modha DS (2003) Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2003), pp 89–98

Ding C, He X, Simon H (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: SIAM data mining conference

Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 135

Duffy DE, Quiroz AJ (1991) A permutation-based algorithm for block clustering. J Classif 8:65–91MathSciNetCrossRef

Govaert G (1977) Algorithme de classification d’un tableau de contingence. First international symposium on data analysis and informatics. INRIA, Versailles, pp 487–500

Govaert G (1983) Classification croisée. Thèse d’état, Université Paris 6, France

Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24(4):437–458MATH

Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recognit 36:463–473CrossRef

Govaert G, Nadif M (2005) An EM algorithm for the block mixture model. IEEE Trans Pattern Anal Mach Intell 27(4):643–647CrossRef

Govaert G, Nadif M (2007) Clustering of contingency table and mixture model. Eur J Oper Res 183(3):1055–1066MathSciNetCrossRef

Govaert G, Nadif M (2008) Block clustering with Bernoulli mixture models: comparison of different approaches. Comput Stat Data Anal 52(6):3233–3245MathSciNetCrossRef

Govaert G, Nadif M (2010) Latent block model for contingency table. Commun Stat Theory Methods 39(3):416–425MathSciNetCrossRef

Govaert G, Nadif M (2013) Co-clustering. Wiley, New YorkCrossRef

Greenacre M (1988) Clustering the rows and columns of a contingency table. J Classif 5:39–51MathSciNetCrossRef

Gupta N, Aggarwal S (2010) Mib: using mutual information for biclustering gene expression data. Pattern Recognit 43(8):2692–2697CrossRef

Hanczar B, Nadif M (2011) Using the bagging approach for biclustering of gene expression data. Neurocomputing 74(10):1595–1605CrossRef

Hanczar B, Nadif M (2012) Ensemble methods for biclustering tasks. Pattern Recognit 45(11):3938–3949CrossRef

Hanczar B, Nadif M (2013) Precision-recall space to correct external indices for biclustering. In: Proceedings of the 30th international conference on machine learning (ICML-13), pp 136–144

Harris RR, Kanji GK (1983) On the use of minimum chi-square estimation. The Statistician, pp 379–394

Hartigan JA (1972) Direct clustering of a data matrix. JASA 67(337):123–129CrossRef

Hathaway RJ (1986) Another interpretation of the em algorithm for mixture distributions. Stat Probab Lett 4(2):53–56MathSciNetCrossRef

Hofmann T (1999) Probabilistic latent semantic indexing. SIGIR ’99: proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval. ACM, New York, pp 50–57CrossRef

Labiod L, Nadif M (2011a) Co-clustering for binary and categorical data with maximum modularity. In: 2011 IEEE 11th international conference on data mining, pp 1140–1145

Labiod L, Nadif M (2011b) Co-clustering under nonnegative matrix tri-factorization. In: Neural information processing—18th international conference. ICONIP, pp 709–717

Labiod L, Nadif M (2015) A unified framework for data visualization and coclustering. IEEE Trans Neural Netw Learn Syst 26(9):2194–2199MathSciNetCrossRef

Li L, Guo Y, Wu W, Shi Y, Cheng J, Tao S (2012) A comparison and evaluation of five biclustering algorithms by quantifying goodness of biclusters for gene expression data. BioData Min 5(1):1CrossRef

Long B, Zhang Z, Yu P (2005) Co-clustering by block value decomposition. KDD ’05: proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, New York, pp 635–640CrossRef

Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 1(1):24–45CrossRef

Marcotorchino F (1987) Block seriation problems: a unified approach. Appl Stoch Models Data Anal 3:73–91CrossRef

Neal RM, Hinton GE (1998) A view of the em algorithm that justifies incremental, sparse, and other variants. In: Learning in graphical models. Springer, pp 355–368

Neyman J (1949) Contribution to the theory of Chi-square test. Proceedings of the Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 239–273

Pearson K (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond Edinb Dublin Philos Mag J Sci 50(302):157–175CrossRef

Pötzelberger K, Strasser H (1997) Data compression by unsupervised classification

Pötzelberger K, Strasser H (2001) Clustering and quantization by MSP-partitions. Stat Decis Int J Stoch Methods Models 19(4):331–372MathSciNetMATH

Rocci R, Vichi M (2008) Two-mode multi-partitioning. Comput Stat Data Anal 52(4):1984–2003MathSciNetCrossRef

Santamaría R, Quintales L, Therón R (2007) Methods to bicluster validation and comparison in microarray data. In: Intelligent data engineering and automated learning-IDEAL 2007. Springer, pp 780–789

Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617MathSciNetMATH

Tanay A, Sharan R, Shamir R (2005) Biclustering algorithms: a survey. Handb Comput Mol Biol 9(1–20):122–124

Trejos J, Castillo W (2000) Simulated annealing optimization for two-mode partitioning. In: Decker R, Gaul W (eds) Classification and information processing at the turn of the millennium. Springer, Heidelberg, pp 135–142CrossRef

Van Mechelen I, Schepers J (2006) A unifying model for biclustering. In: Compstat 2006-proceedings in computational statistics. Springer, pp 81–88

Van Mechelen I, Bock HH, De Boeck P (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13(5):363–394MathSciNetCrossRef

Vichi M (2001) Double k-means clustering for simultaneous classification of objects and variables. Advances in classification and data analysis. Springer, Heidelberg, pp 43–52

Windham MP (1987) Parameter modification for clustering criteria. J Classif 4:191–214MathSciNetCrossRef

Titel: Mutual information, phi-squared and model-based co-clustering for contingency tables
verfasst von: Gérard Govaert
Mohamed Nadif
Publikationsdatum: 16.11.2016
Verlag: Springer Berlin Heidelberg
Erschienen in: Advances in Data Analysis and Classification / Ausgabe 3/2018
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI: https://doi.org/10.1007/s11634-016-0274-6

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 3/2018

A divisive clustering method for functional data with special consideration of outliers

Archetypal shapes based on landmarks and extension to handle missing data

Signal classification with a point process distance on the space of persistence diagrams

Model selection for Gaussian latent block clustering with the integrated classification likelihood

Outlier detection in interval data

Discovering patterns in time-varying graphs: a triclustering approach

Premium Partner