A structured view on pattern mining-based biclustering
Introduction
The clustering of data matrices groups rows according to their overall values across columns. However, in real-world contexts, the correlation of a subset of rows is typically only significant and meaningful for a subset of the overall columns [114]. Biclustering seeks to find sub-matrices (biclusters), subsets of rows with a coherent pattern across subsets of columns. Illustrating, given a matrix that captures the expression of a set of genes (rows) across a set of conditions (columns), a bicluster defines a group of genes with coherent expression for a subset of conditions. The biclustering task in this domain is critical for the discovery of putative transcriptional modules of genes that participate in a cellular process that is only active in specific conditions [46], [40]. Table 1 provides additional applications in biomedical and social domains, synthesizing the meaning and relevance of discovering biclusters for pattern recognition.
Recent findings from biomedical domains show that exhaustive and flexible approaches to biclustering provide an unprecedented opportunity for an unbiased assessment of the native structure and modular organization of biological networks [14], new insights on the molecular units involved in cellular functions [60], [111], and discriminative high-order combinations of single-nucleotide polymorphisms (SNPs) [42]. However, due to the complexity of the biclustering task1, most of the existing algorithms are either based on greedy or stochastic approaches, potentially producing sub-optimal and constrained biclustering solutions [82], [67]. Illustrative constraints that prevent the flexibility of the biclustering task include the search for a fixed number of biclusters, non-overlapping structures and biclusters with differential-values only (binary settings) or sequential constraints [81], [82], [119]. In this context, the survey of efficient optimal searches for flexible biclustering scenarios is the target task in this work.
The attempts to perform biclustering based on pattern mining (PM) techniques [86], [111], [97], referred in this work as PM-based biclustering, show solid results for efficient and flexible exhaustive searches. In fact, since pattern mining research is driven by scalability requirements [54], its integration with biclustering defines a new promising direction. Contributions of PM-based approaches for biclustering include:
- •
efficient exhaustive searches: PM algorithms as-is allow for the efficient analysis of large matrices (over 10.000 400 elements). Additional PM principles can be used to foster scalability, including searches in distributed/partitioned data settings or targeting approximate patterns [54], [52].
- •
dealing with missing and noisy values [62], [63]: PM methods can mine transactions with varying length, and therefore a specific element from the input matrix can be associated with zero or multiple values, allowing the removal or bounded estimations of a missing or noisy value.
- •
inherent orientation to learn constant models, yet recently extended to also learn additive, multiplicative, symmetric, order-preserving and plaid models [62], [60], [63];
- •
capturing biclusters from patterns with multiple levels of expression [96], [101]. This contrasts with the majority of existing approaches that rely on differential values or fixed coherency strength [119];
- •
flexible structures of biclusters (arbitrary positioning of biclusters) and searches (no need to fix the number of biclusters apriori) [96], [111];
- •
annotating the significance of biclusters with PM principles to assess the relevance of patterns [72];
- •
easy extension for multi-class settings using discriminative PM or classification rules [43], [95];
- •
easy incorporation of PM-based constraints that can be effectively used to guide the search, promoting both efficiency, by pruning the search space, and a focus on non-trivial biclusters [116].
These properties of PM-based biclustering approaches are critical to tackle the problems highlighted in Table 1. Although the latest biclustering advances for pattern recognition are increasingly deterministic [89], [110], [128], [137], [47], [35], [131], they fail to meet several of the enumerated properties of PM-based biclustering. Table 2 pinpoints the benefits of using PM-based biclustering for pattern recognition.
Despite these listed potentialities, recent surveys on biclustering [46], [40], [28], [114] fail to explore the opportunities associated with PM-based biclustering. Additionally, the existing efforts towards PM-based biclustering provide critical principles that are not yet integrated [14], [86], [111]. As such, there is still space for new approaches that benefit from the integration of principles provided by these existing contributions as well as from other fields of research. In this context, this work provides three major contributions:
- •
motivates, formalizes and provides a qualitative and quantitative assessment of the state-of-the-art algorithms for PM-based biclustering;
- •
offers a structured view on how to define, parameterize and extend PM-based biclustering by coherently integrating the available yet dispersed contributions;
- •
further surveys PM principles as well as adequate preprocessing and postprocessing criteria to guarantee the robustness, flexibility and scalability of PM-based biclustering across domains.
The paper is organized as follows. The remainder of this section provides background on pattern mining and biclustering, and surveys the contributions from existing PM-based biclustering approaches. Section 2 introduces a consistent set of principles to guide the definition of PM-based biclustering approaches. In particular, 2.1 Mining options: discovery of biclusters using pattern mining, 2.2 Mapping options: preprocessing input data, 2.3 Closing options: postprocessing biclustering solutions cover principles according to three major decision dimensions (mining, mapping and closing), and Section 2.4 compares the behavior of state-of-the-art PM-based biclustering approaches and proposes a set of principles to address their current challenges. Section 3 provides initial empirical evidence of the relevance of the proposed principles. Finally, the implications of this work are synthesized.
Pattern mining: Frequent patterns are itemsets, rules, subsequences, or substructures that appear in a dataset with frequency no less than a user-specified threshold. Let be a finite set of items, and P be an itemset . A transaction t is a pair with . An itemset database D over is a finite set of transactions . A transaction contains , denoted , if . The coverage ΦP of an itemset P is the set of all transactions in D in which the itemset P occurs: . The support of an itemset P in D, denoted supP, can either be absolute, being its coverage size , or a relative threshold given by .
An association rule is defined as an implication of the form , where and . The left-hand side of the rule is named antecedent and the right-hand side consequent. Given an itemset database D, the support of a rule, , is given by , and the confidence of a rule, , is given by . Confidence reveals the strength of the rule (the conditional probability that a transaction that contains the items in the antecedent also contains the items in consequent). Definition 1.1 Given an itemset database D and a minimum support and confidence thresholds, θ and δ: frequent itemset mining (FIM) problem consists of computing the set ; association rule mining aims to compute .
A frequent itemset or a pattern is an itemset with . To illustrate these concepts, consider the following itemset database, , with =12. We have = and ==. An illustrative rule in Dex is with =0.5 and =0.75. For θ=4, the FIM tasks returns .
Consider two itemsets P and , where , and a predicate M. M is monotonic when and anti-monotonic when . FIM approaches rely on these properties: the support of P is bounded by the support of and, if is not frequent, then P is also not frequent. Table 3 shows three major search variants that rely on these properties.
Since FIM proposal [2], multiple extensions have been proposed, including principles to enhance the scalability of pattern miners, and condensed and approximate pattern representations [24], [54].
Pattern mining has been additionally applied over structured datasets, leading to contributions in different fields, including sequential pattern mining [79], graph mining [129] and cube computation [55].
Biclustering: Biclustering allows the discovery of subspaces, each defining a subset of rows that show a coherent pattern that is observed for a subset of the overall columns. Definition 1.2 Given a matrix, A=(X,Y), with a set of rows X=, a set of columns Y=, and elements relating row i and column j: A bicluster is a submatrix of A, where is a subset of rows and is a subset of columns; The biclustering task is to identify a structure of biclusters such that each bicluster satisfies specific criteria of homogeneity and significance.
The homogeneity criteria is commonly guaranteed through the use of a merit function to guide the search [98]. An illustrative merit function is the variance of values in the rows or columns in the bicluster. Merit functions can either define the homogeneity of each bicluster (intra-bicluster homogeneity) or the homogeneity of a set of biclusters (inter-bicluster homogeneity), allowing some biclusters to deviate from the expected homogeneity as long as the overall criterion is preserved. The merit function is the simplest way to affect the coherency, quality and structure. The coherency of a bicluster is defined by the observed correlation of values (Definition 1.3). Biclusters can follow dense, constant, additive, multiplicative, plaid or order-preserving coherencies, either across rows or columns [82]. The quality of a bicluster is defined by the type and amount of accommodated noise. The structure is defined by the number,2 size and positioning of biclusters. Flexible structures are characterized by an arbitrary-high set of (possibly overlapping) biclusters. The statistical significance of a bicluster determines how its probability of occurrence deviates from expectations. Following the taxonomy proposed by Madeira and Oliveira [82], Table 4 synthesizes the main biclustering approaches acccording to their search paradigm. Definition 1.3 Let the elements in a bicluster have coherency across rows given by , where kj is the expected value for column j, γi is the adjustment for row i, and ηij is the noise factor. Given a dataset A and a specific coherency strength , where +. The γ factors define the coherency assumption: constant when γ=0, multiplicative if aij is better described by , and additive otherwise. A plaid assumption considers the cumulative contributions from multiple biclusters on areas where their rows and columns overlap.
PM-based biclustering: While traditional biclustering approaches rely on flexible merit functions to guide the space exploration, PM-based approaches require these functions to be defined in terms of support and, eventually, confidence or other interestingness metrics. This restriction enables a scalable exhaustive space search that produces an arbitrarily high number of biclusters within a flexible structure. Definition 1.4 Let A be a matrix whose values in are assigned to a set of items . A bicluster under a constant model can either follow: an overall orientation where ; a column-based orientation where aij=kj and ; or a row-based orientation where and . A bicluster following an additive (or multiplicative) model has (or ), where and define the column and row contributions. A bicluster under a symmetric model either considers symmetries on rows or columns , where . Definition 1.5 Given a matrix A whose elements are the concatenation of the observed values with their column (or row) indexes. Let ΨP of an itemset P in A be its set of indexes. set of biclusters can be derived from a set of frequent itemsets by mapping =Bk, where Bk=, to compose biclusters with coherency across rows, or = for column-coherency.
Two classes of PM-based biclustering approaches can be considered: (1) a first class targeting discrete matrices by using as-is pattern miners, and (2) a second class targeting numeric matrices by extending methods based on the introduced monotonic (or Apriori) property [2]. The first class of methods rely on an itemization step followed by the application of FIM under a low support threshold. The itemization step maps a real-value or discrete matrix into an itemset database. For real-value matrices, normalization and discretization procedures are applied. Then, the discrete value of each element is concatenated with its column index. Each transaction of the target itemset database corresponds to a row with these new values. FIM is then applied over this database to mine frequent patterns for composing biclusters with coherency across rows. The second class of methods relies on variants of the FIM task to learn frequent patterns directly from the real-valued matrix. In both classes, the coherency strength δ is implicitly defined by the number of items or the maximum allowed distance. Biclusters with coherency across columns can be mined using the transpose matrix. Finally, biclusters with coherent values overall can be discovered by mining one item (or range of values) at a time. Fig. 1 illustrates how to deliver these different types of biclusters using frequent patterns when considering the constant model.
To our knowledge, BicPAM [62], BiModule [96], DeBi [111], Bellay׳s et al. [14], GenMiner [86] and BiP [60] are the state-of-the-art methods for the first class of PM-based biclustering. BiModule [96], [97] allows a parameterized multi-value itemization of the input matrix to discover constant biclusters derived from (closed) frequent patterns using the LCM miner [125]. DeBi [111] derives biclusters from (maximal) frequent patterns mined over binarized matrices using the MAFIA miner [22], and places key post-processing principles to adjust them in order to guarantee their statistical significance. The recently proposed BicPAM [62], parameterized with the F2G miner [65] by default, extends the constant assumption of previous approaches to find biclusters with symmetric, additive and multiplicative factors by performing iterative corrections on the input matrix. BicPAM also surpasses discretization problems by introducing the possibility to assign multiple discrete values to a single element, and offers new strategies to robustly handle noise and missing values. Bellay׳s et al. method [14] uses the Apriori miner [2] with additional principles to evaluate the functional coherency of the discovered biclusters against the background noise. This is one of diverse PM-based attempts to exhaustively discover dense biclusters in either unweighted networks [13], [90], [133], [80] or, more interestingly, in scored networks [32], [30]. GenMiner [86] includes external knowledge within the input matrix to derive biclusters from association rules that relate annotations (external grouping of rows or columns) with clusters derived from (closed) frequent patterns using CLOSE [102]. BiP [60] is prepared to discover plaid models by relying on noise-tolerant association rules for the recovery of apparent noisy areas due to the presence of cumulative effects on the overlapping areas between biclusters.
The itemization step is optional for the second class of methods [8]. To our knowledge, RAP [101], RCB discovery [8] and ET-bicluster [52] are state-of-the-art methods here. RAP [101] plugs an adapted range-based metric to mine constant biclusters on rows (or columns), while RCB discovery targets biclusters with constant values overall [8]. ET-bicluster extends the previous approaches to discover noisy biclusters, although an exhaustive enumeration of biclusters is not guaranteed [52]. Alternative support metrics with dedicated Apriori-based searches have been additionally proposed [69], [115], [53].
Section snippets
PM-based biclustering
We propose a structured view of PM-based biclustering according to a set of dimensions of decision. We rely on state-of-the-art literature to characterize each dimension. These dimensions gather principles on different steps with impact on the biclusters’ type, structure and quality, as illustrated in Fig. 2, Fig. 3. Throughout this paper we define a set of principles for each step.
Different options for PM-based biclustering can be grouped according to its three major steps: mapping
Performance evaluation of PM-based biclustering approaches
This section evaluates the performance of PM-based biclustering approaches. We first describe the quality evaluation methodology and then present preliminary results on synthetic and real data.
Conclusions
This work provides a structure view on pattern mining-based approaches to biclustering as they are increasingly positioned as the means to perform exhaustive searches under relaxed conditions (flexible structures of biclusters with parameterizable coherency and quality) with heightened efficiency. In this context, this work surveys and integrates the contributions of existing PM-based biclustering approaches, evaluates their performance, and discusses their relevance for pattern recognition
Conflict of interest
None declared.
Acknowledgments
This work was supported by Fundação para a Ciência e a Tecnologia under the projects UID/CEC/50021/2013 and the PhD grant SFRH/BD/75924/2011 to RH.
Rui Henriques received a M.Sc. degree in computer science and engineering from Instituto Superior Tècnico (IST), Universidade de Lisboa. He is developing his Ph.D. studies in the field of learning from high-dimensional and structured data at IST and INESC-ID. He had received distinctions for his academic achievements by IST between 2006 and 2008, and a National Award for his merits by Caixa Geral de Depósitos, in 2009. He has also been a Business Analyst at McKinsey with wide exposure to
References (137)
- et al.
A tree projection algorithm for generation of frequent item sets
J. Parallel Distrib. Comput.
(2001) - et al.
Enhanced soft subspace clustering integrating within-cluster and between-cluster information
Pattern Recognit.
(2010) - et al.
Reviewa gentle introduction to imputation of missing values
J. Clin. Epidemiol.
(2006) - et al.
A convergence theorem for the fuzzy subspace clustering (fsc) algorithm
Pattern Recognit.
(2008) - et al.
Ensemble methods for biclustering tasks
Pattern Recognit.
(2012) - et al.
Multi-objective evolutionary biclustering of gene expression data
Pattern Recognit.
(2006) - et al.
Mining association rules between sets of items in large databases
SIGMOD Rec.
(1993) - H.A. Ahmed, P. Mahanta, D.K. Bhattacharyya, J.K. Kalita, A. Ghosh, Intersected coexpressed subcube miner: An effective...
- Faris Alqadah, Joel S. Bader, Rajul Anand, Chandan K. Reddy, Query-based biclustering using formal concept analysis,...
- et al.
Gene association analysisa survey of frequent pattern mining from gene expression data
Brief. Bioinform.
(2010)
Machine learning and knowledge discovery in databases
Binary particle swarm optimization based biclustering of web usage data
CoRR
Bicata biclustering analysis toolbox
Bioinformatics
Efficiently mining long patterns from databases
SIGMOD Rec.
Putting genetic interactions in context through a global modular decomposition
Genome Res.
Discovering local structure in gene expression datathe order-preserving submatrix problem
RECOMB
Characterizing gene sets with FuncAssociate
Bioinformatics
Clustering formal concepts to discover biologically relevant knowledge from gene expression data
In Silico Biol.
Comparative analysis of biclustering algorithms
Bioinformatics and Computational Biology
Biclustering EEG data from epileptic patients treated with vagus nerve stimulation
Integrated analysis of gene expression by association rules discovery
BMC Bioinform.
Towards a classification approach using meta-biclusteringimpact of discretization in the analysis of expression time series
J. Integr. Bioinf.
Simultaneous clustering: a survey, Pattern Recognition and Machine Intelligence
Biclustering of expression data
Intelligent Systems for Molecular Biology
Module discovery by exhaustive search for densely connected, co-expressed regions in biomolecular interaction networks
PLoS One
Mining gene expression databases for association rules
Bioinformatics
Inferring cancer subnetwork markers using density-constrained biclustering
Bioinformatics
Applying biclustering to perform collaborative filtering
Intell. Syst. Des. Appl.
A comparative analysis of biclustering algorithms for gene expression data
Brief. Bioinf.
High-order SNP combinations associated with complex diseasesefficient discovery, statistical power and functional interactions
Plos One
Survey on biclustering of gene expression data
Biological Knowledge Discovery Handbook
Analyzing microarray data using quantitative association rules
Bioinformatics
Cited by (0)
Rui Henriques received a M.Sc. degree in computer science and engineering from Instituto Superior Tècnico (IST), Universidade de Lisboa. He is developing his Ph.D. studies in the field of learning from high-dimensional and structured data at IST and INESC-ID. He had received distinctions for his academic achievements by IST between 2006 and 2008, and a National Award for his merits by Caixa Geral de Depósitos, in 2009. He has also been a Business Analyst at McKinsey with wide exposure to real-life projects.
Cláudia Antunes received her Ph.D. from Instituto Superior Tècnico (IST, University of Lisbon, Portugal) in the domain of data mining and machine learning, proposing new methods to deal with temporal data, in particular for mining event sequential patterns. She is currently a Professor at DEI department at IST and the scientific coordinator of two projects funded by FCT in the areas of domain-driven data mining and educational data mining. Clàudia has been working on methods for general pattern mining, from transactional to structured data. Her main interests are centered on mining complex knowledge from complex data, with emphasis on the incorporation of background knowledge in the pattern mining process.
Sara C. Madeira received a (5-year) B.Sc. degree in computer science from the University of Beira Interior, Covilh, Portugal, in 2000, and the M.Sc. and Ph.D. degrees in computer science and engineering (CSE) at Instituto Superior Tcnico (IST), Technical University of Lisbon, in 2002 and 2008. She is currently an Assistant Professor, at the CSE department at IST, and a Senior Researcher at INESC-ID, Lisbon. Her research interests include algorithms and data structures, data mining, machine learning, bioinformatics and medical informatics.