Elsevier

Neural Networks

Volume 24, Issue 7, September 2011, Pages 709-716
Neural Networks

BARTMAP: A viable structure for biclustering

https://doi.org/10.1016/j.neunet.2011.03.020Get rights and content

Abstract

Clustering has been used extensively in the analysis of high-throughput messenger RNA (mRNA) expression profiling with microarrays. Furthermore, clustering has proven elemental in microRNA expression profiling, which demonstrates enormous promise in the areas of cancer diagnosis and treatment, gene function identification, therapy development and drug testing, and genetic regulatory network inference. However, such a practice is inherently limited due to the existence of many uncorrelated genes with respect to sample or condition clustering, or many unrelated samples or conditions with respect to gene clustering. Biclustering offers a solution to such problems by performing simultaneous clustering on both dimensions, or automatically integrating feature selection to clustering without any prior information, so that the relations of clusters of genes (generally, features) and clusters of samples or conditions (data objects) are established. However, the NP-complete computational complexity raises a great challenge to computational methods for identifying such local relations. Here, we propose and demonstrate that a neural-based classifier, ARTMAP, can be modified to perform biclustering in an efficient way, leading to a biclustering algorithm called Biclustering ARTMAP (BARTMAP). Experimental results on multiple human cancer data sets show that BARTMAP can achieve clustering structures with higher qualities than those achieved with other commonly used biclustering or clustering algorithms, and with fast run times.

Introduction

Clustering Bezdek (1981), Oliveira and Pedrycz (2007), Sato-Ilic and Jain (2006), Gokcay and Principe (2002), Xu and Wunsch (2005) and Xu and Wunsch (2009) has been used extensively in the analysis of high-throughput messenger RNA (mRNA) or microRNA expression profiling with microarrays, which demonstrate enormous promise in the areas of cancer diagnosis and treatment, gene function identification, therapy development and drug testing, and genetic regulatory network inference (Eisen et al., 1998, Golub et al., 1999, Jiang et al., 2004, McLachlan et al., 2004, Moreau et al., 2002, Shamir and Sharan, 2002, Xu and Wunsch, 2010, Xu and Wunsch, 2009, Xu and Wunsch, 2005). Usually, the expression levels of many genes are measured across a relatively small set of conditions or samples, and the obtained gene expression data are organized as a data matrix with rows corresponding to genes and columns corresponding to samples or conditions.

Let a gene expression data set be represented as a data matrix E=(G,S), with G={g1,,gN} representing a set of N genes or rows and S={s1,,sM} representing a set of M samples (conditions)1 or columns (see Table 1). An element eijE then represents the expression level of gene i under condition j. A gene or row cluster is a subset of rows defined on all columns, denoted as CXS=(X,S), where XG is a subset of genes. Similarly, a sample or column cluster is a subset of columns defined across all rows, denoted as CGY=(G,Y), where YS is a subset of samples.

Such a data matrix and the corresponding row and column cluster definition can be generalized as a data matrix for many other applications. Such a practice, however, is inherently limited in analysis. This is because our general understanding of cellular processes elucidates that only a subset of genes is involved with a specific cellular process that becomes active only under some experimental conditions. However, microarrays generally are not specifically designed to meet the requirements of a particular experiment. Consider, for example, that in gene expression profile-based cancer diagnosis, only a subset of genes is related to some cancer type while numerous genes are considered irrelevant. In this case, the inclusion of all genes in sample clustering or all samples in gene clustering not only increases the computational burden, but could impair clustering performance due to the effect of these unrelated genes or samples, which are treated as noise.

Biclustering, first used by Cheng and Church (2000) in the community of bioinformatics, addresses this problem by performing clustering simultaneously on both row (gene) and column (sample) dimensions instead of clustering these two dimensions separately (Busygin et al., 2008, Madeira and Oliveira, 2004). In essence, biclustering can be regarded as a combination of clustering and automatic feature selection if we treat one dimension (e.g., column) as data objects and the other dimension (e.g., row) as description features. This task becomes particularly challenging without ground truth.

Note that the feature selection in biclustering is different from the feature selection that we usually consider in supervised classification. This difference lies in the fact that biclustering selects different subsets of features for different clusters of data objects, while standard feature selection chooses a subset of features from the candidate pool for all data objects. As such, the local relationship between subsets of genes and subsets of samples or conditions is identified with biclustering. Biclustering indicates gene groups that display similar patterns across a set of conditions (important to gene functional annotation and coregulated gene identification DiMaggio et al., 2008a, Lazzeroni and Owen, 2002, Li et al., 2009, Madeira et al., 2010, Segal et al., 2001, Tanay et al., 2002) or gene groups that are related to certain cancer types (for cancer classification discovery and diagnosis Getz et al., 2003, Kluger et al., 2003, Murali and Kasif, 2003, Tang and Zhang, 2005, Tchagang et al., 2008). In contrast, clustering focuses on uncovering global relationships between genes and samples or conditions. In fields other than bioinformatics, biclustering is also known as subspace clustering, co-clustering, or block clustering, among other names Busygin et al. (2008), Kriegel, Kröger, and Zimek (2009) and Madeira and Oliveira (2004). See the informative survey paper Kriegel et al. (2009).

Specifically, let IG and JS be subsets of the rows and columns for the gene expression data matrix E;EIJ=(I,J) is then the submatrix with rows I and columns J. A bicluster corresponds to such a submatrix that exhibits certain homogeneity. The goal of a biclustering algorithm, then, is to identify a set of biclusters with pairs of row and column subsets. The complexity of the biclustering problem has been shown to be NP-complete (Madeira and Oliveira, 2004, Peeters, 2003), which leads to many heuristics falling into the five major categories summarized in Table 2 according to Madeira and Oliveira (2004). For example, Cheng and Church (2000) used the mean squared residue to measure the coherence of the rows and columns in a bicluster, which is defined as, H(I,J)=1|I||J|iI,jJ(eijeiJ¯eIj¯+eIJ¯)2, where eiJ¯ is the mean of the ith row, eIj¯ is the mean of the jth column, and eIJ¯ is the mean of the bicluster. A submatrix EIJ is called a δ-bicluster if H(I,J)δ for some δ0. Possible ways to find the largest δ-biclusters, which should also have relatively high row variance Cheng and Church (2000), include the evolutionary algorithm (Divina & Aguilar-Ruiz, 2006), multi-objective particle swarm optimization (Liu, Li, Hu, & Chen, 2009), and other greedy iterative search approaches, such as the FLOC (FLexible Overlapped biClustering) algorithm (Yang, Wang, Wang, & Yu, 2003). Lazzeroni and Owen (2002) developed a plaid model treating the matrix as the sum of additive layers. This model was further extended to combine external grouping information and to generate biclusters of profiles of repeated measures (Turner, Bailey, Krzanowski, & Hemingway, 2005). Gu and Liu (2007) proposed a Bayesian biclustering model with a Gibbs sampling procedure for statistical inference. DiMaggio et al. (2008b) considered modeling biclustering as either a network flow problem or a traveling salesman problem. A systematic comparison of five greedy search-based biclustering algorithms, together with a reference method (Bimax), was presented in Prelić et al. (2006).

Early cancer diagnosis is critical for prolonging human life. As aforementioned, biclustering overcomes the disadvantage of clustering in dealing with unrelated genes and focuses only on the cancer type-related genes, thus providing a more effective method of cancer classification and diagnosis. Biclustering algorithms that are used for this purpose include the interrelated two-way clustering (ITWC) algorithm (Tang & Zhang, 2005), the coupled two-way clustering (CTWC) algorithm (Getz et al., 2003), xMOTIF (Murali & Kasif, 2003), Binary State Pattern Clustering (BSPC) (Beattie & Robinson, 2006), the qualitative biclustering algorithm (QUBIC) (Li et al., 2009), the spectral biclustering algorithm (Kluger et al., 2003), the double conjugated clustering (DCC) algorithm (Busygin, Jacobsen, & Kramer, 2002), and fuzzy adaptive subspace iteration-based two-way clustering (FASIC) (Shaik & Yeasin, 2009), among others. For example, ITWC generates biclusters by combining clustering results from each dimension of the data matrix in an iterative way (Tang & Zhang, 2005). Within each iteration, a set of gene clusters is first created using standard clustering algorithms, e.g., K-means or self-organizing maps, followed by the independent clustering of the sample dimension based on each gene cluster. The clustering results from the previous steps are then combined, and heterogeneous groups, which are pairs of columns that do not share gene features used for clustering, are identified. Finally, the genes are sorted in descending order of the cosine distance. Only the first one third of sorted genes is kept so that a reduced gene set is obtained for each heterogeneous group. The above steps are then repeated using the reduced gene set until some stopping conditions are satisfied.

CTWC uses hierarchical clustering with the Euclidean distance (Getz et al., 2003), while DCC relies on self-organizing maps with the dot product for similarity (Busygin et al., 2002). FASIC adopts the concept of the fuzzy set theory and applies the fuzzy-adaptive subspace iteration algorithm to generate gene clusters under a progressive clustering framework. Those clusters are then scored to generate p-values indicating the significance of the differential expression (Shaik & Yeasin, 2009). Tchagang et al. demonstrated the effectiveness of biclustering for detecting and diagnosing ovarian cancer (Tchagang et al., 2008). Mitra et al. showed a biclustering application for microRNA expression profile-based cancer classification of multiple types, including melanoma, ovarian, renal, colon, leukemia, and so on (Mitra, Bandyopadhyay, Maulik, & Zhang, 2010).

Here, we present BARTMAP (Biclustering ARTMAP) to perform biclustering on gene expression data, particularly for cancer classification discovery, although BARTMAP can be used to other types of data that have high dimensionalities, e.g., document clustering. BARTMAP is adapted to and modified from a neural-based classifier, Fuzzy ARTMAP (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992), for supervised classification. Fuzzy ARTMAP, together with its other variants, such as Ellipsoid ARTMAP (Anagnostopoulos & Georgiopoulos, 2001) and default ARTMAP (Carpenter, 2003), already have shown great effectiveness in gene expression data analysis and other biomedical applications (Xu et al., 2007, Xu et al., 2009). Similar to Fuzzy ARTMAP, BARTMAP is based on Adaptive Resonance Theory (ART) (Carpenter and Grossberg, 1987, Grossberg, 1976). ART is a learning theory that hypothesizes that resonance in neural circuits can trigger fast learning; it was developed as a solution to the plasticity-stability dilemma.

BARTMAP displays many attractive characteristics that are also inherent and general in the ART family. First, BARTMAP scales very well with large-scale data analysis while maintaining efficiency. As the computational complexity for its ART modules is O(NlogN) or O(N) for one-pass variant (Mulder & Wunsch, 2003), the overall computational cost for BARTMAP is relatively low. Each ART module can dynamically and adaptively generate clusters without the requirement of specifying the number of clusters in advance as is required in many other commonly used clustering algorithms, such as K-means and self-organizing feature maps. Furthermore, BARTMAP is an exemplar-based, transparent learning model. During its learning, the architecture summarizes data via the use of exemplars in order to accomplish its learning objective. This ability contrasts with other, opaque neural network architectures for which it is generally difficult to explain why an input produces a particular output. Another important feature of BARTMAP is its ability to detect atypical patterns during its learning. The detection of such patterns is accomplished via the employment of a match-based criterion that decides to which degree a particular pattern matches the characteristics of an already-formed category in BARTMAP. Finally, BARTMAP is far simpler to implement than, for example, backpropagation for feed-forward neural networks and the training algorithm of support vector machines.

The remainder of this paper is organized as follows. Section 2 begins with a description of Fuzzy ART and Fuzzy ARTMAP, followed by how BARTMAP performs biclustering. The experimental data, design, and results are presented and discussed in Section 3. We conclude the paper in Section 4.

Section snippets

Fuzzy ART

Because Fuzzy ART (Carpenter, Grossberg, & Rosen, 1991) serves as the basic module for both Fuzzy ARTMAP and BARTMAP, we commence this section with a brief introduction of Fuzzy ART.

The basic Fuzzy ART architecture consists of two-layer neurons, the feature representation field F1, and the category representation field F2, as shown in Fig. 1. The neurons in layer F1 are activated by the input pattern that is normalized with the complement coding rule to avoid category proliferation (Carpenter

Data sets

The proposed method is applied to three benchmark data sets in gene expression profile-based cancer research and one data set in microRNA profile-based cancer research.

The first data set is the leukemia data set consists of 72 samples, including bone marrow samples, peripheral blood samples and childhood AML cases (Golub et al., 1999). Twenty-five of these samples are acute myeloid leukemia (AML), and 47 are acute lymphoblastic leukemia (ALL), which is composed of two subcategories due to the

Conclusions

Biclustering has been demonstrated to be a promising method of high dimensional gene expression data analysis, or other types of high-dimensional data analysis, by automatically combining feature selection with clustering without any prior information. Here, we show that a well-known supervised classification system, ARTMAP, can be modified to achieve biclustering in a fast, stable, and adaptive way. Experimental results on benchmark gene expression or microRNA expression data sets indicate the

Acknowledgements

Partial support for this research from the National Science Foundation, the Missouri University of Science & Technology Intelligent Systems Center, and the M.K. Finley Missouri endowment, is gratefully acknowledged.

References (59)

  • J. Bezdek

    Pattern recognition with fuzzy objective function algorithms

    (1981)
  • Busygin, S., Jacobsen, G., & Kramer, E. (2002). Double conjugated clustering applied to leukemia microarray data. In:...
  • Carpenter, G. (2003). Default ARTMAP. In Proceedings of the international conference on neural networks (pp....
  • G. Carpenter et al.

    Fuzzy ARTMAP: a neural network architecture for incremental supervised learning of analog multidimensional maps

    IEEE Transactions on Neural Networks

    (1992)
  • Cheng, Y., & Church, G. (2000). Biclustering of expression data. In Proceedings of eighth international conference on...
  • P. DiMaggio et al.

    Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies

    BMC Bioinformatics

    (2008)
  • P. DiMaggio et al.

    Biclustering via optimal re-ordering of data matrices in systems biology: rigorous methods and comparative studies

    BMC Bioinformatics

    (2008)
  • F. Divina et al.

    Biclustering of expression data with evolutionary computation

    IEEE Transactions on Knowledge and Data Engineering

    (2006)
  • D. Duffy et al.

    A permutation based algorithm for block clustering

    Journal of Classification

    (1991)
  • M. Eisen et al.

    Cluster analysis and display of genome-wide expression patterns

    Proceedings of National Academic Science USA

    (1998)
  • G. Getz et al.

    Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data

    Bioinformatics

    (2003)
  • E. Gokcay et al.

    Information theoretic clustering

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2002)
  • T. Golub et al.

    Molecular classification of cancer: class discovery and class prediction by gene expression monitoring

    Science

    (1999)
  • S. Grossberg

    Adaptive pattern recognition and universal encoding II: feedback, expectation, olfaction, and illusions

    Biological Cybernetics

    (1976)
  • J. Gu et al.

    Bayesian biclustering of gene expression data

    BMC Genomics

    (2007)
  • J. Hartigan

    Direct clustering of a data matrix

    Journal of American Statistical Association

    (1972)
  • L. Hubert et al.

    Comparing partitions

    Journal of Classification

    (1985)
  • D. Jiang et al.

    Cluster analysis for gene expression data: a survey

    IEEE Transactions on Knowledge and Data Engineering

    (2004)
  • J. Khan et al.

    Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks

    Nature Medicine

    (2001)
  • Cited by (41)

    • Topological biclustering ARTMAP for identifying within bicluster relationships

      2023, Neural Networks
      Citation Excerpt :

      Such visualization aims to provide a way to observe associations between different biclusters. This section presents a compendium on TopoART (Tscherepanow, 2010) and BARTMAP (Xu & Wunsch II, 2011) while briefly introducing Adaptive Resonance Theory thus furnishing principal details for explicating TopoBARTMAP. The reader is referred to Brito da Silva, Elnabarawy, and Wunsch I.I. (2019), Carpenter and Grossberg (2010), Carpetner, Grossberg, and Rosen (1991), Grossberg (2013), Wunsch II (2019) for a rigorous exploration of “Winner-Take-All” (WTA) learning models based on Adaptive Resonance Theory (ART) and their applications.

    • Self-organizing subspace clustering for high-dimensional and multi-view data

      2020, Neural Networks
      Citation Excerpt :

      They take advantage of a neural network’s capacity for performing a relevant number of categorization tasks. Considering a timeline, the first algorithms are Generalized Relevance Learning Vector Quantization (GLVQ) (Hammer & Villmann, 2002) and Biclustering ARTMAP (BARTMAP) (Xu & Wunsch I.I., 2011). Then several SOM-based approaches were proposed: The Dimension Selective Self-organizing Map (DSSOM) (Bassani & Araújo, 2012) and the Local Adaptive Receptive Field Dimension Selective Self-organizing Map (LARFDSSOM) (Bassani & Araújo, 2015).

    • Distributed dual vigilance fuzzy adaptive resonance theory learns online, retrieves arbitrarily-shaped clusters, and mitigates order dependence

      2020, Neural Networks
      Citation Excerpt :

      ART templates have specific properties and governing equations based on their internal representation, e.g., hyperboxes (Carpenter, Grossberg and Rosen, 1991); Gaussians (Vigdor & Lerner, 2007; Williamson, 1996); hyperspheres (Anagnostopoulos & Georgiopulos, 2000); hyperellipses (Anagnostopoulos & Georgiopoulos, 2001); and others. Numerous ART-based architectures have been conceived, such as predictive ART (ARTMAP) for supervised mappings (Carpenter, Grossberg, Markuzon, Reynolds, & Rosen, 1992; Carpenter, Grossberg and Reynolds, 1991); fusion ART (Tan, Carpenter, & Grossberg, 2007), whose variants have been effectively used for semi-supervised (Meng, Tan, & Xu, 2014), supervised (Tan, 1995), and reinforcement learning applications (Tan, 2004, 2006; Tan, Lu, & Xiao, 2008); Biclustering ARTMAP (BARTMAP) (Xu & Wunsch II, 2011) for biclustering applications, such as gene expression analysis (Xu & Wunsch II, 2011) and collaborative filtering (Elnabarawy, Wunsch II, & Abdelbar, 2016); and ART networks endowed with multiple vigilance tests (Brito da Silva & Wunsch II, 2017; Gomez-Sanchez, Dimitriadis, Cano-Izquierdo, & Lopez-Coronado, 2001; Huang, Cheng, Shih, & Chen, 2014; Seiffertt & Wunsch II, 2010). A brief review of ART networks related to the contributions of this work is provided next, where emphasis is given to the architectures used for comparison purposes, thereby making this paper self-contained.

    View all citing articles on Scopus
    View full text