Chapter 3 Matrix Factorization for Recovery of Biological Processes from Microarray Data

https://doi.org/10.1016/S0076-6879(09)67003-8

Abstract

We explore a number of matrix factorization methods in terms of their ability to identify signatures of biological processes in a large gene expression study. We focus on the ability of these methods to find signatures in terms of gene ontology enhancement and on the interpretation of these signatures in the samples. Two Bayesian approaches, Bayesian Decomposition (BD) and Bayesian Factor Regression Modeling (BFRM), perform best. Differences in the strength of the signatures between the samples suggest that BD will be most useful for systems modeling and BFRM for biomarker discovery.

Introduction

Microarray technology introduced a new complexity into biological studies through the simultaneous measurement of thousands of variables, replacing a technique (the Northern blot) that typically measured at most tens of variables. Traditional analysis focused on measurements with minimal statistical complexity, but direct application of such tests (e.g., the t-test) to microarrays resulted in massive numbers of “significant” differentially regulated genes, when reality suggested far fewer. There were a number of reasons for the failure of these tests, including the small number of replicates leading to chance detection when tens of thousands of variables were measured (Tusher et al., 2001), the unmodeled covariance arising from coordinated expression (Kerr et al., 2002), and non-gene-specific error models (Hughes et al., 2000). While a number of statistical issues have now been successfully addressed (Allison et al., 2006), two aspects of the biology of gene expression raise difficulties for many analyses.

The issues can be noted in a simple model of signaling in the yeast S. cerevisiae. In Fig. 3.1, the three overlapping MAPK pathways are shown. The pathways share a number of upstream regulatory components (e.g., Ste11), and regulate sets of genes divided here into five groups (A–E), with a few of the many known targets shown. The Fus3 mating response MAPK protein activates the Ste12 transcription factor, leading to expression of groups A and B. The Kss1 filamentation response MAPK protein activates the Ste12–Tec1 regulatory complex, leading to expression of groups B, C, and D. The Hog1 high-osmolarity response MAPK protein activates the Sko1 transcription factor, leading to expression of groups D and E. The standard methods used in microarray analysis look for genes that are differentially expressed between two states. If we imagine those two states as mating activation and filamentation activation, we identify genes associated with each process, but we do not identify all genes associated with either process. Alternatively, clustering in an experiment where each process is independently active will lead to identification of five clusters (one for each group A–E) even though only three processes are active. Naturally, the complexity is substantially greater in practice: there is no true isolation of a single biological process, since any system with only a single process active would be dead, and any measurement is therefore convolved with measurements of ongoing biological behavior required for survival, homeostasis, or growth. These processes use many of the same genes, due to the borrowing of gene function that has occurred throughout evolution. [Note: for S. cerevisiae, plain text Ste12 indicates the protein, while italic text ste12 indicates the gene.]

Essentially, this example shows the two underlying biological principles that need to be addressed in many analyses of high-throughput data: multiple regulation of genes due to gene reuse in different biological processes and nonorthogonality of biological process activity arising from the natural simultaneity of biological behaviors. Mathematically, we can state the problem as a matrix factorization problem:

$$D_{ij} = \sum_{k=1}^{P} A_{ik} P_{kj} + \varepsilon_{ij} \tag{3.1}$$

where D is the data matrix comprising measurements on N genes (or other entities) indexed by i across M conditions indexed by j, P is the pattern matrix for P patterns indexed by k, A is the amplitude or weighting matrix that determines how much of each gene's behavior can be attributed to each pattern, and ɛ is the error matrix. P is essentially a collection of basis vectors for the factorization into P dimensions, and as such it is often useful to normalize the rows of P to sum to 1. This makes the A matrix similar to loading or score matrices, such as in principal component analysis (PCA). It is useful to note here that the nonindependence of biological processes is equivalent to nonorthogonality of the rows of P, indicating the factorization is ideally into a basis space that reflects underlying biological behaviors but is not orthonormal.
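
To make the notation concrete, the following sketch builds a small synthetic D from a nonorthogonal pattern matrix and a nonnegative amplitude matrix under Eq. (3.1). The dimensions, distributions, and noise level here are arbitrary choices for illustration and are not part of the published analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 1000, 20, 3                    # genes, conditions, number of patterns

# Nonorthogonal pattern matrix: processes active simultaneously across conditions.
P = rng.random((K, M))
P /= P.sum(axis=1, keepdims=True)        # normalize each row of P to sum to 1

# Amplitude matrix: a gene may load on several patterns (gene reuse).
A = rng.gamma(shape=1.0, scale=1.0, size=(N, K))

# Data model of Eq. (3.1): D = A P + noise
D = A @ P + rng.normal(scale=0.05, size=(N, M))

# Nonorthogonality of the processes appears as nonzero off-diagonal
# correlations between the rows of P.
print(np.corrcoef(P))
```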

We introduced Bayesian Decomposition (BD), a Markov chain Monte Carlo algorithm, to address these fundamental biological issues in microarray studies (Moloshok et al., 2002), extending our original work in spectroscopy (Ochs et al., 1999). Kim and Tidor introduced nonnegative matrix factorization (NMF), created by Lee and Seung (1999), into microarray analysis (Brunet et al., 2004, Kim and Tidor, 2003), for the same reason. Subsequently, it was realized that sparseness aids in identifying biologically meaningful processes, and sparse NMF was introduced (Gao and Church, 2005). Fortuitously, due to its original use in spectroscopy, sparseness was already a feature of BD through its atomic prior (Sibisi and Skilling, 1997). More recently, Carvalho and colleagues introduced Bayesian factor regression modeling (BFRM), an additional Markov chain Monte Carlo method, for microarray data analysis (Carvalho et al., 2008).
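
For reference, the core of NMF is compact enough to sketch. The toy implementation below uses the Lee and Seung (1999) multiplicative updates for the Frobenius-norm objective; it only illustrates how positivity is maintained and is not the published NMF or sparse NMF code used in the comparison.

```python
import numpy as np

def nmf(D, k, n_iter=500, eps=1e-9, seed=0):
    """Toy NMF via Lee-Seung multiplicative updates minimizing ||D - A P||_F^2.

    D must be nonnegative (genes x conditions); returns nonnegative A and P.
    """
    rng = np.random.default_rng(seed)
    n, m = D.shape
    A = rng.random((n, k)) + eps
    P = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of A and P nonnegative.
        P *= (A.T @ D) / (A.T @ A @ P + eps)
        A *= (D @ P.T) / (A @ P @ P.T + eps)
    return A, P
```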

Targeted methods that directly model multiple sources of biological information have been introduced as well. Liao and Roychowdhury introduced network component analysis (NCA), which relied on information about the binding of transcriptional regulators to help isolate the signatures of biological processes (Liao et al., 2003). The use of information on transcriptional regulation can also aid in sparseness, as shown by its inclusion in BD as prior information (Kossenkov et al., 2007).

These methods have been developed and applied primarily to microarray data, as it was the first high-throughput biological data that included dynamic behavior, in contrast to sequence data. Microarrays were developed independently by a number of groups in the 1990s (Lockhart et al., 1996, Schena et al., 1995), and their use is now widespread. A number of technical issues plagued early arrays, and error rates were high. The development of normalization and other preprocessing procedures improved data reproducibility and robustness (Bolstad et al., 2003, Cheng and Wong, 2001, Irizarry et al., 2003), leading to studies that demonstrated the ability to produce meaningful datasets from arrays run in different laboratories at different times (English and Butte, 2007). Data can be accessed, though not always with useful metadata, in the GEO and ArrayExpress repositories (Edgar et al., 2002, Parkinson et al., 2005).
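
As an illustration of the kind of preprocessing involved, a minimal quantile normalization in the spirit of Bolstad et al. (2003) can be written as follows; this simplified version ignores tie handling and the other refinements found in production implementations.

```python
import numpy as np

def quantile_normalize(X):
    """Force every array (column of X) to share the same empirical distribution,
    taken as the mean of the sorted columns; ties are handled naively."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # within-array rank of each value
    reference = np.sort(X, axis=0).mean(axis=1)         # shared reference distribution
    return reference[ranks]
```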

However, the methods discussed here are also suitable for other high-throughput data where the fundamental assumptions of multiple overlapping sets within the data and nonorthogonality of these sets across the samples hold. In the near future, these data are likely to include large-scale proteomics measurements and metabolite measurements.

We have previously undertaken a study of some of these methods to determine their ability to solve Eq. (3.1) using simulations of the cell cycle (Kossenkov and Ochs, 2009). This study did not address the recovery of biologically meaningful patterns from real data, where numerous unknowns exist. Most of these relate to the fundamental issue that separates biological studies from those in physics and chemistry: in biology we are unable to isolate variables of interest away from other unknowns, as to do so is to kill the organism under study. Instead, we must perform studies in a background of incomplete knowledge of the activities a cell is undertaking and incomplete knowledge of the entities (e.g., genes, proteins) associated with these processes. In addition, sampling is difficult and therefore tends to be limited (i.e., many variables measured on few samples), and the data remain prone to substantial variance, perhaps due to true biological variation rather than technical issues.

We have undertaken a new analysis of the Rosetta compendium, a dataset of quadruplicate measurements of 300 yeast gene knockouts and chemical treatments (Hughes et al., 2000), to determine how well various matrix factorization methods recover signatures of biological processes. The Rosetta study included 63 control replicates of wild-type yeast grown in rich media, allowing a gene-specific error model. One interesting result to emerge from this work is that roughly 10% of yeast genes appear to be under limited transcriptional regulation, so that their transcript levels vary by orders of magnitude without a corresponding variation in protein levels or phenotype. This has obvious implications for studies where whole genome transcript levels are measured on limited numbers of replicates.
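
In its simplest form, a gene-specific error model just estimates per-gene variability from the control replicates and scores treatments against it. The sketch below is a simplified stand-in for the more elaborate error model used by Hughes et al. (2000), shown only to make the idea concrete.

```python
import numpy as np

def gene_specific_z(controls, treatment):
    """Per-gene error model from replicate control arrays.

    controls:  genes x replicates matrix of control (e.g., log-ratio) measurements
    treatment: vector of measurements for one treatment, same gene order
    Returns a z-score for each gene relative to its own control variability."""
    mu = controls.mean(axis=1)
    sigma = controls.std(axis=1, ddof=1)
    return (treatment - mu) / sigma
```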

Using known biological behaviors that are affected by specific gene knockouts, we compared a number of methods from clustering through the matrix factorization methods discussed above to determine how well such methods recover biological information from microarray measurements. We first give a brief description of each method, then we present the dataset and results of our analyses.

Clustering techniques

To provide a baseline for comparison, we applied two widely used clustering techniques to the dataset, as well as an approach where genes were assigned to groups at random. Hierarchical clustering (HC) was introduced for microarray work by Eisen et al. (1998), and because of easy-to-use software and its head start as the first such technique, it has seen significant use and is available in desktop tools (Saeed et al., 2006). HC, as performed by most users, is done in an agglomerative fashion, using a
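
A typical agglomerative analysis of this kind can be reproduced with standard tools. The sketch below uses a correlation-based distance and average linkage, which are common choices for Eisen-style clustering but by no means the only ones.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_genes(D, n_clusters=10):
    """Agglomerative clustering of genes (rows of D) with a correlation distance."""
    dist = pdist(D, metric="correlation")        # 1 - Pearson correlation between genes
    tree = linkage(dist, method="average")       # build the merge tree bottom-up
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```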

Application to the Rosetta Compendium

The sample dataset for this study is generated from experiments on the yeast S. cerevisiae, which has been studied in depth for a number of biological processes, including the eukaryotic cell cycle, transcriptional and translational control, cell wall construction, mating, filamentous growth, and response to high osmolarity. There is substantial existing biological data on gene function, providing a large set of annotations for analysis (Guldener et al., 2005, Mewes et al., 2004). In addition,

Results of Analyses

Although the fundamental goal of the nonclustering methods is the optimal solution to Eq. (3.1), albeit potentially with covariates as in Eq. (3.8), the methods differ substantially in their treatment of the data. BD, as applied here, and the NMF methods require positivity in A and P, while NCA, ICA, PCA, and BFRM allow negative values. The A matrix is still easily interpreted in terms of enrichment of the gene ontology terms from Table 3.1; however, the P matrix can vary greatly in its
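
Enrichment of gene ontology terms among the genes weighted strongly in a column of A is typically assessed with a one-sided hypergeometric test. A minimal version, with hypothetical gene-set inputs, is:

```python
from scipy.stats import hypergeom

def go_enrichment_p(pattern_genes, term_genes, all_genes):
    """One-sided hypergeometric p-value for enrichment of a GO term among the
    genes with high weight in one column of A (all arguments are Python sets)."""
    N = len(all_genes)                     # genes measured on the array
    K = len(term_genes & all_genes)        # of those, genes annotated to the term
    n = len(pattern_genes)                 # genes attributed to the pattern
    k = len(pattern_genes & term_genes)    # overlap
    return hypergeom.sf(k - 1, N, K, n)    # P(X >= k) under the hypergeometric null
```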

Discussion

The matrix factorization methods discussed in this work have significantly different designs, and this affects their value for different types of analysis. PCA is very fast and decomposes the data into a series of PCs that capture maximum variance at each potential dimensionality. This can be very powerful for denoising data if only the strongest PCs are retained, although this has not been successful for high-throughput biological data. In addition, PCA can provide insight to the strongest
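
Denoising by retaining only the strongest PCs amounts to a low-rank reconstruction via the SVD. A minimal sketch (centering by condition, an arbitrary choice made here for illustration) is:

```python
import numpy as np

def pca_denoise(D, n_components):
    """Low-rank reconstruction of D keeping only the strongest principal components."""
    mean = D.mean(axis=0)
    U, s, Vt = np.linalg.svd(D - mean, full_matrices=False)
    s[n_components:] = 0                   # discard the weaker components
    return (U * s) @ Vt + mean
```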

References (40)

  • T.R. Hughes et al., Functional discovery via a compendium of expression profiles, Cell (2000)
  • M.F. Ochs et al., A new method for spectral decomposition using a bilinear Bayesian approach, J. Magn. Reson. (1999)
  • A.I. Saeed et al., TM4 microarray software suite, Methods Enzymol. (2006)
  • D.B. Allison et al., Microarray data analysis: From disarray to consolidation and consensus, Nat. Rev. Genet. (2006)
  • O. Alter et al., Singular value decomposition for genome-wide expression data processing and modeling, Proc. Natl. Acad. Sci. USA (2000)
  • G. Bidaut et al., ClutrFree: Cluster tree visualization and interpretation, Bioinformatics (2004)
  • G. Bidaut et al., Determination of strongly overlapping signaling activity from microarray data, BMC Bioinform. (2006)
  • B.M. Bolstad et al., A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics (2003)
  • J.P. Brunet et al., Metagenes and molecular pattern discovery using matrix factorization, Proc. Natl. Acad. Sci. USA (2004)
  • P. Carmona-Saez et al., Biclustering of gene expression data by non-smooth non-negative matrix factorization, BMC Bioinform. (2006)
  • C.M. Carvalho et al., High-dimensional sparse factor modelling: Applications in gene expression genomics, J. Am. Stat. Assoc. (2008)
  • L. Cheng et al., Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, Proc. Natl. Acad. Sci. USA (2001)
  • K.R. Christie et al., Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms, Nucleic Acids Res. (2004)
  • R. Edgar et al., Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Res. (2002)
  • M.B. Eisen et al., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA (1998)
  • S.B. English et al., Evaluation and integration of 49 genome-wide experiments and the prediction of previously unknown obesity-related genes, Bioinformatics (2007)
  • A. Frigyesi et al., Independent component analysis reveals new and biologically significant structures in micro array data, BMC Bioinform. (2006)
  • Y. Gao et al., Improving molecular cancer class discovery through sparse non-negative matrix factorization, Bioinformatics (2005)
  • S. Geman et al., Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. (1984)
  • U. Guldener et al., CYGD: The Comprehensive Yeast Genome Database, Nucleic Acids Res. (2005)