2003 | Book

The Analysis of Gene Expression Data

Methods and Software

Edited by: Giovanni Parmigiani, Elizabeth S. Garrett, Rafael A. Irizarry, Scott L. Zeger

Publisher: Springer New York

Book series: Statistics for Biology and Health

About this book

The development of technologies for high-throughput measurement of gene expression in biological systems is providing powerful new tools for investigating the transcriptome on a genomic scale, and across diverse biological systems and experimental designs. This technological transformation is generating an increasing demand for data analysis in biological investigations of gene expression. This book focuses on data analysis of gene expression microarrays. The goal is to provide guidance to practitioners in deciding which statistical approaches and packages may be indicated for their projects, in choosing among the various options provided by those packages, and in correctly interpreting the results.

The book is a collection of chapters written by authors of statistical software for microarray data analysis. Each chapter describes the conceptual and methodological underpinning of data analysis tools as well as their software implementation, and will enable readers to both understand and implement an analysis approach. Methods touch on all aspects of statistical analysis of microarrays, from annotation and filtering to clustering and classification. All software packages described are free to academic users.

The materials presented cover a range of software tools designed for varied audiences. Some chapters describe simple menu-driven software in a user-friendly fashion and are designed to be accessible to microarray data analysts without formal quantitative training. Most chapters are directed at microarray data analysts with master’s-level training in computer science, biostatistics, or bioinformatics. A minority of more advanced chapters are intended for doctoral students and researchers.

Table of contents

Frontmatter
1. The Analysis of Gene Expression Data: An Overview of Methods and Software
Abstract
This chapter is a rough map of the book. It provides a concise overview of data-analytic tasks associated with microarray studies, pointers to chapters that can help perform these tasks, and connections with selected data-analytic tools not covered in any of the chapters. We wish to give a general orientation before moving to the detailed discussion provided by individual chapters. A comprehensive review of microarray data analysis methods is beyond the scope of this introduction.
Giovanni Parmigiani, Elizabeth S. Garrett, Rafael A. Irizarry, Scott L. Zeger
2. Visualization and Annotation of Genomic Experiments
Abstract
We provide a framework for reducing and interpreting results of multiple microarray experiments. The basic tools are a flexible gene-filtering procedure, a dynamic and extensible annotation system, and methods for visualization. The gene-filtering procedure efficiently evaluates families of deterministic or statistical predicates on collections of expression measurements. The expression-filtering predicates may involve reference to arbitrarily complex predicates on phenotype or genotype data. The annotation system collects mappings between manufacturer-specified probe set identifiers and public use nomenclature, ontology, and bibliographic systems. Visualization tools allow the exploration of the experimental data with respect to genomic quantities such as chromosomal location or functional groupings.
Robert Gentleman, Vincent Carey
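
The gene-filtering idea in the chapter 2 abstract, evaluating a family of predicates on each gene's expression measurements, can be illustrated with a minimal Python sketch. This is not the chapter's software or its API: the predicate builders `k_over_a` and `min_range`, the helper `filter_genes`, and the probe names and values are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# A predicate takes one gene's expression values and returns True to keep it.
Predicate = Callable[[Sequence[float]], bool]


def k_over_a(k: int, a: float) -> Predicate:
    """Hypothetical predicate: at least k measurements are >= a."""
    def predicate(values: Sequence[float]) -> bool:
        return sum(v >= a for v in values) >= k
    return predicate


def min_range(r: float) -> Predicate:
    """Hypothetical predicate: max minus min of the measurements is >= r."""
    def predicate(values: Sequence[float]) -> bool:
        return max(values) - min(values) >= r
    return predicate


def filter_genes(expression: Dict[str, List[float]],
                 predicates: Sequence[Predicate]) -> List[str]:
    """Keep the genes for which every predicate in the family holds."""
    return [gene for gene, values in expression.items()
            if all(p(values) for p in predicates)]


# Made-up expression values for three probes across four arrays.
expression = {
    "probe_A": [120.0, 340.0, 95.0, 410.0],
    "probe_B": [15.0, 22.0, 18.0, 30.0],
    "probe_C": [205.0, 210.0, 198.0, 202.0],
}

kept = filter_genes(expression, [k_over_a(2, 200.0), min_range(100.0)])
print(kept)  # ['probe_A']
```

Composing small predicates this way keeps each filtering rule independent, so deterministic and statistical criteria can be mixed in one pass over the data.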
3. Bioconductor R Packages for Exploratory Analysis and Normalization of cDNA Microarray Data
Abstract
This chapter describes a collection of four R packages for exploratory analysis and normalization of two-color cDNA microarray fluorescence intensity data. R’s object-oriented class/method mechanism is exploited to allow efficient and systematic representation and manipulation of large microarray datasets of multiple types. The marrayClasses package contains class definitions and associated methods for pre- and postnormalization intensity data for batches of arrays. The marrayInput package provides functions and tcltk widgets to automate data input and the creation of microarray-specific R objects for storing these data. Functions for diagnostic plots of microarray spot statistics, such as boxplots, scatterplots, and spatial color images, are provided in marrayPlots. Finally, the marrayNorm package implements robust adaptive location and scale normalization procedures, which correct for different types of dye biases (e.g., intensity, spatial, plate biases) and allow the use of control sequences spotted onto the array and possibly spiked into the mRNA samples. The four new packages were developed as part of the Bioconductor project, which aims more generally to produce an open-source and open-development statistical computing framework for the analysis of genomic data.
Sandrine Dudoit, Jean Yee Hwa Yang
4. An R Package for Analyses of Affymetrix Oligonucleotide Arrays
Abstract
We describe an extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe-level data. The software utilities provided with the Affymetrix analysis suite summarize the probe set intensities and make available only one expression measure for each gene. We have developed this package because much can be learned from studying the individual probe intensities or, as we call them, the probe-level data. We provide some examples demonstrating that having access to and methods for probe-level data results in improvements to quality control assessments, normalization, and expression measures. The software is implemented as an add-on package, conveniently named affy, to the freely available and widely used statistical language/software R (Ihaka and Gentleman, 1996). The development of this software as an add-on to R allows us to take advantage of the basic mathematical and statistical functions and powerful graphics capabilities that are provided with R. Our package is distributed as open source code for Linux, Unix, and Microsoft Windows. It is released under the GNU General Public License. It is part of the Bioconductor project and can be obtained from http://www.bioconductor.org.
Rafael A. Irizarry, Laurent Gautier, Leslie M. Cope
5. DNA-Chip Analyzer (dChip)
Abstract
DNA-Chip Analyzer (dChip) is a software package implementing model-based expression analysis of oligonucleotide arrays and several high-level analysis procedures. The model-based approach allows probe-level analysis on multiple arrays. By pooling information across multiple arrays, it is possible to assess standard errors for the expression indexes. This approach also allows automatic probe selection in the analysis stage to reduce errors due to cross-hybridizing probes and image contamination. High-level analysis in dChip includes comparative analysis and hierarchical clustering. The software is freely available to academic users at www.dchip.org.
Cheng Li, Wing Hung Wong
6. Expression Profiler
Abstract
Expression Profiler (EP, http://ep.ebi.ac.uk/) is a set of tools for the analysis and interpretation of gene expression and other functional genomics data. These tools perform expression data clustering, visualization, and analysis, integration of expression data with protein interaction data and functional annotations, such as Gene Ontology, and the analysis of promoter sequences for predicting transcription factor binding sites. Several clustering analysis method implementations and tools for sequence pattern discovery provide a rich data mining environment for various types of biological data. All the tools are Web-based, with minimal browser requirements. Analysis results are cross-linked to other databases and tools are available on the Internet. This enables further integration of the tools with databases, for instance with public microarray gene expression databases such as ArrayExpress.
Jaak Vilo, Misha Kapushesky, Patrick Kemmeren, Ugis Sarkans, Alvis Brazma
7. An S-PLUS Library for the Analysis and Visualization of Differential Expression
Abstract
This chapter describes a genomics library for S-PLUS® 6. One focus of the work involves the testing of hypotheses regarding differential expression. In this area, we provide methods and S-PLUS functions for expression error estimation based on pooling errors within genes and between duplicate arrays for genes in which expression values are similar. This is motivated by the observation that errors between duplicates vary as a function of the average gene expression intensity and by the fact that many gene expression studies are implemented with a limited number of replicated arrays (Lee, 2002).
Our clustering and visualization methods take advantage of S-PLUS Graphlets™, lightweight applets that are simply created using the Java and XML-based graphics classes and the java.graph graphics device that are new to S-PLUS 6. In addition to providing interactive graphs in a Web browser, the Graphlets enable connection to gene-information databases such as NCBI GenBank. Such connectivity facilitates incorporation of additional annotation information into the graphical and tabular summaries via database querying on the URL. The S-PLUS 6 genomics library is available at www.insightful.com/arrayAnalyzer. This site also provides updates regarding ongoing work in genomics and related areas at Insightful.
Jae K. Lee, Michael O’Connell
8. Dragon and Dragon View: Methods for the Annotation, Analysis, and Visualization of Large-Scale Gene Expression Data
Abstract
Database Referencing of Array Genes ONline (DRAGON) is a database system that consists of information derived from publicly available databases, including UniGene, Swiss Prot, Pfam, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). Through a Web-accessible interface, DRAGON rapidly supplies information pertaining to a range of biological characteristics of all the genes in any large-scale gene expression dataset. The subsequent inclusion of this information during data analysis allows for deeper insight into gene expression patterns. A related set of visualization tools called DRAGON View has been developed to allow for the analysis of large-scale gene expression datasets in relation to biological characteristics of gene sets.
Christopher M. L. S. Bouton, George Henry, Carlo Colantuoni, Jonathan Pevsner
9. Snomad: Biologist-Friendly Web Tools for the Standardization and NOrmalization of Microarray Data
Abstract
The use of DNA microarrays and other gene expression analysis techniques throughout the biological sciences has put extremely large, complex datasets in the hands of biologists who, for the most part, are not formally trained in computational or statistical methods. The majority of gene expression datasets have extensive artifactual bias and/or noise, which are not apparent upon superficial inspection. The SNOMAD gene expression analysis tools are an effort to make important normalization and quality control methods available to a wide audience of biological scientists working with gene expression data. Methods available in the SNOMAD tools include background subtraction, global mean normalization, local mean normalization across absolute intensity, local variance correction across absolute intensity, and ratio correction across the physical surface of the microarray. The SNOMAD Web implementation, available free of charge to all researchers at http://pevsnerlab.kennedykrieger.org/snomad.htm, provides these tools without the downloading or installation of additional software, and does not require users to have any statistical or computer programming expertise.
Carlo Colantuoni, George Henry, Christopher M. L. S. Bouton, Scott L. Zeger, Jonathan Pevsner
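
Of the methods listed in the chapter 9 abstract, global mean normalization is the simplest to illustrate. The Python sketch below shows one common form of it: each array is rescaled so that its mean intensity equals a shared target. It is not SNOMAD's implementation. The function name `global_mean_normalize`, the multiplicative rescaling, and the choice of the average of the array means as the default target are assumptions made for illustration.

```python
from statistics import mean
from typing import List, Optional


def global_mean_normalize(arrays: List[List[float]],
                          target: Optional[float] = None) -> List[List[float]]:
    """Rescale each array so that its mean intensity equals `target`.

    Assumption: when no target is given, the average of the per-array
    means is used, so all arrays end up on a common scale.
    """
    array_means = [mean(a) for a in arrays]
    if target is None:
        target = mean(array_means)
    # One multiplicative factor per array; relative intensities within
    # an array are unchanged.
    return [[v * target / m for v in a] for a, m in zip(arrays, array_means)]


# Two made-up arrays measuring the same three genes; the second array is
# uniformly half as bright as the first.
arrays = [[100.0, 200.0, 300.0], [50.0, 100.0, 150.0]]
normalized = global_mean_normalize(arrays)
print(normalized)  # [[75.0, 150.0, 225.0], [75.0, 150.0, 225.0]]
```

A single global factor only removes overall brightness differences between arrays. Intensity-dependent or spatial biases need the local methods the abstract lists.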
10. Microarray Analysis Using the MicroArray Explorer
Abstract
The MicroArray Explorer (MAExplorer) is an open-source Java-based microarray data-mining tool that is available as both a ready-to-run program and source code from the open-source Web site http://maexplorer.sourceforge.net/. MAExplorer helps analyze expression patterns of individual genes, gene families, and clusters of genes. It is used as a stand-alone Java application and may be used for both ratio and intensity quantified array data (e.g., Cy3/Cy5, Affymetrix, and others). Data-mining sessions may be saved for continuation at later times or shared with collaborators; significant gene subsets, plots, and reports may be saved on the local disk. Extensions, called MAEPlugins, enable users to add new analysis methods and access to new genomic databases as they become available. MAExplorer was implemented in Java so that the same software could run on many platforms (e.g., Windows, MacOS 8/9 and X, Solaris, Linux, and Unix).
Peter F. Lemkin, Gregory C. Thornwall, Jai Evans
11. Parametric Empirical Bayes Methods for Microarrays
Abstract
We have developed an empirical Bayes methodology for gene expression data to account for replicate arrays, multiple conditions, and a range of modeling assumptions. The methodology is implemented in an R library called EBarrays. Functions in the library calculate posterior probabilities of patterns of differential expression across multiple conditions. This chapter provides an overview of the methodology and its implementation in EBarrays.
Michael A. Newton, Christina Kendziorski
12. SAM Thresholding and False Discovery Rates for Detecting Differential Gene Expression in DNA Microarrays
Abstract
SAM is a computer package for correlating gene expression with an outcome parameter such as treatment, survival time, or diagnostic class. It thresholds an appropriate test statistic and reports the q-value of each test based on a set of sample permutations. SAM works as a Microsoft Excel add-in and has additional features for fold-change thresholding and block permutations. Here, we explain how the SAM methodology works in the context of a general approach to detecting differential gene expression in DNA microarrays. Some recently developed methodology for estimating false discovery rates and q-values has been included in the SAM software, which we summarize here.
John D. Storey, Robert Tibshirani
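
The chapter 12 abstract describes thresholding a test statistic and using sample permutations to assess false discoveries. A heavily simplified Python sketch of that general idea follows. It is not the SAM algorithm: it uses a plain absolute difference of group means rather than SAM's statistic, it does not estimate the proportion of true nulls, and it reports a single false discovery rate estimate at one threshold rather than per-gene q-values. All function names and data are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def abs_mean_difference(values: Sequence[float], labels: Sequence[int]) -> float:
    """Absolute difference between the group-1 mean and the group-0 mean."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(mean(group1) - mean(group0))


def count_called(data: List[List[float]], labels: Sequence[int],
                 threshold: float) -> int:
    """Number of genes whose statistic reaches the threshold."""
    return sum(abs_mean_difference(row, labels) >= threshold for row in data)


def permutation_fdr(data: List[List[float]], labels: Sequence[int],
                    threshold: float, n_perm: int = 200,
                    seed: int = 0) -> Tuple[int, float]:
    """Estimate the FDR at `threshold` by permuting the sample labels.

    Returns (number of genes called, estimated FDR), where the FDR estimate
    is the average number of genes called under permuted labels divided by
    the number called with the real labels, capped at 1.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    called = count_called(data, labels, threshold)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes are preserved by shuffling
        null_counts.append(count_called(data, shuffled, threshold))
    expected_false = mean(null_counts)
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, fdr


# Made-up data: three genes (rows) on six arrays, three arrays per group.
data = [
    [10.0, 11.0, 9.0, 1.0, 0.0, 2.0],   # clearly different between groups
    [5.0, 5.2, 4.9, 5.1, 5.0, 4.8],     # essentially unchanged
    [3.0, 3.1, 2.9, 3.0, 3.2, 3.1],     # essentially unchanged
]
labels = [1, 1, 1, 0, 0, 0]

called, fdr = permutation_fdr(data, labels, threshold=5.0)
print(called, fdr)
```

With so few arrays there are only a handful of distinct label arrangements, so the permutation estimate is coarse. The sketch is meant to show the mechanics, not to suggest a usable sample size.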
13. Adaptive Gene Picking with Microarray Data: Detecting Important Low Abundance Signals
Abstract
DNA microarrays to evaluate gene expression present tremendous opportunities for understanding complex biological processes. However, important genes, such as transcription factors and receptors, are expressed at low levels, potentially leading to negative values after adjusting for background. These low-abundance transcripts have previously been ignored or handled in an ad hoc way. We describe a method that analyzes genes with low expression using normal scores and robustly adapts to changing variability across average expression levels. This approach can be the basis for clustering and other exploratory methods. Our algorithm also assigns p-values that are sensitive to changes in variability with gene expression. Together, these two features expand the repertoire of genes that can be analyzed with DNA arrays.
Yi Lin, Samuel T. Nadler, Hong Lan, Alan D. Attie, Brian S. Yandell
14. MAANOVA: A Software Package for the Analysis of Spotted cDNA Microarray Experiments
Abstract
We describe a software package called MAANOVA (MicroArray ANalysis Of VAriance). MAANOVA is a collection of functions for statistical analysis of gene expression data from two-color cDNA microarray experiments. It is available in both the Matlab and R programming environments and can be run on any platform that supports these packages. MAANOVA allows the user to assess data quality, apply data transformations, estimate relative gene expression from designed experiments with ANOVA models, evaluate and interpret ANOVA models, formally test for differential expression of genes and estimate false-discovery rates, produce graphical summaries of expression patterns, and perform cluster analysis with bootstrapping. The development of MAANOVA was motivated by the need to analyze microarray data that arise from sophisticated designed experiments. MAANOVA provides specialized functions for microarray analysis in an open-ended format within flexible computing environments. MAANOVA functions can be used alone or in combination with other functions for the rigorous statistical analysis of microarray data.
Hao Wu, M. Kathleen Kerr, Xiangqin Cui, Gary A. Churchill
15. GeneClust
Abstract
Two-way clustering techniques (such as hierarchical clustering, K-means clustering, tree-structured vector quantization, self-organizing maps, and principal components analysis) have been used to organize genes into groups or “clusters” with similar behavior across relevant tissue samples or cell lines. However, these procedures seek a single global reordering of the samples or cell lines for all genes, and although they are effective in uncovering gross global structure, they are much less effective when applied to more complex clustering patterns (for example, where there are overlapping gene clusters). This chapter describes gene shaving (Hastie et al., 2000), a simple but effective method for identifying subsets of genes with coherent expression patterns and large variations across samples or conditions. After summarizing the gene-shaving methodology, we describe two software packages implementing the method: a small package written in S (usable in either S-Plus or R) and a considerably faster, mixed-language implementation with a graphical user interface intended for more applied use. The package can perform unsupervised, fully supervised, or partially supervised gene shaving, and the user is able to specify various parameters pertinent to the algorithm. The package outputs graphical representations of the extracted clusters (as colored heat maps) and diagnostic statistics. We then demonstrate how the latter tool can be used to analyze two published datasets (the Alon colon data and the NCI60 data).
Kim-Anh Do, Bradley Broom, Sijin Wen
16. POE: Statistical Methods for Qualitative Analysis of Gene Expression
Abstract
In many gene expression studies, the goals include discovery of novel biological classes and identification of genes whose expression can reliably be associated with these classes. Here we present a statistical analysis approach to facilitate both of these goals. The key idea is to model gene expression using latent categories that can be interpreted as a gene being turned “on” or “off” compared to a baseline level of expression. This three-way categorization is used for defining a reference in the unsupervised setting, for removing noise prior to clustering, for defining molecular subclasses in a way that is portable across platforms, and for defining easily interpretable probability-based distance measures for visualization, mining, and clustering.
Elizabeth S. Garrett, Giovanni Parmigiani
17. Bayesian Decomposition
Abstract
Gene chips and gene expression microarrays offer the opportunity to study biological systems on a genome-wide basis, exploring the full transcriptional response in an experiment or therapy. Because of the complexity of living organisms, these transcriptional responses are complex, with multiple, overlapping groups of genes being expressed in response to continuing internal and external stimuli. In order to use expression measurements to identify upstream modifications in signaling pathways, it is necessary to disentangle these overlapping responses. Bayesian Decomposition provides a method of identifying such overlap and correctly assigning genes to multiple groups, allowing easier identification of pathway modifications. Here the results of the application of Bayesian Decomposition to cell cycle data are shown.
Michael F. Ochs
18. Bayesian Clustering of Gene Expression Dynamics
Abstract
This chapter presents a Bayesian method for model-based clustering of gene expression dynamics and a program implementing it. The method represents gene expression dynamics as autoregressive equations and uses an agglomerative procedure to search for the most probable set of clusters, given the available data. The main contributions of this approach are the ability to take into account the dynamic nature of gene expression time series during clustering and an automated, principled way to decide when two series are different enough to belong to different clusters. The reliance of this method on an explicit statistical representation of gene expression dynamics makes it possible to use standard statistical techniques to assess the goodness of fit of the resulting model and validate the underlying assumptions. A set of gene expression time series, collected to study the response of human fibroblasts to serum, is used to illustrate the properties of the method and the functionality of the program.
Paola Sebastiani, Marco Ramoni, Isaac S. Kohane
19. Relevance Networks: A First Step Toward Finding Genetic Regulatory Networks Within Microarray Data
Abstract
An increasing number of methodologies are available for finding functional genomic clusters in RNA expression data. In this chapter, we describe a technique, termed relevance networks, that computes comprehensive pairwise measures of similarity for all genes in such a dataset. Associations with high positive or negative measures are saved and displayed in a graph-network-type diagram. Advantages of this method over others include: (1) negative associations (e.g., those from tumor suppressing genes) are shown; (2) disparate data types can be included (i.e., clinical, expression, and phenotypic); and (3) multiple connections are allowed (e.g., a transcription factor may be responsible for regulating the expression of multiple other genes). Java-based software is available for academic use to construct relevance networks, and operation of the software is also explained in this chapter.
Atul J. Butte, Isaac S. Kohane
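
The relevance-network construction summarized in the chapter 19 abstract (compute a similarity measure for every pair of genes, then keep the pairs whose measure is strongly positive or strongly negative) can be sketched in a few lines of Python. The abstract does not say which similarity measure the software uses. Pearson correlation is assumed here purely for illustration, and the function names, gene names, profiles, and the 0.9 cutoff are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def relevance_edges(profiles: Dict[str, List[float]],
                    threshold: float) -> List[Tuple[str, str, float]]:
    """Return (gene1, gene2, r) for every pair with |r| >= threshold.

    Thresholding the absolute value keeps strong negative associations
    as well as strong positive ones.
    """
    edges = []
    for (g1, x), (g2, y) in combinations(profiles.items(), 2):
        r = pearson(x, y)
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Made-up expression profiles for four genes across four samples.
profiles = {
    "gene_A": [1.0, 2.0, 3.0, 4.0],
    "gene_B": [2.0, 4.0, 6.0, 8.0],   # rises with gene_A
    "gene_C": [4.0, 3.0, 2.0, 1.0],   # falls as gene_A rises
    "gene_D": [1.0, 3.0, 1.0, 3.0],   # weakly related to the others
}

edges = relevance_edges(profiles, threshold=0.9)
for g1, g2, r in edges:
    print(f"{g1} -- {g2}: r = {r:+.2f}")
```

Because each gene can appear in any number of retained pairs, the resulting edge list allows the multiple connections per gene that the abstract names as an advantage.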
Backmatter
Metadata
Title
The Analysis of Gene Expression Data
Edited by
Giovanni Parmigiani
Elizabeth S. Garrett
Rafael A. Irizarry
Scott L. Zeger
Copyright year
2003
Publisher
Springer New York
Electronic ISBN
978-0-387-21679-9
Print ISBN
978-0-387-95577-3
DOI
https://doi.org/10.1007/b97411