2005 | Book

Bioinformatics and Computational Biology Solutions Using R and Bioconductor

Edited by: Robert Gentleman, Vincent J. Carey, Wolfgang Huber, Rafael A. Irizarry, Sandrine Dudoit

Publisher: Springer New York

Book series: Statistics for Biology and Health

About this book

Bioconductor is a widely used open source and open development software project for the analysis and comprehension of data arising from high-throughput experimentation in genomics and molecular biology. Bioconductor is rooted in the open source statistical computing environment R.

This volume's coverage is broad and ranges across most of the key capabilities of the Bioconductor project, including:

Importation and preprocessing of high-throughput data from microarray, proteomic, and flow cytometry platforms

Curation and delivery of biological metadata for use in statistical modeling and interpretation

Statistical analysis of high-throughput data, including machine learning and visualization

Modeling and visualization of graphs and networks

The developers of the software, who are in many cases leading academic researchers, jointly authored chapters. All methods are illustrated with publicly available data, and a major section of the book is devoted to exposition of fully worked case studies.

This book is more than a static collection of descriptive text, figures, and code examples that were run by the authors to produce the text; it is a dynamic document. Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.

Table of contents

Frontmatter

Preprocessing data from genomic experiments

1. Preprocessing Overview
Abstract
In this chapter, we give a brief overview of the tasks of microarray data preprocessing. There are a variety of microarray technology platforms in use, and each of them requires specific considerations. These are described in detail in other chapters in this part of the book. This overview chapter describes relevant data structures and provides some broadly applicable theoretical background.
W. Huber, R. A. Irizarry, R. Gentleman
2. Preprocessing High-density Oligonucleotide Arrays
Abstract
High-density oligonucleotide expression arrays are a widely used microarray platform. Affymetrix GeneChip arrays dominate this market. An important distinction between the GeneChip and other technologies is that on GeneChips, multiple short probes are used to measure gene expression levels. This makes preprocessing particularly important when using this platform. This chapter begins by describing how to import probe-level data into the system and how these data can be examined using the facilities of the AffyBatch class. Then we will describe background adjustment, normalization, and summarization methods. Functionality for GeneChip probe-level data is provided by the affy, affyPLM, affycomp, gcrma, and affypdnn packages. All these tools are useful for preprocessing probe-level data stored in an AffyBatch object into expression-level data stored in an exprSet object. Because there are many competing methods for this preprocessing step, it is useful to have a way to assess the differences. In Bioconductor, this can be carried out using the affycomp package, which we discuss briefly.
B. M. Bolstad, R. A. Irizarry, L. Gautier, Z. Wu
3. Quality Assessment of Affymetrix GeneChip Data
Abstract
This chapter covers quality assessment for Affymetrix GeneChip data. The focus is on procedures available from the affy and affyPLM packages. First, some exploratory plots provided by the affy package, including images of the raw probe-level data, boxplots, histograms, and M vs A plots, are examined. Next, methods for assessing RNA degradation are discussed; specifically, we compare the standard procedures recommended by Affymetrix and RNA degradation plots. Finally, we investigate how appropriate probe-level models yield good quality assessment tools. Chip pseudo-images of residuals and weights obtained from fitting robust linear models to the probe-level data can be used as a visual tool for identifying artifacts on GeneChip microarrays. Other output from the probe-level modeling tools provides summary plots that may be used to identify aberrant chips.
B. M. Bolstad, F. Collin, J. Brettschneider, K. Simpson, L. Cope, R. A. Irizarry, T.P. Speed
4. Preprocessing Two-Color Spotted Arrays
Abstract
Preprocessing of two-color spotted arrays can be broadly divided into two main categories: quality assessment and normalization. In this chapter, we will focus on functions from the arrayQuality and marray packages that perform these tasks. The chapter begins by describing various data structures and tools available in these packages for reading and storing primary data from two-color spotted arrays. This is followed by descriptions of various exploratory tools such as MA-plots, spatial plots, and boxplots to assess the data quality of an array. Finally, algorithms available for performing appropriate normalization to remove sources of systematic variation are discussed. We will illustrate the above-mentioned functions using a case study.
Y. H. Yang, A. C. Paquet
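The MA-plots mentioned in the Chapter 4 abstract rest on a simple transformation of the two channel intensities: M is the log ratio and A is the average log intensity. The following is a minimal sketch in plain Python, not code from the book or from the marray package; the function name `ma_values` and the toy intensities are assumptions made for illustration.

```python
import math

def ma_values(red, green):
    """Return (M, A) pairs for paired red/green spot intensities.

    M = log2(R) - log2(G)        (log ratio)
    A = (log2(R) + log2(G)) / 2  (average log intensity)
    """
    pairs = []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        pairs.append((log_r - log_g, (log_r + log_g) / 2))
    return pairs

# Hypothetical intensities for three spots (illustrative values only).
red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
ma = ma_values(red, green)
```

Plotting M against A for every spot on an array gives the MA-plot; normalization methods typically aim to remove intensity-dependent trends in M.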
5. Cell-Based Assays
Abstract
This chapter describes methods and tools for processing and visualizing data from high-throughput cell-based assays. Such assays are used to examine the contribution of genes to a biological process or phenotype (Carpenter and Sabatini, 2004). In principle, this can be done for any gene or combination of genes and for any biological process of interest. There is a variety of technologies, but all of them rely on the availability of genomic resources such as whole genome sequences, full-length cDNA libraries, and siRNA collections, or on libraries of protein-specific ligands (compounds). Typically, all or at least large parts of the experimental procedures and data collection are automated. Cell-based assays offer the potential for clustering of genes based on their functional profiles (Piano et al., 2002) and epistatic analyses to elucidate complex genetic networks (Tong et al., 2004).
W. Huber, F. Hahne
6. SELDI-TOF Mass Spectrometry Protein Data
Abstract
The term proteome is used to denote the set of proteins encoded by a genome, and proteomics is the study of the expression and interactions of the proteins, which can depend on many factors such as cell type, treatment, tissue type, developmental state, and disease state. Conceptually, this is similar to the transcriptomics technologies discussed in Chapters 2-4; however, due to the more complicated chemistry of proteins, compared to RNA, the field has a different and diverse set of technologies and produces a wide range of specific challenges. Here we discuss one particular mass spectrometry technology.
X. Li, R. Gentleman, X. Lu, Q. Shi, J.D. Iglehart, L. Harris, A. Miron

Meta-data: biological annotation and visualization

7. Meta-data Resources and Tools in Bioconductor
Abstract
Closing the gap between knowledge of sequence and knowledge of function requires aggressive, integrative use of biological research databases of many different types. For greatest effectiveness, analysis processes and interpretation of analytic results must be guided using relevant knowledge about the systems under investigation. However, this knowledge is often widely scattered and encoded in a variety of formats. In this section, we consider some of the different sources of biological information as well as the software tools that can be used to access these data and to integrate them into an analysis. Bioconductor provides tools for creating, distributing, and accessing annotation resources in ways that have been found effective in workflows for statistical analysis of microarray and other high-throughput assays.
R. Gentleman, V. J. Carey, J. Zhang
8. Querying On-line Resources
Abstract
Many different meta-data resources are available on-line, and several of these provide a Web services model for interactions. R and Bioconductor support the use of different technologies (including HTTP, SOAP, and XML-RPC) for accessing different Web services. In this chapter we describe the tools for accessing Web services and demonstrate their use in a number of examples.
Our view is very similar to that proposed by Stein (2002), who emphasized Web services as the basic computational resource for bioinformatics. Well-designed Web services will play an essential role in solving many bioinformatic problems and R has the capability of playing many different roles, both on the client and the server side.
V. J. Carey, D. Temple Lang, J. Gentry, J. Zhang, R. Gentleman
9. Interactive Outputs
Abstract
In this chapter, we discuss creation of interactive outputs. We focus on the generation of reports, marked up in HTML, that link sets of genes with on-line resources, such as those supplied by the EBI or the NCBI, and which can be shared between different investigators. We discuss both the simple creation of these pages as well as some of the underlying software tools that can be used to construct new and different outputs. Although linked Web pages form the most commonly used outputs, we also consider some other tools that can be used to produce Web graphics that respond to the mouse in different ways.
C. A. Smith, W. Huber, R. Gentleman
10. Visualizing Data
Abstract
Visualization is an essential part of exploring, analyzing, and reporting data. Visualizations are used in all chapters in this monograph and in most scientific papers. Here we review some of the recurring concepts in visualizing genomic and biological data. We discuss scatterplots to investigate the dependency between pairs of variables, heatmaps for the visualization of matrix-like data, the visualization of distance relationships between objects, and the visualization of data along genomic coordinates.
W. Huber, X. Li, R. Gentleman

Statistical analysis for genomic experiments

11. Analysis Overview
Abstract
Chapters in this part of the book address tasks common in the downstream analysis (after preprocessing) of high-dimensional data. The basic assumption is that preprocessing has led to a sample for which it is reasonable to make comparisons between samples or between feature-vectors assembled across samples. Most examples are based on microarray data, but the principles are much broader and apply to many other sources of data. In this overview, the basic concepts and assumptions are briefly sketched.
V. J. Carey, R. Gentleman
12. Distance Measures in DNA Microarray Data Analysis
Abstract
Both supervised and unsupervised machine learning techniques require selection of a measure of distance between, or similarity among, the objects to be classified or clustered. Different measures of distance or similarity will lead to different machine learning performance. The appropriateness of a distance measure will typically depend on the types of features being used in the learning process.
In this chapter, we examine the properties of distance measures in the context of the analysis of gene expression data from DNA microarray experiments. The feature vectors represent transcript levels, i.e., mRNA abundance or relative abundance, either across biological samples (if comparing genes) or across genes (if comparing samples).
We consider different aspects of distances that help address the heterogeneity of the data and differences in interpretation depending on the source of the data (cDNA arrays versus short oligonucleotide arrays). Traditional measures, such as Euclidean and Manhattan distances as well as correlation-based distances, are considered. Other dissimilarity functions, which involve comparisons of distributions based on the Kullback-Leibler and mutual information criteria, are also examined.
R. Gentleman, B. Ding, S. Dudoit, J. Ibrahim
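The traditional measures named in the Chapter 12 abstract (Euclidean, Manhattan, and correlation-based distances) can be sketched in a few lines. This is an illustrative plain-Python sketch, not code from the book; the function names and the two toy expression profiles are assumptions, and the correlation-based distance shown is the common 1 minus Pearson correlation form.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """1 minus the Pearson correlation of the two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)

# Hypothetical expression profiles of two genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

d_euclid = euclidean(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
d_corr = correlation_distance(gene_a, gene_b)
```

The two profiles differ in magnitude, so the Euclidean and Manhattan distances are large, yet they are perfectly correlated, so the correlation-based distance is essentially zero. This illustrates why the choice of measure can change which genes appear similar.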
13. Cluster Analysis of Genomic Data
Abstract
We provide an overview of existing partitioning and hierarchical clustering algorithms in R. We discuss statistical issues and methods in choosing the number of clusters, the choice of clustering algorithm, and the choice of dissimilarity matrix. We also show how to visualize a clustering result by plotting ordered dissimilarity matrices in R. A new R package hopach, which implements the Hierarchical Ordered Partitioning And Collapsing Hybrid (HOPACH) algorithm, is presented (van der Laan and Pollard, 2003). The methodology is applied to a renal cell cancer gene expression data set.
K. S. Pollard, M. J. van der Laan
14. Analysis of Differential Gene Expression Studies
Abstract
In this chapter, we focus on the analysis of differential gene expression studies. Many microarray studies are designed to detect genes associated with different phenotypes, for example, the comparison of cancer tumors and normal cells. In some multifactor experiments, genetic networks are perturbed with various treatments to understand the effects of those treatments and their interactions with each other in the dynamic cellular network. For even the simplest experiments, investigators must consider several issues for appropriate gene selection. We discuss strategies for gene-at-a-time analyses, nonspecific and meta-data driven prefiltering techniques, and commonly used test statistics for detecting differential expression. We show how these strategies and statistical tools are implemented and used in Bioconductor. We also demonstrate the use of factorial models for probing complex biological systems and highlight the importance of carefully coordinating known cellular behavior with statistical modeling to make biologically relevant inference from microarray studies.
D. Scholtens, A. von Heydebreck
15. Multiple Testing Procedures: the multtest Package and Applications to Genomics
Abstract
The Bioconductor R package multtest implements widely applicable resampling-based single-step and stepwise multiple testing procedures (MTP) for controlling a broad class of Type I error rates. The current version of multtest provides MTPs for tests concerning means, differences in means, and regression parameters in linear and Cox proportional hazards models. Typical testing scenarios are illustrated by applying various MTPs implemented in multtest to the Acute Lymphoblastic Leukemia (ALL) data set of Chiaretti et al. (2004), with the aim of identifying genes whose expression measures are associated with (possibly censored) biological and clinical outcomes.
K. S. Pollard, S. Dudoit, M. J. van der Laan
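To give a feel for what a stepwise multiple testing procedure does, the sketch below computes Holm step-down adjusted p-values, which control the family-wise error rate. It is a deliberately simpler, marginal procedure swapped in for illustration: it is not the resampling-based methodology that the Chapter 15 abstract describes, and the function name `holm_adjust` and the raw p-values are assumptions.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the hypotheses sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for four genes (illustrative values only).
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = holm_adjust(raw_p)
```

A gene is declared significant at level alpha when its adjusted p-value is at most alpha. Resampling-based procedures pursue the same goal while also estimating the joint distribution of the test statistics.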
16. Machine Learning Concepts and Tools for Statistical Genomics
Abstract
In this chapter, supervised machine learning methods are described in the context of microarray applications. The most widely used families of machine learning methods are described, along with various approaches to learner assessment. The Bioconductor interfaces to machine learning tools are described and illustrated. Key problems of model selection and interpretation are reviewed in examples.
V. J. Carey
17. Ensemble Methods of Computational Inference
Abstract
Prognostic modeling of tumor classes, disease status, and survival time based on information obtained from gene expression profiling techniques is studied in this chapter. The basic principles of ensemble methods like bagging, random forests, and boosting are explained. The application of those methods to data from patients suffering acute lymphoblastic leukemia or renal cell cancer is illustrated. The problem of identifying the best method for a certain prediction task is addressed by means of benchmark experiments.
T. Hothorn, M. Dettling, P. Bühlmann
18. Browser-based Affymetrix Analysis and Annotation
Abstract
webbioc is a CGI-based interface to Bioconductor methods for preprocessing and analyzing Affymetrix data. It wraps up the functionality of a number of Bioconductor packages into a consistent environment that can be deployed for use by small groups or large departments. Without ever seeing a command prompt, the user is taken from raw data to annotated lists of the most significantly differentially expressed genes. It can optionally make use of a back-end computer cluster for batch processing. This chapter discusses the appropriate circumstances under which webbioc should be deployed and the pros and cons of using it versus the typical command line environment of R. Installation and configuration are fully covered. Use of the Web-based interface is visually demonstrated. Finally, we describe how to expand the interface by adding additional analysis modules.
C. A. Smith

Graphs and networks

19. Introduction and Motivating Examples
R. Gentleman, W. Huber, V. J. Carey
20. Graphs
Abstract
In this chapter, we describe and discuss various definitions and algorithms for graphs, their representation, and uses. The presentation is formal and we leave references to software and usage for the later chapters. Our goal is to use graphs to explore, navigate, represent, and model biological data. Hence, we must often specialize general concepts and ideas to the tasks at hand. Some of our motivation is taken from the area of social network analysis where many similar problems have been considered and there is a rich history of both concepts and methods.
W. Huber, R. Gentleman, V. J. Carey
21. Bioconductor Software for Graphs
Abstract
We describe software tools for creating, manipulating, and visualizing graphs in the Bioconductor project. We give the rationale for our design decisions and provide brief outlines of how to make use of these tools. The discussion mirrors that of Chapter 20 where the different mathematical constructs were described. It is worth differentiating between packages that are mainly infrastructure (sets of tools that can be used to create other pieces of software) and packages that are designed to provide an end-user application. The packages graph, RBGL, and Rgraphviz are infrastructure packages. Software developers may use these packages to construct tools aimed at specific application areas, such as the GOstats package.
V. J. Carey, R. Gentleman, W. Huber, J. Gentry
22. Case Studies Using Graphs on Biological Data
Abstract
In this chapter we consider four specific data-analytic and inferential problems that can be addressed using graphs. We demonstrate the use of the software and methods described in Chapters 20 and 21 on real problems in computational biology. We will show how one can investigate relationships between gene expression and protein-protein interaction data, how GO annotations can be used to analyze gene sets, how literature citations can be related to experimental data, and how gene expression data can be mapped on pathways.
R. Gentleman, D. Scholtens, B. Ding, V. J. Carey, W. Huber

Case studies

23. limma: Linear Models for Microarray Data
Abstract
A survey is given of differential expression analyses using the linear modeling features of the limma package. The chapter starts with the simplest replicated designs and progresses through experiments with two or more groups, direct designs, factorial designs and time course experiments. Experiments with technical as well as biological replication are considered. Empirical Bayes test statistics are explained. The use of quality weights, adaptive background correction and control spots in conjunction with linear modelling is illustrated on the β7 data.
G. K. Smyth
24. Classification with Gene Expression Data
Abstract
A survey is given of tasks related to the construction and evaluation of classifiers applied to a renal cell cancer data set. Balanced sample splitting, non-specific filtering, linear discriminant analysis, nearest-neighbor prediction, and support vector machines are all concretely illustrated using the MLInterfaces package. Evaluations based on single and multiple random splits of data are compared. The entire presentation is given in a very generic programming format, to facilitate the adaptation and variation, by other investigators, of the techniques used here.
M. Dettling
25. From CEL Files to Annotated Lists of Interesting Genes
Abstract
One of the most popular applications of microarray technology is the identification of genes that are differentially expressed in two populations. With Affymetrix GeneChip technology, there are several steps between hybridization and the selection of interesting genes. The steps of preprocessing to improve signal-to-noise ratios, choosing a summary statistic for appropriate ranking of genes, and deciding on a final filter for candidate genes are largely statistical in nature. In this chapter, we demonstrate Bioconductor tools useful for creating such lists. We start from the raw probe-level data (CEL files) and conclude with the creation of annotated reports.
R. A. Irizarry
Backmatter
Metadata
Title
Bioinformatics and Computational Biology Solutions Using R and Bioconductor
Edited by
Robert Gentleman
Vincent J. Carey
Wolfgang Huber
Rafael A. Irizarry
Sandrine Dudoit
Copyright year
2005
Publisher
Springer New York
Electronic ISBN
978-0-387-29362-2
Print ISBN
978-0-387-25146-2
DOI
https://doi.org/10.1007/0-387-29362-0