2011 | Book

Handbook of Statistical Bioinformatics

Edited by: Henry Horng-Shing Lu, Bernhard Schölkopf, Hongyu Zhao

Publisher: Springer Berlin Heidelberg

Book series: Springer Handbooks of Computational Statistics

About this book

Numerous fascinating breakthroughs in biotechnology have generated large volumes and diverse types of high throughput data that demand the development of efficient and appropriate tools in computational statistics integrated with biological knowledge and computational algorithms. This volume collects contributed chapters from leading researchers to survey the many active research topics and promote the visibility of this research area. This volume is intended to provide an introductory and reference book for students and researchers who are interested in the recent developments of computational statistics in computational biology.

Table of contents

Frontmatter

Sequence Analysis

Frontmatter
Chapter 1. Accuracy Assessment of Consensus Sequence from Shotgun Sequencing
Abstract
The significance of any genetic or biological implication based on DNA sequencing depends on its accuracy. The statistical evaluation of accuracy requires a probabilistic model of measurement error. In this chapter, we describe two statistical models of sequence assembly from shotgun sequencing, for haploid and diploid target genomes respectively. The first model allows us to convert quality scores into probabilities. It combines quality scores of base-calling and the power of alignment to improve sequencing accuracy. Specifically, we start with assembled contigs and represent probabilistic errors by logistic models that take quality scores and other genomic features as covariates. Since the true sequence is unknown, an EM algorithm is used to deal with missing data. The second model describes the case in which DNA reads come from a diploid genome, and our aim is to reconstruct the two haplotypes including phase information. The statistical model consists of sequencing errors, compositional information and haplotype memberships of each DNA fragment. Consequently, optimal haplotype sequences can be inferred by maximizing the probability among all configurations conditional on the given assembly. In the meantime, this probability together with the coverage information provides an assessment of the confidence for the reconstruction.
Lei M. Li
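To make the idea of converting quality scores into probabilities concrete, here is a minimal sketch of the widely used Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10). This is an assumption chosen for illustration: the chapter describes logistic models with additional covariates, which are not reproduced here, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Standard Phred convention: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p: float) -> float:
    """Inverse mapping: Q = -10 * log10(P(error))."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must lie in (0, 1]")
    return -10.0 * math.log10(p)


if __name__ == "__main__":
    # Illustrative scores only; not data from the chapter.
    for q in (10, 20, 30, 40):
        print(f"Q={q}: P(error)={phred_to_error_prob(q):.4g}")
```

A model of the kind the abstract describes would replace this fixed mapping with probabilities estimated from the data.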
Chapter 2. Statistical and Computational Studies on Alternative Splicing
Abstract
The accumulating genome sequences and other high-throughput data have shed light on the extent and importance of alternative splicing in functional regulation. Alternative splicing dramatically increases the transcriptome and proteome diversity of higher organisms by producing multiple splice variants from different combinations of exons. It has an important role in many biological processes including nervous system development and programmed cell death. Many human diseases including cancer arise from defects in alternative splicing and its regulation. This chapter reviews statistical and computational methods on genome-wide alternative splicing studies.
Liang Chen
Chapter 3. Statistical Learning and Modeling of TF-DNA Binding
Abstract
Discovering binding sites and motifs of specific TFs is an important first step towards the understanding of gene regulation circuitry. Computational approaches have been developed to identify transcription factor binding sites from a set of co-regulated genes. Recently, the abundance of gene expression data, ChIP-based TF-binding data (ChIP-array/seq), and high-resolution epigenetic maps have brought up the possibility of capturing sequence features relevant to TF-DNA interactions so as to improve the predictive power of gene regulation modeling. In this chapter, we introduce some statistical models and computational strategies used to predict TF-DNA interactions from the DNA sequence information, and describe a general framework of predictive modeling approaches to the TF-DNA binding problem, which includes both traditional regression methods and statistical learning methods by selecting relevant sequence features and epigenetic markers.
Bo Jiang, Jun S. Liu
Chapter 4. Computational Promoter Prediction in a Vertebrate Genome
Abstract
Computational prediction of vertebrate gene promoters from genomic DNA sequences is one of the most difficult problems in computational genomics, but it is essential for understanding genome organization, improving gene annotation and for further comprehensive studies of gene expression and regulation networks. The advent of new genomic technologies has ushered in an era of deeper understanding of molecular biology at the systems level, and more accurate and diverse large-scale molecular data have been fueling the development of new predictive methods and computational tools in this rapidly moving field. In this chapter, I will give an introduction to the structure and function of promoters in typical vertebrate genes, as well as experimental methods for determining them. I then describe generic statistical methods for promoter prediction and a few computational approaches as examples. I will further review more recent advances in promoter prediction methodologies and give a future outlook in the conclusion.
Michael Q. Zhang
Chapter 5. Discovering Influential Variables: A General Computer Intensive Method for Common Genetic Disorders
Abstract
We describe a general backward partition method for discovering which of a large number of possible explanatory variables influence a dependent variable Y. This method, based on a variant pioneered by Lo and Zheng, and its variations have been used successfully in several biological problems, some of which are discussed here. The problem is an example of feature or variable selection. Although the objective, to understand which are the influential variables, is often not the same as classification, the method has been successfully applied to that problem too.
Tian Zheng, Herman Chernoff, Inchi Hu, Iuliana Ionita-Laza, Shaw-Hwa Lo
Chapter 6. STORMSeq: A Method for Ranking Regulatory Sequences by Integrating Experimental Datasets with Diverse Computational Predictions
Abstract
We present STORMSeq (STructured ranking of Regulatory Motifs and Sequences), a novel probabilistic method for sequence search in which we learn to rank sequences using heterogeneous experimental datasets and the outputs of diverse computational prediction methods. By formulating the problem of sequence search as one of ranking, STORMSeq largely avoids issues of model misspecification and complex inference which arise when modelling different types of datasets in the presence of many hidden variables. The framework allows one to compare orderings over sequences conveyed by diverse types of data, though the data measurements and scoring systems may be difficult to compare to one another. We demonstrate STORMSeq in the contexts of scoring sequences bound by transcription factors and for the problem of finding microRNA targets in human retinoblastomas where in the latter case we can combine mRNA and microRNA expression with protein abundances and sequence data. We will show for both of these problems that (a) by accounting for the dependencies inherent in learning to rank and (b) by incorporating multiple datasets with computational predictions, we can improve the accuracy with which we rank sequences compared to standard methods. Our method is general and can be applied to a wide variety of other problems in which heterogeneous data sets are available, such as ranking therapeutic drug targets and discovery of genetic associations to disease.
Jim C. Huang, Brendan J. Frey
Chapter 7. Mixture Tree Construction and Its Applications
Abstract
A new method for building a gene tree from Single Nucleotide Polymorphism (SNP) data was developed by Chen and Lindsay (Biometrika 93(4):843–860, 2006). Called the mixture tree, it was based on an ancestral mixture model. The sieve parameter in the model plays the role of time in the evolutionary tree of the sequences. By varying the sieve parameter, one can create a hierarchical tree that estimates the population structure at each fixed backward point in time. In this chapter, we will review the model and then present an application to the clustering of the mitochondrial sequences to show that the approach performs well. A simulator that simulates real SNP sequences with unknown ancestral history will be introduced. Using the simulator, we will compare the mixture trees with true trees to evaluate how well the mixture tree method performs. Comparisons with some existing methods, including the neighbor-joining method and the maximum parsimony method, will also be presented in this chapter.
Grace S. C. Chen, Mingze Li, Michael Rosenberg, Bruce Lindsay

Expression Data Analysis

Frontmatter
Chapter 8. Experimental Designs and ANOVA for Microarray Data
Abstract
Microarray experiments are complex, multistep processes that represent a considerable investment of time and resources. Proper experimental design and analysis are critical to the success of a microarray experiment, and must be considered early in the planning of the experiment. Many aspects of experimental design from low-throughput experiments, such as randomization, replication, and blocking, remain applicable to microarray experiments as well. Similarly, the analysis of variance (ANOVA) remains a valid approach for analyzing data from most microarray experiments. However, the high-dimensional nature of microarrays introduces additional considerations into the design and analysis. This chapter provides an overview of the unique statistical challenges presented by microarrays and describes computational methods for implementing these statistical algorithms.
Richard E. Kennedy, Xiangqin Cui
Chapter 9. The MicroArray Quality Control (MAQC) Project and Cross-Platform Analysis of Microarray Data
Abstract
As a powerful tool for genome-wide gene expression analysis, DNA microarray technology is widely used in biomedical research. One important application of microarrays is to identify differentially expressed genes (DEGs) between two distinct biological conditions, e.g. disease versus normal or treatment versus control, so that the underlying molecular mechanism differentiating the two conditions may be revealed. Mechanistic interpretation of microarray results requires the identification of reproducible and reliable lists of DEGs, because irreproducible lists of DEGs may lead to different biological conclusions. Many vendors are providing microarray platforms of different characteristics for gene expression analysis, and the widely publicized apparent lack of intra- and cross-platform concordance in DEGs from microarray analysis of the same sets of study samples has been of great concern to the scientific community and regulatory agencies like the US Food and Drug Administration (FDA). In this chapter, we describe the study design of and the main findings from the FDA-led MicroArray Quality Control (MAQC) project that aims to objectively assess the performance of different microarray platforms and the advantages and limitations of various competing statistical methods in identifying DEGs from microarray data. Using large data sets generated on two human reference RNA samples established by the MAQC project, we show that the levels of concordance in inter-laboratory and cross-platform comparisons are generally high. Furthermore, the levels of concordance largely depend on the statistical criteria used for ranking and selecting DEGs, irrespective of the chosen platforms or test sites. Importantly, a straightforward method combining fold-change ranking with a non-stringent P-value cutoff produces more reproducible lists of DEGs than those by t-test P-value ranking. Similar conclusions are reached when microarray data sets from a rat toxicogenomics study are analyzed.
The availability of the MAQC reference RNA samples and the large reference data sets provides a unique resource for the gene expression community to reach consensus on the “best practices” for the generation, analysis, and applications of microarray data in drug development and personalized medicine.
Zhining Wen, Zhenqiang Su, Jie Liu, Baitang Ning, Lei Guo, Weida Tong, Leming Shi
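The selection rule named in the Chapter 9 abstract, fold-change ranking combined with a non-stringent P-value cutoff, can be sketched roughly as follows. The function name, the default cutoff of 0.05 and the toy data are all assumptions made for illustration; the abstract does not specify the thresholds used in the MAQC analyses.

```python
from typing import List, Optional, Sequence


def rank_degs(
    genes: Sequence[str],
    log2_fold_changes: Sequence[float],
    p_values: Sequence[float],
    p_cutoff: float = 0.05,  # assumed "non-stringent" cutoff, not from the source
    top_n: Optional[int] = None,
) -> List[str]:
    """Keep genes with p < p_cutoff, then rank them by absolute log2 fold change."""
    kept = [
        (gene, fc)
        for gene, fc, p in zip(genes, log2_fold_changes, p_values)
        if p < p_cutoff
    ]
    # Largest absolute fold change first.
    kept.sort(key=lambda item: abs(item[1]), reverse=True)
    names = [gene for gene, _ in kept]
    return names if top_n is None else names[:top_n]


if __name__ == "__main__":
    # Hypothetical toy data, not MAQC measurements.
    genes = ["g1", "g2", "g3", "g4"]
    log2_fc = [0.2, -2.5, 1.8, 3.0]
    p_vals = [0.001, 0.03, 0.04, 0.20]
    print(rank_degs(genes, log2_fc, p_vals))
```

In this toy example g4 has the largest fold change but is dropped by the P-value filter, while g1 has the smallest P-value but ranks last because its fold change is small. That contrast is the difference from pure t-test P-value ranking.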
Chapter 10. A Survey of Classification Techniques for Microarray Data Analysis
Abstract
With the recent advance of biomedical technology, a lot of ‘OMIC’ data from genomic, transcriptomic, and proteomic domains can now be collected quickly and cheaply. One such technology is the microarray technology which allows researchers to gather information on expressions of thousands of genes all at the same time. With the large amount of data, a new problem surfaces – how to extract useful information from them. Data mining and machine learning techniques have been applied in many computer applications for some time. It would be natural to use some of these techniques to assist in drawing inference from the volume of information gathered through microarray experiments. This chapter is a survey of common classification techniques and related methods to increase their accuracies for microarray analysis based on data mining methodology. Publicly available datasets are used to evaluate their performance.
Wai-Ki Yip, Samir B. Amin, Cheng Li
Chapter 11. Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies
Abstract
In this chapter, we focus on statistical questions raised by the identification of copy number alterations in tumor samples using genotyping microarrays, also known as Single Nucleotide Polymorphism (SNP) arrays. We define the copy number states formally, and show how they are assessed by SNP arrays. We identify and discuss general and cancer-specific challenges for SNP array data preprocessing, and how they are addressed by existing methods. We review existing statistical methods for the detection of copy number changes along the genome. We describe the influence of two biological parameters – the proportion of normal cells in the sample, and the ploidy of the tumor – on observed data. Finally, we discuss existing approaches for the detection and calling of copy number aberrations in the particular context of cancer studies, and identify statistical challenges that remain to be addressed.
Pierre Neuvial, Henrik Bengtsson, Terence P. Speed
Chapter 12. Computational Analysis of ChIP-chip Data
Abstract
Chromatin immunoprecipitation coupled with genome tiling array hybridization, also known as ChIP-chip, is a powerful technology to identify protein-DNA interactions in genomes. It is widely used to locate transcription factor binding sites and histone modifications. Data generated by ChIP-chip provide important information on gene regulation. This chapter reviews fundamental issues in ChIP-chip data analysis. Topics include data preprocessing, background correction, normalization, peak detection and motif analysis. Statistical models and principles that significantly improve data analysis are discussed. Popular software tools are briefly introduced.
Hongkai Ji
Chapter 13. eQTL Mapping for Functional Classes of Saccharomyces cerevisiae Genes with Multivariate Sparse Partial Least Squares Regression
Abstract
The availability of high-throughput genotyping technologies and microarray assays has enabled investigation of genetic variations that influence levels of gene expression. Expression Quantitative Trait Loci (eQTL) mapping methods have been successfully used to identify the genetic basis of gene expression which in turn led to identification of candidate genes and construction of regulatory networks. One challenging statistical aspect of eQTL mapping is the existence of thousands of traits. We have recently proposed a multivariate sparse partial least squares framework for mapping multiple quantitative traits and identifying genetic variations that affect the expression of a group of genes. In this book chapter, we provide a comprehensive illustration of this methodology with a Saccharomyces cerevisiae linkage study. Data from this study involves segregants from a cross between two Saccharomyces cerevisiae strains. Our application focuses on elucidating genomic markers that affect expression of functional yeast gene classes. We illustrate identification of eQTL regions affecting whole functional classes of genes as well as eQTL regions influencing individual genes.
Dongjun Chung, Sündüz Keleş
Chapter 14. Statistical Analysis of Time Course Microarray Data
Abstract
Time course gene expression experiments have proved valuable in a variety of studies. Their unique data structure and the diversity of tasks often associated with them present new challenges to statistical analysis. In this report, we give a brief review of several primary questions pertaining to such experiments and popular statistical tools to address them.
Lingyan Ruan, Ming Yuan

Systems Biology

Frontmatter
Chapter 15. Kernel Methods in Bioinformatics
Abstract
Kernel methods have now witnessed more than a decade of increasing popularity in the bioinformatics community. In this article, we will compactly review this development, examining the areas in which kernel methods have contributed to computational biology and describing the reasons for their success.
Karsten M. Borgwardt
Chapter 16. Graph Classification Methods in Chemoinformatics
Abstract
Graphs are general and powerful data structures that can be used to represent diverse kinds of molecular objects such as chemical compounds, proteins, and RNAs. In recent years, computational analysis of tens of thousands of labeled graphs has become possible by advanced graph mining methods. For example, frequent pattern mining methods such as gSpan can enumerate all frequent subgraphs in a graph database efficiently. This chapter reviews basics of graph mining methodology and its application to chemoinformatics and bioinformatics. Graph classification and regression techniques based on subgraph patterns are also reviewed extensively.
Koji Tsuda
Chapter 17. Hidden Markov Random Field Models for Network-Based Analysis of Genomic Data
Abstract
Graphs and networks are common ways of depicting biological information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This kind of a priori use of graphs is a useful supplement to the standard numerical data such as microarray gene expression data and single nucleotide polymorphism (SNP) data. How to incorporate such prior network information into the analysis of numerical data raises interesting statistical problems. Representing the genetic networks as undirected graphs, we have developed several approaches for identifying differentially expressed genes and genes or SNPs associated with diseases in a unified framework of hidden Markov random field (HMRF) models. Different from the traditional empirical Bayes approaches for analysis of gene expression data, the HMRF-based models account for the prior dependency among the genes on the network and therefore effectively utilize the prior network information in identifying the subnetworks of genes that are perturbed by experimental conditions. In this paper, we briefly review the basic setup of the HMRF models and the emission probability functions for some problems often encountered in analysis of microarray gene expression and SNP data. We also present some interesting areas that require further research.
Hongzhe Li
Chapter 18. Review of Weighted Gene Coexpression Network Analysis
Abstract
We survey key concepts of weighted gene coexpression network analysis (WGCNA), also known as weighted correlation network analysis, and related data analysis strategies. We describe the construction of a weighted gene coexpression network from gene expression data, identification of network modules and integration of external data such as gene ontology information and clinical phenotype data. We review Differential Weighted Gene Coexpression Network Analysis (DWGCNA), a method for comparing and contrasting networks constructed from qualitatively different groups of samples. DWGCNA provides a means for measuring not only differential expression but also differential connectivity. Further, we show how to incorporate genetic marker data with expression data via Integrated Weighted Gene Coexpression Network Analysis (IWGCNA). Lastly, we describe R software implementing WGCNA methods.
Tova Fuller, Peter Langfelder, Angela Presson, Steve Horvath
Chapter 19. Liquid Association and Related Ideas in Quantifying Changes in Correlation
Abstract
This chapter describes a novel statistical concept of a ternary relationship between variables in a complex data system, coined ‘liquid association’ (LA) by Li (Proc Natl Acad Sci U S A 99(16):16875–16880, 2002). LA describes how variation in the pattern of association between a pair of variables, including its sign and strength, is mediated by a third variable from the background. LA is introduced because, despite the many successful applications of similarity-based analysis to microarray data, there are also numerous cases in which the functional association between genes is known from the literature (and confirmed by experiments) but the statistical correlation in the corresponding expression data is practically zero. Beyond noise in the microarray data, a deeper reason may be the biological complexity of the cellular system and its hidden components, which are not directly measured by gene expression, such as multiple functions of a protein, varying cellular oxidation-reduction states, fluctuating hormone levels, or other cellular signals.
Ker-Chau Li
Chapter 20. Boolean Networks
Abstract
Reconstruction of genetic regulatory networks from gene expression profiles and protein interaction data is a critical problem in systems biology. Boolean networks and their variants have been used for network reconstruction problems because of their simplicity. In the graph of a Boolean network, nodes represent the statuses of genes while the edges represent relationships between genes. In a Boolean network model, the status of a gene is quantized as ‘on’ or ‘off’, representing the gene as being ‘active’ or ‘inactive’ respectively. In this chapter, we will introduce the basic definitions of Boolean networks and the analysis of their properties. We will also discuss a related model called the probabilistic Boolean network, which extends Boolean networks to handle data uncertainty and model selection. Furthermore, we will introduce the directed acyclic Boolean network and the statistical method SPAN, which reconstructs Boolean networks from noisy array data by assigning an s-p-score to every pair of genes. Finally, we will suggest possible directions for future developments on Boolean networks.
Tung-Hung Chueh, Henry Horng-Shing Lu
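The Boolean network model described in the Chapter 20 abstract, with genes quantized as ‘on’ or ‘off’ and edges encoding relationships, can be illustrated with a small synchronous-update sketch. The three genes A, B and C and their rules are hypothetical, and the synchronous update scheme is an assumption; probabilistic Boolean networks and the SPAN method from the chapter are not shown.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]
Rules = Dict[str, Callable[[State], bool]]

# Hypothetical regulatory rules for three made-up genes.
RULES: Rules = {
    "A": lambda s: not s["C"],         # C represses A
    "B": lambda s: s["A"],             # A activates B
    "C": lambda s: s["A"] and s["B"],  # A and B together activate C
}


def step(state: State, rules: Rules = RULES) -> State:
    """Synchronous update: every gene reads the same previous state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def trajectory(state: State, n_steps: int, rules: Rules = RULES) -> List[State]:
    """Return the initial state followed by n_steps successive states."""
    states = [state]
    for _ in range(n_steps):
        states.append(step(states[-1], rules))
    return states


if __name__ == "__main__":
    start = {"A": True, "B": False, "C": False}
    for t, s in enumerate(trajectory(start, 5)):
        print(t, "".join("1" if s[g] else "0" for g in "ABC"))
```

With these particular rules the state sequence is 100, 110, 111, 011, 000 and then 100 again, so the toy network settles into a cycle of length five. Finding such attractors is one of the standard ways of analyzing Boolean network dynamics.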
Chapter 21. Protein Interaction Networks: Protein Domain Interaction and Protein Function Prediction
Abstract
Most of a cell’s functional processes involve interactions among proteins, and a key challenge in proteomics is to better understand these complex interaction graphs at a systems level. Because of their importance in development and disease, protein-protein interactions (PPIs) have been the subject of intense research in recent years. In addition, a greater understanding of PPIs can be achieved through the detailed investigation of the protein domain interactions which mediate PPIs. In this chapter, we describe recent efforts to predict interactions between proteins and between protein domains. We also describe methods that attempt to use protein interaction data to infer protein function. Protein-protein interactions directly contribute to protein functions, and implications about functions can often be made via PPI studies. These inferences are based on the premise that the function of a protein may be discovered by studying its interaction with one or more proteins of known functions. The second part of this chapter reviews recent computational approaches to predict protein functions from PPI networks.
Yanjun Qi, William Stafford Noble
Chapter 22. Reverse Engineering of Gene Regulation Networks with an Application to the DREAM4 in silico Network Challenge
Abstract
Despite much research, reverse engineering of gene regulation remains a challenging task due to the large number of genes involved and the complex relationships among them. In this chapter, we review statistical methods for inferring gene regulation networks, specifically focusing on the methods for analyzing gene expression data. We then present a new reverse engineering method in order to efficiently utilize datasets from various perturbation experiments as well as to integrate these multiple sources of information. We apply our approach to the DREAM in silico network challenge to demonstrate its performance.
Hyonho Chun, Jia Kang, Xianghua Zhang, Minghua Deng, Haisu Ma, Hongyu Zhao
Chapter 23. Inferring Signaling and Gene Regulatory Network from Genetic and Genomic Information
Abstract
Biological systems respond to environmental changes and genetic variations. One of the essential tasks of systems biology is to untangle the signaling and gene regulatory networks that respond to environmental changes or genetic variations. However, unwiring the complex gene regulatory program is extremely challenging due to the large number of variables involved in these regulatory programs. The traditional single gene centered strategy turns out to be both insufficient and inefficient for studying signaling and gene regulatory networks. With the emergence of various high throughput technologies, such as DNA microarray, ChIP-chip, etc., it becomes possible to interrogate the biological systems at genome scale efficiently and cost effectively. As these high throughput data are accumulating rapidly, there exists a clear demand for methods that effectively integrate these data to elucidate the complex behaviors of biological systems. In this chapter, we discuss several recently developed computational models that integrate diverse types of high throughput data, particularly, the genetic and genomic data, as examples for the systems approaches that untangle signaling and gene regulatory networks.
Zhidong Tu, Jun Zhu, Fengzhu Sun
Chapter 24. Computational Drug Target Pathway Discovery: A Bayesian Network Approach
Abstract
Genome-wide transcriptome data together with statistical analysis enable us to reverse-engineer gene networks, which provide a useful view for understanding the dynamic behaviour of biological elements in cells. In this chapter, we elucidate statistical models for estimating gene networks based on two types of microarray gene expression data, gene knock-down and time-course. In our modeling, a nonparametric regression model is combined with Bayesian networks to capture nonlinear relationships between genes, and a derived Bayesian information criterion with an efficient structure learning algorithm selects the network structure. Some efficient algorithms for structure learning of Bayesian networks, which is known to be an NP-hard problem for optimal solutions, are also introduced. To demonstrate the statistical gene network analysis shown in this chapter, we estimate gene networks based on microarray data of human endothelial cells treated with the anti-hyperlipidaemia drug fenofibrate. Based on the constructed gene networks, we illustrate computational strategies for discovering drug target genes and pathways.
Seiya Imoto, Yoshinori Tamada, Hiromitsu Araki, Satoru Miyano
Chapter 25. Cancer Systems Biology
Abstract
Cancer is a complex disease, resulting from system-wide interactions of biological processes rather than from any single underlying cause. The processes that drive all cancer development and progression have been termed the ‘hallmarks of cancer’. With the growth of large-scale measurements of numerous molecular and cellular properties, a new approach, cancer systems biology, to understanding the interrelationship between the hallmarks is presently being developed. Cancer systems biology focuses on systems-level analysis and presently strives to develop novel data integration and analysis techniques to model and infer cancer biology and treatment response.
Elana J. Fertig, Ludmila V. Danilova, Michael F. Ochs
Chapter 26. Comparative Genomics
Abstract
Comparative genomics was previously misguided by the naïve dogma that what is true in E. coli is also true in the elephant. With the rejection of such a dogma, comparative genomics has been positioned in proper evolutionary context. Here I numerically illustrate the application of phylogeny-based comparative methods in comparative genomics involving both continuous and discrete characters to solve problems from characterizing functional association of genes to detection of horizontal gene transfer and viral genome recombination, together with a detailed explanation and numerical illustration of statistical significance tests based on the false discovery rate (FDR). FDR methods are essential for the multiple comparisons associated with almost any large-scale comparative genomic study. I discuss the strengths and weaknesses of the methods and provide some guidelines on their proper application.
Xuhua Xia
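As a rough illustration of the FDR-based significance testing mentioned in the Chapter 26 abstract, the sketch below implements the Benjamini-Hochberg step-up procedure, one common FDR-controlling method. Choosing this procedure is an assumption: the abstract does not say which FDR method the chapter presents, and the function name and P-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, in input order, whether each hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the P-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    # Step-up rule: reject the max_k smallest P-values.
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical P-values for illustration only.
    p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]
    print(benjamini_hochberg(p_vals, alpha=0.05))
```

With five tests at alpha = 0.05 the rank thresholds are 0.01, 0.02, 0.03, 0.04 and 0.05, so only the two smallest P-values are rejected here, even though 0.039 and 0.041 would each pass an uncorrected 0.05 cutoff.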
Chapter 27. Robust Control of Immune Systems Under Noises: Stochastic Game Approach
Abstract
A robust control of immune response is proposed for therapeutic enhancement to match a prescribed immune response under uncertain initial states and environmental noises, including continuous intrusion of exogenous pathogens. The worst-case effect of all possible noises and uncertain initial states on the matching for a desired immune response is minimized for the enhanced immune system, i.e., a robust control is designed to track a prescribed immune model response from the stochastic minimax matching perspective. This minimax matching problem could herein be transformed to an equivalent stochastic game problem. The exogenous pathogens and environmental noises (external noises) and stochastic uncertain internal noises are considered as a player to maximize (worsen) the matching error when the therapeutic control agents are considered as another player to minimize the matching error. Since the innate immune system is highly nonlinear, it is not easy to solve the robust control problem by the nonlinear stochastic game method directly. A fuzzy model is proposed to interpolate several linearized immune systems at different operating points to approximate the innate immune system via smooth fuzzy membership functions. With the help of fuzzy approximation method, the stochastic minimax matching control problem of immune systems could be easily solved by the proposed fuzzy stochastic game method via the linear matrix inequality (LMI) technique with the help of Robust Control Toolbox in Matlab. Finally, in silico examples are given to illustrate the design procedure and to confirm the efficiency and efficacy of the proposed method.
Bor-Sen Chen, Chia-Hung Chang, Yung-Jen Chuang
Metadata
Title
Handbook of Statistical Bioinformatics
Edited by
Henry Horng-Shing Lu
Bernhard Schölkopf
Hongyu Zhao
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-16345-6
Print ISBN
978-3-642-16344-9
DOI
https://doi.org/10.1007/978-3-642-16345-6