Skip to main content
Top

2011 | Book

Problem Solving Handbook in Computational Biology and Bioinformatics

Editors: Lenwood S. Heath, Naren Ramakrishnan

Publisher: Springer US

insite
SEARCH

About this book

Bioinformatics is growing by leaps and bounds; theories/algorithms/statistical techniques are constantly evolving. Nevertheless, a core body of algorithmic ideas have emerged and researchers are beginning to adopt a "problem solving" approach to bioinformatics, wherein they use solutions to well-abstracted problems as building blocks to solve larger scope problems.

Problem Solving Handbook for Computational Biology and Bioinformatics is an edited volume contributed by world renowned leaders in this field. This comprehensive handbook with problem solving emphasis, covers all relevant areas of computational biology and bioinformatics. Web resources and related themes are highlighted at every opportunity in this central easy-to-read reference.

Designed for advanced-level students, researchers and professors in computer science and bioengineering as a reference or secondary text, this handbook is also suitable for professionals working in this industry.

Table of Contents

Frontmatter

Sequences

Frontmatter
Modern BLAST Programs
Abstract
The Basic Local Alignment Search Tool (BLAST) is arguably the most widely used program in bioinformatics. By sacrificing sensitivity for speed, it makes sequence comparison practical on huge sequence databases currently available. The original version of BLAST was developed in 1990. Since then it has spawned a variant of specialized programs. This chapter surveys the development of BLAST and BLAST-like programs for homology search, discusses alignment statistics that are used in assessment of reported matches in BLAST, and provides the reader with guidance to select appropriate programs and set proper parameters to match research requirements.
Jian Ma, Louxin Zhang
Practical Multiple Sequence Alignment
Abstract
Multiple sequence alignment as a means of comparing DNA, RNA, or amino acid sequences is an essential precondition for various analyses, including structure prediction, modeling binding sites, phylogeny, or function prediction. This range of applications implies a demand for versatile, flexible, and specialized methods to compute accurate alignments. This chapter summarizes the key algorithmic insights gained in the past years to facilitate an easy understanding of the current multiple sequence alignment literature and to enable the readers to use and apply current tools in their own research.
Tobias Rausch, Knut Reinert
Sequence Alignment Statistics
Abstract
This chapter gives some simple, useful techniques for approximating the p-values of various types of optimal alignment scores. It starts with general techniques: if, e.g., a dynamic programming computation has probabilistically independent inputs, its successive states form a Markov chain. Thus, if the states are not too numerous, a “Markov computation” yields their distribution. The chapter reviews the three extreme-value distributions, which are relevant to approximating the distribution of random maxima, in the same way the normal distribution is relevant to approximating the distribution of random sums. In general, convergence to an extreme-value distribution is often painfully slow, so the Poisson approximation for counting rare and weakly dependent events can be a more flexible tool for approximating the distribution of maxima. In particular, the extreme-value and Poisson distributionsyield an approximate distribution for the optimal local alignment score of two random sequences, and a finite-size correction can increase the accuracy of statistical approximations if the sequences are relatively short. Moreover, the concept of “islands” permits many statistical approximation problems in local alignment to be transformed to combinatorial problems. Finally, the “Independent Diagonals Approximation” broadens the application of many of the previous methods, and an “Independent Alignments Approximation” converts many alignment variants into the combinatorial problem of determining an “effective length”.
John L. Spouge

Phylogenetics

Frontmatter
Practical Implications of Coalescent Theory
Abstract
The coalescent has become perhaps the most widely-used population genetics model. By modeling the ancestry of a sample, rather than the evolution of the entire population from which the sample is drawn, it provides a computationally efficient framework for data simulation. Furthermore, from a theoretical perspective, it provides the under-pinnings for many useful analysis techniques. In this chapter, we introduce the coalescent, describe some of the problems that it has been used to address, discuss practical implications that follow from the insight it provides, and summarize some of the available software.
Paul Marjoram, Paul Joyce
Graph Model of Coalescence with Recombinations
Abstract
One of the primary genetic events shaping an autosomal chromosome is recombination. This is a process that occurs during meiosis, in eukaryotes, that results in the offsprings having different combinations of (homologous) genes, or chromosomal segments, of the two parents. The presence of these recombination events in the evolutionary history of each chromosome complicates the genetic landscape of a population, and understanding the manifestations of these genetic exchanges in the chromosome sequences has been a subject of intense curiosity (see [Hud83, Gri99, HSW05] and citations therein).
Laxmi Parida
Phylogenetic Trees From Sequences
Abstract
In this chapter, we review important concepts and approaches for phylogeny reconstruction from sequence data.We first cover some basic definitions and properties of phylogenetics, and briefly explain how scientists model sequence evolution and measure sequence divergence. We then discuss three major approaches for phylogenetic reconstruction: distance-based phylogenetic reconstruction, maximum parsimony, and maximum likelihood. In the third part of the chapter, we review how multiple phylogenies are compared by consensus methods and how to assess confidence using bootstrapping. At the end of the chapter are two sections that list popular software packages and additional reading.
Paul Ryvkin, Li-San Wang
Evolutionary Phylogenetic Networks: Models and Issues
Abstract
Phylogenetic networks are special graphs that generalize phylogenetic trees to allow for modeling of non-treelike evolutionary histories. The ability to sequence multiple genetic markers from a set of organisms and the conflicting evolutionary signals that these markers provide in many cases, have propelled research and interest in phylogenetic networks to the forefront in computational phylogenetics. Nonetheless, the term ‘phylogenetic network‘ has been generically used to refer to a class of models whose core shared property is tree generalization. Several excellent surveys of the different flavors of phylogenetic networks and methods for their reconstruction have been written recently. However, unlike these surveys, this chapte focuses specifically on one type of phylogenetic networks, namely evolutionary phylogenetic networks, which explicitly model reticulate evolutionary events. Further, this chapter focuses less on surveying existing tools, and addresses in more detail issues that are central to the accurate reconstruction of phylogenetic networks.
Luay Nakhleh
Genome Wide Association Studies
Abstract
The availability of high throughput technology for parallel genotyping has opened the field of genetics to genome-wide association studies (GWAS). These studies generate massive amount of genetic data that challenge investigators with issues related to data management, statistical analysis of large data sets, visualization, and annotation of results. We will review the common approach to analysis of GWAS data and then discuss options to learn more from these data.
Paola Sebastiani, Nadia Solovieff

Proteins: Structure, Function, and Biochemistry

Frontmatter
Novel Perspectives on Protein Structure Prediction
Abstract
Our understanding of the protein structure prediction problem is evolving. Recent experimental insights into the protein folding mechanism suggest that many polypeptides may adopt multiple conformations. Consequently, modeling and prediction of an ensemble of configurations is more relevant than the classical approach that aims to compute a single structure for a given sequence. In this chapter, we review recent algorithmic advances which enable the application of statistical mechanics techniques to predicting these structural ensembles. These techniques overcome the limitations of costly folding simulations and allow a rigorous model of the conformational landscape. To illustrate the strength and versatility of this approach, we present applications of these algorithms to various typical protein structure problems ranging from predicting residue contacts to experimental X-ray crystallography measures.
Bonnie Berger, Jéerôme Waldispühl
Stochastic Simulation for Biochemical Systems
Abstract
Biochemical systems are often modeled with ordinary differential equations that are continuous and deterministic by nature. In recent years, with the development of new techniques to collect wet-lab data in a single cell, there are increasing concerns on the stochastic effect in cellular systems, where the small copy numbers of some reactant species in the cell may lead to deviations from the predictions of the deterministic differential equations of classical chemical kinetics. In this chapter, we will review important algorithms for stochastic modeling and simulation of biochemical systems.
Yang Cao

Networks

Frontmatter
Cellular Response Networks
Abstract
Complex networks of interactions between genes, proteins, and other molecules choreograph cellular processes. The interactions that are active in the cell change over time, both as a natural outcome of the cell‘s natural life cycle and in response to external signals. The set of active interactions, called the response network, are likely to be significantly different between a normally-functioning cell and a diseased cell. The wide availability of DNA microarray data and experimentallydetermined interaction networks has made it possible to automatically compute response networks. This chapter surveys algorithms that have been developed to compute response networks.
Christopher D. Lasher, Christopher L. Poirel, T. M. Murali
Identification of Modules in Protein-Protein Interaction Networks
Abstract
In biological systems, most processes are carried out through orchestration of multiple interacting molecules. These interactions are often abstracted using network models. A key feature of cellular networks is their modularity, which contributes significantly to the robustness, as well as adaptability of biological systems. Therefore, modularization of cellular networks is likely to be useful in obtaining insights into the working principles of cellular systems, as well as building tractable models of cellular organization and dynamics. A common, high-throughput source of data on molecular interactions is in the form of physical interactions between proteins, which are organized into protein-protein interaction (PPI) networks. This chapter provides an overview on identification and analysis of functional modules in PPI networks, which has been an active area of research in the last decade.
Proteins that make up a functional module tend to interact with each other and form a densely connected subgraph in a PPI network.Motivated by this observation, module identification is often formulated as a problem of partitioning a PPI network into dense subgraphs, which is also known as graph clustering. This chapter begins with a brief introduction to the module identification problem in PPI networks. Then, graph theoretical measures of modularity such as density, clustering coefficient and edge connectivity are introduced. Algorithmic approaches for identifying modules are then presented in a systematic manner. These clustering approaches are broadly categorized as (i) Bottom-up (ii) Top-down (iii) Iterative Improvement and (iv) Flow Based methods. Subsequently, a sample application of modularization, namely, predicting the function of uncharacterized proteins, is briefly discussed. More advanced methods to identify functional modules often integrate other data sources such as gene expression data with PPI data or use multiple networks to find conserved regions in the networks. After an overview on these advanced methods, some exercises are presented to the reader.
Sinan Erten, Mehmet Koyutürk

Biological Data Management and Mining

Frontmatter
Designing Microarray Experiments
Abstract
Gene expression microarrays have become an important exploratory tool in many screening experiments that aim to discover the genes that change expression in two or more biological conditions and can be used to build molecular profiles for both diagnostic and prognostic use. The still very high costs of microarrays and the difficulty in generating the biological samples are critical issues of microarraybased screening experiments, and the experimental design plays a crucial role in how informative an experiment is going to be. In this chapter, we describe some of the major issues related to the design of either randomized control trials or observational studies and discuss the choice of powerful sample sizes, the selection of informative experimental conditions, and experimental strategies that can minimize confounding. We conclude with a discussion of some of the open problems in the design and analysis of microarray experiments that need further research.
Paola Sebastiani, Jacqui Milton, Ling Wang
Matrix and Tensor Decompositions
Abstract
Advances in high-throughput technologies such as gene and protein expression microarrays in the past decade have made it possible to simultaneously measure the expression levels of thousands of transcripts. This has resulted in large amounts of biological data requiring analysis and interpretation. Many methods for handling such large-scale data have been proposed in the literature. For example, consider a p ×n gene expression matrix V consisting of observations on p genes from n samples representing different experimental conditions, phenotypes or time points. One could be interested in identifying clusters of genes with similar expression profiles across sub-groups of samples. Typically, this is accomplished via a decomposition of V into two or more matrices where each factored matrix has a distinct physical interpretation. Matrix decompositions have been successfully utilized in a variety of applications in computational biology such as molecular pattern discovery, class comparison, class prediction, functional characterization of genes, cross-platform and cross-species analysis, and biomedical informatics. In this chapter, we focus on available and commonly utilized methods for such matrix decompositions as well as survey other potentially useful methods for analyzing highdimensional data.
Karthik Devarajan
Practical Applications of the Gene Ontology Resource
Abstract
The Gene Ontology (GO) is a controlled vocabulary that represents knowledge about the functional attributes of gene products in a structured manner and can be used in both computational and human analyses. This vocabulary has been used by diverse curation groups to associate functional information to individual gene products in the form of annotations. GO has proven an invaluable resource for evaluating and interpreting the biological significance of large data sets, enabling researchers to create hypotheses to direct their future research. This chapter provides an overview of the Gene Ontology, how it can be used, and tips on getting the most out of GO analyses.
Rachael P. Huntley, Emily C. Dimmer, Rolf Apweiler
Backmatter
Metadata
Title
Problem Solving Handbook in Computational Biology and Bioinformatics
Editors
Lenwood S. Heath
Naren Ramakrishnan
Copyright Year
2011
Publisher
Springer US
Electronic ISBN
978-0-387-09760-2
Print ISBN
978-0-387-09759-6
DOI
https://doi.org/10.1007/978-0-387-09760-2

Premium Partner