
2009 | Book

Research in Computational Molecular Biology

13th Annual International Conference, RECOMB 2009, Tucson, AZ, USA, May 18-21, 2009. Proceedings


About this book

This book constitutes the refereed proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2009, held in Tucson, Arizona, USA in May 2009. The 37 revised full papers presented were carefully reviewed and selected from 166 submissions. As the top conference in computational molecular biology, RECOMB addresses all current issues in algorithmic, theoretical, and experimental bioinformatics such as molecular sequence analysis, recognition of genes and regulatory elements, molecular evolution, protein structure, structural genomics, gene expression, gene networks, drug design, combinatorial libraries, computational proteomics, as well as structural and functional genomics.

Table of Contents

Frontmatter
Searching Protein 3-D Structures in Linear Time

Finding similar structures in 3-D structure databases of proteins is becoming an increasingly important issue in post-genomic molecular biology. To compare the 3-D structures of two molecules, biologists mostly use the RMSD (root mean square deviation) as the similarity measure. We propose new theoretically and practically fast algorithms for the fundamental problem of finding all the substructures of structures in a structure database of chain molecules (such as proteins) whose RMSDs to the query are within a given constant threshold. We first propose a breakthrough linear-expected-time algorithm for the problem, while the previous best-known time complexity was $O(N \log m)$, where $N$ is the database size and $m$ is the query size. For the expected time analysis, we propose to use the random-walk model (or the ideal chain model) as the model of average protein structures. We furthermore propose a series of preprocessing algorithms that enable faster queries. We checked the performance of our linear-expected-time algorithm through computational experiments over the whole PDB database. According to the experiments, our algorithm is 3.6 to 28 times faster than previously known algorithms for ordinary queries. Moreover, the experimental results support the validity of our theoretical analyses.

Tetsuo Shibuya
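
As a rough illustration of the similarity measure named in the abstract above, the following sketch computes the RMSD between two equal-length lists of 3-D points and scans a chain for windows within a threshold. It is not the paper's linear-expected-time algorithm: it is a naive scan, and it assumes the coordinates are already superimposed, omitting the optimal rotation and translation that structure comparison normally applies before measuring RMSD. The function names `rmsd` and `substructures_within` are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumption: the two point sets are already superimposed; no optimal
    rotation or translation is searched for here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


def substructures_within(chain, query, threshold):
    """Naive scan: start indices of the length-m windows of `chain`
    whose RMSD to `query` is at most `threshold` (m = len(query))."""
    m = len(query)
    return [
        i
        for i in range(len(chain) - m + 1)
        if rmsd(chain[i:i + m], query) <= threshold
    ]
```

The scan evaluates every window in full, so it costs time proportional to the database size times the query size, which is the kind of baseline the abstract's faster algorithms aim to beat.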
Optimization-Based Peptide Mass Fingerprinting for Protein Mixture Identification

In current proteome research, the most widely used method for protein mixture identification is probably peptide sequencing. Peptide sequencing is based on tandem Mass Spectrometry (MS/MS) data. The disadvantage is that MS/MS data only sequences a limited number of peptides and leaves many more peptides uncovered.

Peptide Mass Fingerprinting (PMF) has been widely used to identify single purified proteins from single-stage MS data. Unfortunately, this technique is less accurate than the peptide sequencing method and can not handle protein mixtures, which hampers the widespread use of PMF.

In this paper, we tackle the problem of protein mixture identification from an optimization point of view. We show that some simple heuristics can find good solutions to the optimization problem. As a result, we obtain much better identification results than previous methods. Through a comprehensive simulation study, we identify a set of limiting factors that hinder the performance of PMF-based protein mixture identification. We argue that it is feasible to remove these limitations and PMF can be a powerful tool in the analysis of protein mixtures, especially in the identification of low-abundance proteins which are less likely to be sequenced by MS/MS scanning.

Availability: The source codes, data and supplementary documents are available at http://bioinformatics.ust.hk/PMFMixture.rar

Zengyou He, Chao Yang, Can Yang, Robert Z. Qi, Jason Po-Ming Tam, Weichuan Yu
Boosting Protein Threading Accuracy

Protein threading is one of the most successful protein structure prediction methods. Most protein threading methods use a scoring function linearly combining sequence and structure features to measure the quality of a sequence-template alignment so that a dynamic programming algorithm can be used to optimize the scoring function. However, a linear scoring function cannot fully exploit interdependency among features and thus, limits alignment accuracy.

This paper presents a nonlinear scoring function for protein threading, which not only can model interactions among different protein features, but also can be efficiently optimized using a dynamic programming algorithm. We achieve this by modeling the threading problem using a probabilistic graphical model Conditional Random Fields (CRF) and training the model using the gradient tree boosting algorithm. The resultant model is a nonlinear scoring function consisting of a collection of regression trees. Each regression tree models a type of nonlinear relationship among sequence and structure features. Experimental results indicate that this new threading model can effectively leverage weak biological signals and improve both alignment accuracy and fold recognition rate greatly.

Jian Peng, Jinbo Xu
New Perspectives on Gene Family Evolution: Losses in Reconciliation and a Link with Supertrees

Reconciliation between a set of gene trees and a species tree is the most commonly used approach to infer the duplication and loss events in the evolution of gene families, given a species tree. When a species tree is not known, a natural algorithmic problem is to infer a species tree such that the corresponding reconciliation minimizes the number of duplications and/or losses. In this paper, we clarify several theoretical questions and study various algorithmic issues related to these two problems. (1) For a given gene tree $T$ and species tree $S$, we show that there is a single history explaining $T$ and consistent with $S$ that minimizes gene losses, and that this history also minimizes the number of duplications. We describe a simple linear-time and space algorithm to compute this parsimonious history, that is not based on the Lowest Common Ancestor (LCA) mapping approach; (2) We show that the problem of computing a species tree that minimizes the number of gene duplications, given a set of gene trees, is in fact a slight variant of a supertree problem; (3) We show that deciding if a set of gene trees can be explained using only apparent duplications can be done efficiently, as well as computing a parsimonious species tree for such gene trees. We also characterize gene trees that can be explained using only apparent duplications in terms of compatible triplets of leaves.

Cedric Chauve, Nadia El-Mabrouk
A Probabilistic Graphical Model for Ab Initio Folding

Despite significant progress in recent years, ab initio folding is still one of the most challenging problems in structural biology. This paper presents a probabilistic graphical model for ab initio folding, which employs Conditional Random Fields (CRFs) and directional statistics to model the relationship between the primary sequence of a protein and its three-dimensional structure. Different from the widely-used fragment assembly method and the lattice model for protein folding, our graphical model can explore protein conformations in a continuous space according to their probability. The probability of a protein conformation reflects its stability and is estimated from PSI-BLAST sequence profile and predicted secondary structure. Experimental results indicate that this new method compares favorably with the fragment assembly method and the lattice model.

Feng Zhao, Jian Peng, Joe DeBartolo, Karl F. Freed, Tobin R. Sosnick, Jinbo Xu
Topology-Free Querying of Protein Interaction Networks

In the network querying problem, one is given a protein complex or pathway of species A and a protein–protein interaction network of species B; the goal is to identify subnetworks of B that are similar to the query. Existing approaches mostly depend on knowledge of the interaction topology of the query in the network of species A; however, in practice, this topology is often not known. To combat this problem, we develop a topology-free querying algorithm, which we call Torque. Given a query, represented as a set of proteins, Torque seeks a matching set of proteins that are sequence-similar to the query proteins and span a connected region of the network, while allowing both insertions and deletions. The algorithm uses alternatively dynamic programming and integer linear programming for the search task. We test Torque with queries from yeast, fly, and human, where we compare it to the QNet topology-based approach, and with queries from less studied species, where only topology-free algorithms apply. Torque detects many more matches than QNet, while in both cases giving results that are highly functionally coherent.

Sharon Bruckner, Falk Hüffner, Richard M. Karp, Ron Shamir, Roded Sharan
Cross Species Expression Analysis of Innate Immune Response

The innate immune response is the first line of host defense against infections. This system employs a number of different types of cells which in turn activate different sets of genes. Microarray studies of human and mouse cells infected with various pathogens identified hundreds of differentially expressed genes. However, combining these datasets to identify common and unique response patterns remained a challenge. We developed methods based on probabilistic graphical models to combine expression experiments across species, cells and pathogens. Our method analyzes homologous genes in different species concurrently overcoming problems related to noise and orthology assignments. Using our method we identified both core immune response genes and genes that are activated in macrophages in both human and mouse but not in dendritic cells, and vice versa. Our results shed light on immune response mechanisms and on the differences between various types of cells that are used to fight infecting bacteria.

Supporting website: http://www.cs.cmu.edu/~lyongu/pub/immune/

Yong Lu, Roni Rosenfeld, Gerard J. Nau, Ziv Bar-Joseph
Haplotype Inference in Complex Pedigrees

Despite the desirable information contained in complex pedigree datasets, analysis methods struggle to efficiently process these datasets. The attractiveness of pedigree data sets is their power for detecting rare variants, particularly in comparison with studies of unrelated individuals. In addition, rather than assuming individuals in a study are unrelated, knowledge of their relationships can avoid spurious results due to confounding population structure effects. However, a major challenge for the applicability of pedigree methods is the ability to handle complex pedigrees, having multiple founding lineages, inbreeding, and half-sibling relationships.

A key ingredient in association studies is imputation and inference of haplotypes from genotype data. Existing haplotype inference methods either do not scale efficiently to complex pedigrees or are limited in accuracy. In this paper, we present algorithms for efficient haplotype inference and imputation in complex pedigrees. Our method, PhyloPed, leverages the perfect phylogeny model, resulting in an efficient method with high accuracy. In addition, PhyloPed effectively combines the founder haplotype information from different lineages and is immune to inaccuracies in prior information about the founders.

Bonnie Kirkpatrick, Javier Rosa, Eran Halperin, Richard M. Karp
Storage and Retrieval of Individual Genomes

A repetitive sequence collection is one where portions of a base sequence of length $n$ are repeated many times with small variations, forming a collection of total length $N$. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies $O(N \log N)$ bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to $O(N \log \sigma)$ bits, where $\sigma$ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the Human Genome. However, this reduction factor remains constant when more sequences are added to the collection.

We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length $n$ of the base sequence and the number $s$ of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on $N/n$.

We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.

Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, Niko Välimäki
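
The abstract above notes that the differences between individual sequences and a base sequence can be expressed as lists of basic edit operations. The sketch below illustrates only that storage idea, under the simplifying assumption that every variation is a single-character substitution (no insertions or deletions). It is not one of the self-indexes developed in the paper and offers no pattern search; the function names are hypothetical.

```python
def diff_substitutions(base, seq):
    """List the (position, new_char) substitutions that turn `base` into `seq`.

    Assumption: both sequences have the same length, so substitutions
    are the only edit operation needed.
    """
    if len(base) != len(seq):
        raise ValueError("this sketch handles substitutions only")
    return [(i, c) for i, (b, c) in enumerate(zip(base, seq)) if b != c]


def apply_edits(base, edits):
    """Rebuild an individual's sequence from the base and its edit list."""
    seq = list(base)
    for pos, char in edits:
        seq[pos] = char
    return "".join(seq)
```

Storing one base sequence plus one short edit list per individual takes space proportional to the base length plus the number of variations, rather than to the total length of the collection, which is the dependence the abstract describes.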
An Online Approach for Mining Collective Behaviors from Molecular Dynamics Simulations

Collective behavior involving distally separate regions in a protein is known to widely affect its function. In this paper, we present an online approach to study and characterize collective behavior in proteins as molecular dynamics simulations progress. Our representation of MD simulations as a stream of continuously evolving data allows us to succinctly capture spatial and temporal dependencies that may exist and analyze them efficiently using data mining techniques. By using multi-way analysis we identify (a) parts of the protein that are dynamically coupled, (b) constrained residues/ hinge sites that may potentially affect protein function and (c) time-points during the simulation where significant deviation in collective behavior occurred. We demonstrate the applicability of this method on two different protein simulations for barnase and cyclophilin A. For both these proteins we were able to identify constrained/ flexible regions, showing good agreement with experimental results and prior computational work. Similarly, for the two simulations, we were able to identify time windows where there were significant structural deviations. Of these time-windows, for both proteins, over 70% show collective displacements in two or more functionally relevant regions. Taken together, our results indicate that multi-way analysis techniques can be used to analyze protein dynamics and may be an attractive means to automatically track and monitor molecular dynamics simulations.

Arvind Ramanathan, Pratul K. Agarwal, Maria Kurnikova, Christopher J. Langmead
Parameter Synthesis in Nonlinear Dynamical Systems: Application to Systems Biology

The dynamics of biological processes are often modeled as systems of nonlinear ordinary differential equations (ODE). An important feature of nonlinear ODEs is that seemingly minor changes in initial conditions or parameters can lead to radically different behaviors. This is problematic because in general it is never possible to know/measure the precise state of any biological system due to measurement errors. The parameter synthesis problem is to identify sets of parameters (including initial conditions) for which a given system of nonlinear ODEs does not reach a given set of undesirable states. We present an efficient algorithm for solving this problem that combines sensitivity analysis with an efficient search over initial conditions. It scales to high-dimensional models and is exact if the given model is affine. We demonstrate our method on a model of the acute inflammatory response to bacterial infection, and identify initial conditions consistent with 3 biologically relevant outcomes.

Alexandre Donzé, Gilles Clermont, Axel Legay, Christopher J. Langmead
Spatial Clustering of Multivariate Genomic and Epigenomic Information

The combination of fully sequenced genomes and new technologies for high density arrays and ultra-rapid sequencing enables the mapping of gene-regulatory and epigenetic marks on a global scale. This new experimental methodology was recently applied to map multiple histone marks and genomic factors, characterizing patterns of genome organization and discovering interactions among processes of epigenetic reprogramming during cellular differentiation. The new data poses a significant computational challenge in both size and statistical heterogeneity. Understanding it collectively and without bias remains an open problem. Here we introduce spatial clustering - a new unsupervised clustering methodology for dissection of large, multi-track genomic and epigenomic data sets into a spatially organized set of distinct combinatorial behaviors. We develop a probabilistic algorithm that finds spatial clustering solutions by learning an HMM model and inferring the most likely genomic layout of clusters. Application of our methods to meta-analysis of combined ChIP-seq and ChIP-chip epigenomic datasets in mouse and human reveals known and novel patterns of local co-occurrence among histone modifications and related factors. Moreover, the model weaves together these local patterns into a coherent global model that reflects the higher level organization of the epigenome. Spatial clustering constitutes a powerful and scalable analysis methodology for dissecting the even larger scale genomic datasets that will soon become available.

Rami Jaschek, Amos Tanay
How Many Bootstrap Replicates Are Necessary?

Phylogenetic Bootstrapping (BS) is a standard technique for inferring confidence values on phylogenetic trees that is based on reconstructing many trees from minor variations of the input data, trees called replicates. BS is used with all phylogenetic reconstruction approaches, but we focus here on the most popular, Maximum Likelihood (ML). Because ML inference is so computationally demanding, it has proved too expensive to date to assess the impact of the number of replicates used in BS on the quality of the support values. For the same reason, a rather small number (typically 100) of BS replicates are computed in real-world studies. Stamatakis et al. recently introduced a BS algorithm that is 1–2 orders of magnitude faster than previous techniques, while yielding qualitatively comparable support values, making an experimental study possible.

In this paper, we propose stopping criteria, that is, thresholds computed at runtime to determine when enough replicates have been generated, and report on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of our proposed criteria. We run our tests on 17 diverse real-world DNA, single-gene as well as multi-gene, datasets, that include between 125 and 2,554 sequences. We find that our stopping criteria typically stop computations after 100–500 replicates (although the most conservative criterion may continue for several thousand replicates) while producing support values that correlate at better than 99.5% with the reference values on the best ML trees. Significantly, we also find that the stopping criteria can recommend very different numbers of replicates for different datasets of comparable sizes.

Our results are thus two-fold: (i) they give the first experimental assessment of the effect of the number of BS replicates on the quality of support values returned through bootstrapping; and (ii) they validate our proposals for stopping criteria. Practitioners will no longer have to enter a guess nor worry about the quality of support values; moreover, with most counts of replicates in the 100–500 range, robust BS under ML inference becomes computationally practical for most datasets. The complete test suite is available at http://lcbb.epfl.ch/BS.tar.bz2 and BS with our stopping criteria is included in RAxML 7.1.0.

Nicholas D. Pattengale, Masoud Alipour, Olaf R. P. Bininda-Emonds, Bernard M. E. Moret, Alexandros Stamatakis
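
The abstract above describes replicates as trees built from minor variations of the input data. In the standard phylogenetic bootstrap, each variation is obtained by resampling the alignment columns with replacement; the sketch below shows only that resampling step. Tree inference and the paper's stopping criteria are not shown, and the function name and the dictionary-based alignment format are assumptions made for illustration.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    `alignment` maps a taxon name to its aligned sequence; all sequences
    must have the same length. The same sampled column indices are used
    for every taxon, so each output column is an intact input column.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][c] for c in columns) for name in names
    }
```

Passing a seeded `random.Random` instance keeps a run reproducible; a tree would then be inferred from each replicate alignment.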
A Robust Bayesian Two-Sample Test for Detecting Intervals of Differential Gene Expression in Microarray Time Series

Understanding the regulatory mechanisms that are responsible for an organism’s response to environmental changes is an important question in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 time points. In classification experiments our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

Oliver Stegle, Katherine Denby, David L. Wild, Zoubin Ghahramani, Karsten M. Borgwardt
Incorporating Nucleosomes into Thermodynamic Models of Transcription Regulation

Transcriptional control is central to many cellular processes and consequently, much effort has been devoted to understanding its underlying mechanisms. Recently, it has become evident that the organization of nucleosomes along promoter regions has an important role in transcriptional control, since most transcription factors cannot bind to sequences bound by nucleosomes, and thus compete with nucleosomes for DNA access. This competition is governed by the relative concentrations of nucleosomes and transcription factors and by their respective sequence binding preferences. Even though competition of nucleosomes and transcription factors may have significant effects on transcription, a mechanistic understanding of its quantitative consequences for gene expression is still missing. Here we employ a thermodynamic framework based on fundamental principles of statistical mechanics to theoretically explore the effect that different nucleosome organizations along promoters have on the activation dynamics of promoters in response to varying concentrations of the regulating transcription factors. We show that even simple landscapes of nucleosome organization reproduce experimental results regarding the effect of nucleosomes as general repressors and as generators of obligate binding cooperativity between transcription factors. Our modeling framework also allows us to characterize the effects that various sequence elements of promoters will have on the induction threshold and on the shape of the promoter activation curves.

Tali Raveh-Sadka, Michal Levo, Eran Segal
Combinatorial Algorithms for Structural Variation Detection in High Throughput Sequenced Genomes

Recent studies show that, along with single nucleotide polymorphisms and small indels, larger structural variants among human individuals are common. These studies have typically been based on high-cost library generation and Sanger sequencing; however, the recent introduction of next-generation sequencing (NGS) technologies is significantly changing how research in this area is conducted. High-throughput sequencing technologies such as 454, Illumina, Helicos, and AB SOLiD produce shorter reads than traditional capillary sequencing, yet they reduce the cost (and/or the redundancy) by a factor of 10-100 and perhaps even more. Those NGS technologies with the capability of sequencing paired ends (or mate pairs) of a clone insert (which follows a tight length distribution) have made it feasible to perform detailed and comprehensive genome variation and rearrangement studies. Unfortunately, the few existing algorithms for identifying structural variation among individuals using paired-end reads have not been designed to handle the short read lengths and the errors implied by these platforms. Here, we describe, for the first time, algorithms for identifying various forms of structural variation between a paired-end NGS sequenced genome and a reference genome.

Fereydoun Hormozdiari, Can Alkan, Evan E. Eichler, S. Cenk Sahinalp
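
Because the clone inserts follow a tight length distribution, a read pair whose ends map to the reference at an unexpected distance hints at a structural difference. The sketch below illustrates only that general paired-end signal, not the combinatorial algorithms of the paper. The function name, the `(left_pos, right_pos)` input format, and the fixed `tolerance` cutoff are all assumptions made for illustration.

```python
def discordant_pairs(pairs, mean_insert, tolerance):
    """Flag read pairs whose mapped span deviates from the expected insert size.

    `pairs` is a list of (left_pos, right_pos) reference coordinates.
    A span much longer than expected suggests the sequenced genome lacks
    reference sequence between the ends (possible deletion); a span much
    shorter suggests extra sequence in the sequenced genome (possible
    insertion).
    """
    flagged = []
    for left, right in pairs:
        span = right - left
        if span > mean_insert + tolerance:
            flagged.append((left, right, "possible deletion"))
        elif span < mean_insert - tolerance:
            flagged.append((left, right, "possible insertion"))
    return flagged
```

For example, with an expected insert of 200 and a tolerance of 50, a pair spanning 700 reference bases would be flagged as a possible deletion, while a pair spanning 200 would pass as concordant.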
Optimizing PCR Assays for DNA Based Cancer Diagnostics

Somatically acquired DNA rearrangements are characteristic of many cancers. The use of these mutations as diagnostic markers is challenging, because tumor cells are frequently admixed with normal cells, particularly in early stage tumor samples, and thus the samples contain a high background of normal DNA. Detection is further confounded by the fact that the rearrangement boundaries are not conserved across individuals, and might vary over hundreds of kilobases. Here, we present an algorithm for designing PCR primers and oligonucleotide probes to assay for these variant rearrangements. Specifically, the primers and probes tile the entire genomic region surrounding a rearrangement, so as to amplify the mutant DNA over a wide range of possible breakpoints and robustly assay for the amplified signal on an array. Our solution involves the design of a complex combinatorial optimization problem, and also includes a novel alternating multiplexing strategy that makes efficient detection possible. Simulations show that we can achieve near-optimal detection in many different cases, even when the regions are highly non-symmetric. Additionally, we prove that the suggested multiplexing strategy is optimal in breakpoint detection.

We applied our technique to create a custom design to assay for genomic lesions in several cancer cell-lines associated with a disruption in the CDKN2A locus. The CDKN2A deletion has highly variable boundaries across many cancers. We successfully detect the breakpoint in all cell-lines, even when the region has undergone multiple rearrangements. These results point to the development of a successful protocol for early diagnosis and monitoring of cancer.

Ali Bashir, Qing Lu, Dennis Carson, Benjamin Raphael, Yu-Tsueng Liu, Vineet Bafna
The Multi-State Perfect Phylogeny Problem with Missing and Removable Data: Solutions via Integer-Programming and Chordal Graph Theory

The Multi-State Perfect Phylogeny Problem is an extension of the Binary Perfect Phylogeny Problem, allowing characters to take on more than two states. In this paper we consider three problems that extend the utility of the multi-state perfect phylogeny model: The Missing Data (MD) Problem where some entries in the input are missing and the question is whether (bounded) values for the missing data can be imputed so that the resulting data has a multi-state perfect phylogeny; The Character-Removal (CR) Problem where we want to minimize the number of characters to remove from the data so that the resulting data has a multi-state perfect phylogeny; and The Missing-Data Character-Removal (MDCR) Problem where the input has missing data and we want to impute values for the missing data to minimize the solution to the resulting Character-Removal Problem.

We detail Integer Linear Programming (ILP) solutions to these problems for the special case of three permitted states per character and report on extensive empirical testing of these solutions. Then we develop a general theory to solve the MD problem for an arbitrary number of permitted states, using chordal graph theory and results on minimal triangulation of non-chordal graphs. This establishes new necessary and sufficient conditions for the existence of a perfect phylogeny with (or without) missing data. We implement the general theory using integer linear programming, although other optimization methods are possible. We extensively explore the empirical behavior of the general solution, showing that the methods are very practical for data of size and complexity that is characteristic of many current applications in phylogenetics. Some of the empirical results for the MD problem with an arbitrary number of permitted states are very surprising, suggesting the existence of additional combinatorial structure in multi-state perfect phylogenies.

Dan Gusfield
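
For orientation, the binary problem that the multi-state problem extends has a well-known pairwise characterization: complete binary data admits a perfect phylogeny exactly when no two characters exhibit all four state combinations (the four-gamete condition). The sketch below checks that condition. It covers only the binary case with no missing entries and is not the ILP or chordal-graph method of the paper; the function name is hypothetical.

```python
from itertools import combinations


def has_binary_perfect_phylogeny(matrix):
    """Four-gamete check on a complete binary character matrix.

    `matrix` is a list of rows (taxa), each a sequence of 0/1 states,
    one per character. Returns False as soon as some pair of characters
    shows all four combinations 00, 01, 10 and 11.
    """
    if not matrix:
        return True
    num_chars = len(matrix[0])
    for i, j in combinations(range(num_chars), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            return False
    return True
```

With more than two states per character, a pairwise test of this kind is no longer sufficient, which is part of why the multi-state problems in the abstract call for heavier machinery.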
COE: A General Approach for Efficient Genome-Wide Two-Locus Epistasis Test in Disease Association Study

The availability of high density single nucleotide polymorphism (SNP) data has made genome-wide association studies computationally challenging. Two-locus epistasis (gene-gene interaction) detection has attracted great research interest as a promising method for genetic analysis of complex diseases. In this paper, we propose a general approach, COE, for efficient large scale gene-gene interaction analysis, which supports a wide range of tests. In particular, we show that many commonly used statistics are convex functions. From the observed values of the events in a two-locus association test, we can develop an upper bound of the test value. Such an upper bound only depends on the single-locus test and the genotype of the SNP-pair. We thus group and index SNP-pairs by their genotypes. This indexing structure can benefit the computation of all convex statistics. Utilizing the upper bound and the indexing structure, we can prune most of the SNP-pairs without compromising the optimality of the result. Our approach is especially efficient for large permutation tests. Extensive experiments demonstrate that our approach provides orders of magnitude performance improvement over the brute force approach.

Xiang Zhang, Feng Pan, Yuying Xie, Fei Zou, Wei Wang
Overlapping Pools for High Throughput Targeted Resequencing

Resequencing genomic DNA from pools of individuals is an effective strategy to detect new variants in targeted regions and compare them between cases and controls. There are numerous ways to assign individuals to the pools on which they are to be sequenced. The naïve, disjoint pooling scheme (many individuals to one pool) in predominant use today, offers insight into allele frequencies, but does not offer the identity of an allele carrier. We present a framework for overlapping pool design, where each individual sample is resequenced in several pools (many individuals to many pools). Upon discovering a variant, the set of pools where this variant is observed reveals the identity of its carrier. We formalize the mathematical framework for such pool designs, and list the requirements from such designs. Next, we build on the theory of error-correcting codes to design arrangements that overcome pitfalls of pooled sequencing. Specifically, three practical concerns of low coverage sequencing are investigated: (1) False positives due to errors introduced during amplification and sequencing; (2) False negatives due to undersampling particular alleles aggravated by non-uniform coverage; and consequently (3) Ambiguous identification of individual carriers in the presence of errors. We show that in practical parameters of resequencing studies, our designs guarantee high probability of unambiguous singleton carrier identification, while maintaining the features of naïve pools in terms of sensitivity, specificity, and the ability to estimate allele frequencies. We demonstrate the ability of our designs by extracting rare variations on pooled short read data of 12 individuals from the 1000 Genome Pilot 3 project.

Snehit Prabhu, Itsik Pe’er
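
The carrier-identification idea in the abstract above can be illustrated with a toy design in which every individual is sequenced in a distinct subset of pools, so that the set of pools showing a variant points back to a single carrier. This sketch assumes error-free data and a single carrier per variant; it does not use the error-correcting codes on which the paper builds. The function names and the choice of 6 pools with 2 pools per individual are hypothetical.

```python
from itertools import combinations


def design_pools(num_individuals, num_pools, pools_per_individual):
    """Assign each individual a distinct subset of pools (its signature)."""
    signatures = list(combinations(range(num_pools), pools_per_individual))
    if len(signatures) < num_individuals:
        raise ValueError("not enough distinct pool subsets for all individuals")
    return {ind: frozenset(signatures[ind]) for ind in range(num_individuals)}


def identify_carrier(design, observed_pools):
    """Return the individual whose signature equals the set of pools in
    which the variant was observed, or None if no signature matches."""
    observed = frozenset(observed_pools)
    matches = [ind for ind, sig in design.items() if sig == observed]
    return matches[0] if len(matches) == 1 else None
```

For example, `design_pools(12, 6, 2)` gives each of 12 individuals a unique pair out of 6 pools, whereas a disjoint scheme with the same 6 pools could only narrow a variant down to a pool, not to a person.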
Deep Sequencing of a Genetically Heterogeneous Sample: Local Haplotype Reconstruction and Read Error Correction

We present a computational method for analyzing deep sequencing data obtained from a genetically diverse sample. The set of reads obtained from a deep sequencing experiment represents a statistical sample of the underlying population. We develop a generative probabilistic model for assigning observed reads to unobserved haplotypes in the presence of sequencing errors. This clustering problem is solved in a Bayesian fashion using the Dirichlet process mixture to define a prior distribution on the unknown number of haplotypes in the mixture. We devise a Gibbs sampler for sampling from the joint posterior distribution of haplotype sequences, assignment of reads to haplotypes, and error rate of the sequencing process to obtain estimates of the local haplotype structure of the population. The method is evaluated on simulated data and on experimental deep sequencing data obtained from HIV samples.
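
The role of the Dirichlet process prior can be illustrated with the Chinese restaurant process, the distribution over partitions that a Dirichlet process induces. The sketch below samples only from this prior, showing how the number of clusters (here standing in for haplotypes) is left open rather than fixed in advance. The concentration value `alpha`, the read count, and the function name are assumptions for the example. The read-error likelihood and the Gibbs sampler of the paper are not shown.

```python
import random

def chinese_restaurant_process(n_reads, alpha, rng):
    """Sample a partition of n_reads items from a CRP with concentration alpha."""
    cluster_sizes = []
    assignments = []
    for _ in range(n_reads):
        # An existing cluster k is chosen with weight size_k, a new cluster with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)   # open a new cluster
        else:
            cluster_sizes[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes

rng = random.Random(0)
assignments, sizes = chinese_restaurant_process(200, alpha=1.0, rng=rng)
print(len(sizes), "clusters with sizes", sizes)
```

Larger values of `alpha` favour more clusters a priori. In a full mixture model, the prior weights would be multiplied by the likelihood of each read under each candidate haplotype.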

Osvaldo Zagordi, Lukas Geyrhofer, Volker Roth, Niko Beerenwinkel
Lifting Prediction to Alignment of RNA Pseudoknots

Prediction and alignment of RNA pseudoknot structures are NP-hard. Nevertheless, several efficient prediction algorithms by dynamic programming have been proposed for restricted classes of pseudoknots. We present a general scheme that yields an efficient alignment algorithm for arbitrary such classes. Moreover, we show that such an alignment algorithm benefits from the class restriction in the same way as the corresponding structure prediction algorithm does. We look at five of these classes in greater detail. The time and space complexity of the alignment algorithm is increased by only a linear factor over the respective prediction algorithm. For four of the classes, no efficient alignment algorithms were known. For the fifth, most general class, we improve the previously best complexity of $O(n^5 m^5)$ time to $O(nm^6)$, where $n$ and $m$ denote sequence lengths. Finally, we apply our fastest algorithm with $O(nm^4)$ time and $O(nm^2)$ space to comparative de-novo pseudoknot prediction.

Mathias Möhl, Sebastian Will, Rolf Backofen
Detection of Locally Over-Represented GO Terms in Protein-Protein Interaction Networks

High-throughput methods for identifying protein-protein interactions produce increasingly complex and intricate interaction networks. These networks are extremely rich in information, but extracting biologically meaningful hypotheses from them and representing them in a human-readable manner is challenging. We propose a method to identify Gene Ontology terms that are locally over-represented in a subnetwork of a given biological network. Specifically, we propose two methods to evaluate the degree of clustering of proteins associated to a particular GO term and describe four efficient methods to estimate the statistical significance of the observed clustering. We show, using Monte Carlo simulations, that our best approximation methods accurately estimate the true p-value, for random scale-free graphs as well as for actual yeast and human networks. When applied to these two biological networks, our approach recovers many known complexes and pathways, but also suggests potential functions for many subnetworks.

Mathieu Lavallée-Adam, Benoit Coulombe, Mathieu Blanchette
Protein Fragment Swapping: A Method for Asymmetric, Selective Site-Directed Recombination

This paper presents a new approach to site-directed recombination, swapping combinations of selected discontiguous fragments from a source protein in place of corresponding fragments of a target protein. By being both asymmetric (differentiating source and target) and selective (swapping discontiguous fragments), our method focuses experimental effort on a more restricted portion of sequence space, constructing hybrids that are more likely to have the properties that are the objective of the experiment. Furthermore, since the source and target need to be structurally homologous only locally (rather than overall), our method supports swapping fragments from functionally important regions of a source into a target “scaffold”; e.g., to humanize an exogenous therapeutic protein. A protein fragment swapping plan is defined by the residue position boundaries of the fragments to be swapped; it is assessed by an average potential score over the resulting hybrid library, with singleton and pairwise terms evaluating the importance and fit of the swapped residues. While we prove that it is NP-hard to choose an optimal set of fragments under such a potential score, we develop an integer programming approach, which we call Swagmer, that works very well in practice. We demonstrate the effectiveness of our method in two types of swapping problem: selective recombination between beta-lactamases and activity swapping between glutathione transferases. We show that the selective recombination approach generates a better plan (in terms of resulting potential score) than a traditional site-directed recombination approach. We also show that in both cases the optimized experiment is significantly better than one that would result from stochastic methods.

Wei Zheng, Karl E. Griswold, Chris Bailey-Kellogg
Simultaneous Alignment and Folding of Protein Sequences

Accurate comparative analysis tools for low-homology proteins remain a difficult challenge in computational biology, especially for sequence alignment and consensus folding problems. We present partiFold-Align, the first algorithm for simultaneous alignment and consensus folding of unaligned protein sequences; the algorithm’s complexity is polynomial in time and space. Algorithmically, partiFold-Align exploits sparsity in the set of super-secondary structure pairings and alignment candidates to achieve an effectively cubic running time for simultaneous pairwise alignment and folding. We demonstrate the efficacy of these techniques on transmembrane β-barrel proteins, an important yet difficult class of proteins with few known three-dimensional structures. Testing against structurally derived sequence alignments, partiFold-Align significantly outperforms state-of-the-art pairwise sequence alignment tools in the most difficult low sequence homology case and improves secondary structure prediction where current approaches fail. Importantly, partiFold-Align requires no prior training. These general techniques are widely applicable to many more protein families. partiFold-Align is available at http://partiFold.csail.mit.edu.

Jérôme Waldispühl, Charles W. O’Donnell, Sebastian Will, Srinivas Devadas, Rolf Backofen, Bonnie Berger
Shared Peptides in Mass Spectrometry Based Protein Quantification

In analyzing the proteome using mass spectrometry, the mass values help identify the molecules, and the intensities help quantify them, relative to their abundance in other samples. Peptides that are shared across different protein sequences are typically discarded as being uninformative with respect to each of the parent proteins.

In this paper, we investigate the use of shared peptides which are ubiquitous (~50% of peptides) in mass spectrometric data-sets. In many cases, shared peptides can help compute the relative amounts of different proteins that share the same peptide. Also, proteins with no unique peptide in the sample can still be analyzed for relative abundance. Our paper is the first attempt to use shared peptides in protein quantification, and makes use of combinatorial optimization to reduce the error in relative abundance measurements. We describe the topological and numerical properties required for robust estimates, and use them to improve our estimates for ill-conditioned systems. Extensive simulations validate our approach even in the presence of experimental error. We apply our method to a model of Arabidopsis root knot nematode infection, and elucidate the differential role of many protein family members in mediating host response to the pathogen.

Banu Dost, Nuno Bandeira, Xiangqian Li, Zhouxin Shen, Steve Briggs, Vineet Bafna
Evaluating Between-Pathway Models with Expression Data

Between-Pathway Models (BPMs) are network motifs consisting of pairs of putative redundant pathways. In this paper, we show how adding another source of high-throughput data, microarray gene expression data from knockout experiments, allows us to identify a compensatory functional relationship between genes from the two BPM pathways. We evaluate the quality of the BPMs from four different studies, and we describe how our methods might be extended to refine pathways.

Benjamin J. Hescott, Mark D. M. Leiserson, Lenore J. Cowen, Donna K. Slonim
Sorting Signed Permutations by Inversions in O(n log n) Time

The study of genomic inversions (or reversals) has been a mainstay of computational genomics for nearly 20 years. After the initial breakthrough of Hannenhalli and Pevzner, who gave the first polynomial-time algorithm for sorting signed permutations by inversions, improved algorithms have been designed, culminating with an optimal linear-time algorithm for computing the inversion distance and a subquadratic algorithm for providing a shortest sequence of inversions, also known as sorting by inversions. Remaining open was the question of whether sorting by inversions could be done in $O(n \log n)$ time.

In this paper, we present a qualified answer to this question, by providing two new sorting algorithms, a simple and fast randomized algorithm and a deterministic refinement. The deterministic algorithm runs in time $O(n \log n + kn)$, where $k$ is a data-dependent parameter. We provide the results of extensive experiments showing that both the average and the standard deviation for $k$ are small constants, independent of the size of the permutation. We conclude (but do not prove) that almost all signed permutations can be sorted by inversions in $O(n \log n)$ time.
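
For readers unfamiliar with the operation itself, the sketch below shows what a single inversion does to a signed permutation: it reverses a contiguous segment and flips the sign of every element in it. It illustrates the operation only, not the sorting algorithms of the paper. The function name and the 0-based inclusive indexing are choices made for this example.

```python
def apply_inversion(perm, i, j):
    """Reverse perm[i..j] (inclusive, 0-based) and flip the sign of each element."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]

perm = [1, -3, -2, 4]
print(apply_inversion(perm, 1, 2))  # [1, 2, 3, 4]
```

Here one inversion sorts the permutation into the identity. Sorting by inversions asks for a shortest sequence of such operations.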

Krister M. Swenson, Vaibhav Rajan, Yu Lin, Bernard M. E. Moret
Finding Biologically Accurate Clusterings in Hierarchical Tree Decompositions Using the Variation of Information

Hierarchical clustering is a popular method for grouping together similar elements based on a distance measure between them. In many cases, annotations for some elements are known beforehand, which can aid the clustering process. We present a novel approach for decomposing a hierarchical clustering into the clusters that optimally match a set of known annotations, as measured by the variation of information metric. Our approach is general and does not require the user to enter the number of clusters desired. We apply it to two biological domains: finding protein complexes within protein interaction networks and identifying species within metagenomic DNA samples. For these two applications, we test the quality of our clusters by using them to predict complex and species membership, respectively. We find that our approach generally outperforms the commonly used heuristic methods.
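
The variation of information between two clusterings A and B of the same items is H(A) + H(B) - 2 I(A; B), or equivalently H(A|B) + H(B|A). The sketch below computes it from two label lists. It shows the metric only, not the tree-decomposition procedure of the paper. The function name and the use of natural logarithms are choices made for this example.

```python
from collections import Counter
from math import log

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A), in nats, for two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must cover the same items")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # Each joint cell contributes -p_ab * (log(p_ab / p_a) + log(p_ab / p_b)).
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi

print(variation_of_information([0, 0, 1, 1], [0, 0, 1, 1]))  # identical clusterings
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent clusterings
```

Identical clusterings give a distance of zero. The two independent two-way splits in the second call give 2 ln 2, the sum of their entropies.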

Saket Navlakha, James White, Niranjan Nagarajan, Mihai Pop, Carl Kingsford
Identification and Frequency Estimation of Inversion Polymorphisms from Haplotype Data

Structural rearrangements, including copy-number alterations and inversions, are increasingly recognized as an important contributor to human genetic variation. Copy number variants are readily measured via array-based techniques like comparative genomic hybridization, but copy-neutral variants such as inversion polymorphisms remain difficult to identify without whole genome sequencing. We introduce a method to identify inversion polymorphisms and estimate their frequency in a population using readily available single nucleotide polymorphism (SNP) data. Our method uses a probabilistic model to describe a population as a mixture of forward and inverted chromosomes and identifies putative inversions by characteristic differences in haplotype frequencies around inversion breakpoints. On simulated data, our method accurately predicts inversions with frequencies as low as 25% in the population and reliably estimates inversion frequencies over a wide range. On the human HapMap Phase 2 data, we predict between 88 and 142 inversion polymorphisms with frequency ranging from 20 to 92 percent. Many of these correspond to known inversions or have other evidence supporting them, and the predicted inversion frequencies largely agree with the limited information presently available.

Suzanne S. Sindi, Benjamin J. Raphael
On the Relationship between DNA Periodicity and Local Chromatin Structure

DNA periodicity and its relationship to the formation of nucleosomes has been investigated extensively using autocorrelation and Fourier transform methods. We provide a precise treatment of the mathematical foundation for this type of analysis, and we apply the resulting method to quantify dinucleotide periodicity in several datasets. We begin by demonstrating, via simulation, the sensitivity of our method relative to previous methods. We then provide evidence of pervasive ~10 bp periodicity in S. cerevisiae, with stronger periodicity in sequences associated with positioned nucleosomes. In human, although repeat-masked sequences do not exhibit significant periodicity on average, we find that experimentally determined nucleosome positions show a periodicity of the AA dinucleotide similar to that found in S. cerevisiae. Furthermore, transcription start sites in the human genome are marked by a sharp drop in the 10 bp periodicity of the AA dinucleotide, while occupied CTCF sites are surrounded by a local increase.
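
A crude way to see what dinucleotide periodicity means is to count how often two occurrences of a dinucleotide are separated by a given lag, which is an unnormalized autocorrelation of the occurrence indicator. The sketch below does this on a synthetic sequence built so that AA recurs every 10 bp. It is not the method of the paper, which gives a more careful mathematical treatment. The function names, the synthetic sequence, and the lag range are assumptions for the example.

```python
def dinucleotide_positions(seq, dinuc="AA"):
    """Return the start positions of every occurrence of `dinuc` in `seq`."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]

def lag_counts(seq, dinuc="AA", max_lag=30):
    """Count pairs of `dinuc` occurrences separated by each lag from 1 to max_lag."""
    positions = set(dinucleotide_positions(seq, dinuc))
    return {lag: sum(1 for p in positions if p + lag in positions)
            for lag in range(1, max_lag + 1)}

# Synthetic sequence: "AA" followed by eight non-A bases, repeated 20 times.
seq = "AACGTCGTCG" * 20
counts = lag_counts(seq)
print(max(counts, key=counts.get))  # 10
```

On real sequences the signal is far weaker, so the counts would need normalization against a background expectation before any peak near 10 bp could be interpreted.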

Sheila M. Reynolds, Jeff A. Bilmes, William Stafford Noble
Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep
Extended Abstract

We introduce a new phylogenetic reconstruction algorithm which, unlike most previous rigorous inference techniques, does not rely on assumptions regarding the branch lengths or the depth of the tree. The algorithm returns a forest which is guaranteed to contain all edges that are: 1) sufficiently long and 2) sufficiently close to the leaves. How much of the true tree is recovered depends on the sequence length provided. The algorithm is distance-based and runs in polynomial time.

Constantinos Daskalakis, Elchanan Mossel, Sebastien Roch
Detecting the Presence and Absence of Causal Relationships between Expression of Yeast Genes with Very Few Samples

Inference of biological networks from high-throughput data is a central problem in bioinformatics. Particularly powerful for network reconstruction is data collected by recent studies that contain both genetic variation information and gene expression profiles from genetically distinct strains of an organism. Various statistical approaches have been applied to these data to tease out the underlying biological networks that govern how individual genetic variation mediates gene expression and how genes regulate and interact with each other. Extracting meaningful causal relationships from these networks remains a challenging but important problem. In this paper we use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. We evaluate our method using a well studied dataset consisting of both genetic variation information and gene expressions collected over yeast strains. Our predictions of causal regulators are consistent with previously known experimental evidence. In addition, our method can distinguish between direct and indirect effects of variation on a gene expression level.

Eun Yong Kang, Ilya Shpitser, Chun Ye, Eleazar Eskin
An Adaptive and Memory Efficient Algorithm for Genotype Imputation

Genome wide association studies have proven to be a highly successful method for identification of genetic loci for complex phenotypes in both humans and model organisms. These large scale studies rely on the collection of hundreds of thousands of single nucleotide polymorphisms (SNPs) across the genome. Standard high-throughput genotyping technologies capture only a fraction of the total genetic variation. Recent efforts have shown that it is possible to “impute” with high accuracy the genotypes of SNPs that are not collected in the study provided that they are present in a reference data set which contains both SNPs collected in the study as well as other SNPs. We here introduce a novel HMM based technique to solve the imputation problem that addresses several shortcomings of existing methods. First, our method is adaptive which lets it estimate population genetic parameters from the data and be applied to model organisms that have very different evolutionary histories. Compared to traditional methods, our method is up to ten times more accurate on model organisms such as mouse. Second, our algorithm scales in memory usage in the number of collected markers as opposed to the number of known SNPs. This issue is very relevant due to the size of the reference data sets currently being generated. We compare our method over mouse and human data sets to existing methods and show that each has either comparable or better performance and much lower memory usage. The method is available for download at http://genetics.cs.ucla.edu/eminim.

Hyun Min Kang, Noah A. Zaitlen, Buhm Han, Eleazar Eskin
A Statistical Framework for the Functional Analysis of Metagenomes

Metagenomic studies consider the genetic makeup of microbial communities as a whole, rather than their individual member organisms. The functional and metabolic potential of microbial communities can be analyzed by comparing the relative abundance of gene families in their collective genomic sequences (metagenome) under different conditions. Such comparisons require accurate estimation of gene family frequencies. We present a statistical framework for assessing these frequencies based on the Lander-Waterman theory developed originally for Whole Genome Shotgun (WGS) sequencing projects. We also provide a novel method for assessing the reliability of the estimations which can be used for removing seemingly unreliable measurements. We tested our method on a wide range of datasets, including simulated genomes and real WGS data from sequencing projects of whole genomes. Results suggest that our framework corrects inherent biases in accepted methods and provides a good approximation to the true statistics of gene families in WGS projects.

Itai Sharon, Amrita Pati, Victor M. Markowitz, Ron Y. Pinter
Learning Models for Aligning Protein Sequences with Predicted Secondary Structure

Accurately aligning distant protein sequences is notoriously difficult. A recent approach to improving alignment accuracy is to use additional information such as predicted secondary structure. We introduce several new models for scoring alignments of protein sequences with predicted secondary structure, which use the predictions and their confidences to modify both the substitution and gap cost functions. We present efficient algorithms for computing optimal pairwise alignments under these models, all of which run in near-quadratic time. We also review an approach to learning the values of the parameters in these models called inverse alignment. We then evaluate the accuracy of these models by studying how well an optimal alignment under the model recovers known benchmark reference alignments. Our experiments show that using parameters learned by inverse alignment, these new secondary-structure-based models provide a significant improvement in alignment accuracy for distant sequences. The best model improves upon the accuracy of the standard sequence alignment model for pairwise alignment by as much as 15% for sequences with less than 25% identity, and improves the accuracy of multiple alignment by 20% for difficult benchmarks whose average accuracy under standard tools is less than 40%.

Eagu Kim, Travis Wheeler, John Kececioglu
Backmatter
Metadata
Title
Research in Computational Molecular Biology
Editor
Serafim Batzoglou
Copyright Year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-02008-7
Print ISBN
978-3-642-02007-0
DOI
https://doi.org/10.1007/978-3-642-02008-7