
2008 | Book

Algorithms in Bioinformatics

8th International Workshop, WABI 2008, Karlsruhe, Germany, September 15-19, 2008. Proceedings

Edited by: Keith A. Crandall, Jens Lagergren

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 8th International Workshop on Algorithms in Bioinformatics, WABI 2008, held in Karlsruhe, Germany, in September 2008 as part of the ALGO 2008 meeting. The 32 revised full papers, presented together with the abstract of a keynote talk, were carefully reviewed and selected from 81 submissions. All current issues of algorithms in bioinformatics are addressed, ranging from mathematical tools to experimental studies of approximation algorithms and reports on significant computational analyses. The topics range in biological applicability from genome mapping to sequence assembly, microarray quality, phylogenetic inference, and molecular modeling.

Table of Contents

Frontmatter
Multichromosomal Genome Median and Halving Problems

Genome median and halving problems aim at reconstructing ancestral genomes and the evolutionary events leading from the ancestor to extant species. The NP-hardness of the breakpoint distance and reversal distance median problems for unichromosomal genomes does not easily imply the same for the multichromosomal case. Here we settle the complexity of several genome median and halving problems, including a surprising polynomial result for the breakpoint median and guided halving problems in genomes with circular and linear chromosomes; the multichromosomal problem is thus easier than the unichromosomal one.

Eric Tannier, Chunfang Zheng, David Sankoff
A Branch-and-Bound Method for the Multichromosomal Reversal Median Problem

The ordering of genes in a genome can be changed through rearrangement events such as reversals, transpositions and translocations. Since these rearrangements are “rare events”, they can be used to infer deep evolutionary histories. One important problem in rearrangement analysis is to find the median genome of three given genomes, i.e., the genome that minimizes the sum of the pairwise genomic distances between it and each of the three given genomes. To date, MGR is the most commonly used tool for multichromosomal genomes. However, experimental evidence indicates that it leads to worse trees than an optimal median-solver, at least on unichromosomal genomes. In this paper, we present a new branch-and-bound method that provides an exact solution to the multichromosomal reversal median problem. We develop tight lower bounds and improve the enumeration procedure such that the search can be performed efficiently. Our extensive experiments on simulated datasets show that this median solver is efficient, has speed comparable to MGR, and is more accurate when genomes become distant.

Meng Zhang, William Arndt, Jijun Tang
Decompositions of Multiple Breakpoint Graphs and Rapid Exact Solutions to the Median Problem

The median genome problem reduces to a search for the vertex matching in the multiple breakpoint graph (MBG) that maximizes the number of alternating colour cycles formed with the matchings representing the given genomes. We describe a class of “adequate” subgraphs of MBGs that allow a decomposition of an MBG into smaller, more easily solved graphs. We enumerate all of these graphs up to a certain size and incorporate the search for them into an exhaustive algorithm for the median problem. This enables a dramatic speedup in most randomly generated instances with hundreds or even thousands of vertices, as long as the ratio of genome rearrangements to genome size is not too large.

Andrew Wei Xu, David Sankoff
Read Mapping Algorithms for Single Molecule Sequencing Data

Single Molecule Sequencing technologies such as the Heliscope simplify the preparation of DNA for sequencing while sampling millions of reads in a day. At the same time, the technology suffers from a significantly higher error rate, ameliorated by the ability to sample multiple reads from the same location. In this paper we develop novel rapid alignment algorithms for two-pass Single Molecule Sequencing methods. We combine the Weighted Sequence Graph (WSG) representation of all optimal and near-optimal alignments between the two reads sampled from a piece of DNA with k-mer filtering methods and spaced seeds to quickly generate candidate locations for the reads on the reference genome. We also propose a fast implementation of the Smith-Waterman algorithm using vectorized instructions that significantly speeds up the matching process. Our method combines these approaches to build an algorithm that is both fast and accurate, since it is able to take complete advantage of both of the reads sampled during two-pass sequencing.

Vladimir Yanovsky, Stephen M. Rumble, Michael Brudno
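As background for the filtering step described in this abstract, the following sketch shows how a spaced seed can generate candidate read locations on a reference genome. The seed pattern and sequences are invented toy data, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): spaced-seed filtering to
# generate candidate mapping locations. '1' positions must match, '0'
# positions are wildcards. Seed pattern and sequences are made up.
from collections import defaultdict

SEED = "110101"

def seed_key(s, pattern=SEED):
    # Keep only the characters at the '1' positions of the pattern.
    return "".join(c for c, p in zip(s, pattern) if p == "1")

def index_reference(ref, pattern=SEED):
    # Hash every seed-length window of the reference by its spaced key.
    index = defaultdict(list)
    w = len(pattern)
    for i in range(len(ref) - w + 1):
        index[seed_key(ref[i:i + w], pattern)].append(i)
    return index

def candidate_hits(read, index, pattern=SEED):
    # Every shared spaced key suggests a putative read start position.
    w = len(pattern)
    hits = set()
    for j in range(len(read) - w + 1):
        for pos in index.get(seed_key(read[j:j + w], pattern), []):
            hits.add(pos - j)
    return sorted(hits)

ref = "ACGTACGGACGTTACG"
read = "ACGGACGT"
print(candidate_hits(read, index_reference(ref)))
```

Each candidate location would then be verified by a full (e.g. vectorized Smith-Waterman) alignment, as the abstract describes.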
Exact Transcriptome Reconstruction from Short Sequence Reads

In this paper we address the problem of characterizing the RNA complement of a given cell type, that is, the set of RNA species and their relative copy number, from a large set of short sequence reads which have been randomly sampled from the cell’s RNA sequences through a sequencing experiment. We refer to this problem as the transcriptome reconstruction problem, and we specifically investigate, both theoretically and practically, the conditions under which the problem can be solved. We demonstrate that, even under the assumption of exact information, neither single read nor paired-end read sequences guarantee theoretically that the reconstruction problem has a unique solution. However, by investigating the behavior of the best annotated human gene set, we also show that, in practice, paired-end reads – but not single reads – may be sufficient to resolve the species and abundances of the vast majority of transcript variants. We finally show that, when we assume that the RNA species existing in the cell are known, single read sequences can effectively be used to infer transcript variant abundances.

Vincent Lacroix, Michael Sammeth, Roderic Guigo, Anne Bergeron
Post-Hybridization Quality Measures for Oligos in Genome-Wide Microarray Experiments

High-throughput microarray experiments produce vast amounts of data. Quality control methods for every step of such experiments are essential to ensure a high biological significance of the conclusions drawn from the data. This issue has been addressed for most steps of the typical microarray pipeline, but the quality of the oligonucleotide probes designed for microarrays has only been evaluated based on their a priori properties, such as sequence length or melting temperature predictions. We introduce new oligo quality measures that can be calculated using expression values collected in direct as well as indirect design experiments. Based on these measures, we propose combined oligo quality scores as a tool for assessing probe quality, optimizing array designs and data normalization strategies. We use simulated as well as biological data sets to evaluate these new quality scores. We show that the presented quality scores reliably identify high-quality probes. The set of best-quality probes converges with increasing number of arrays used for the calculation, and the measures are robust with respect to the chosen normalization method.

Florian Battke, Carsten Müller-Tidow, Hubert Serve, Kay Nieselt
NAPX: A Polynomial Time Approximation Scheme for the Noah’s Ark Problem

The Noah’s Ark Problem (NAP) is an NP-hard optimization problem with relevance to ecological conservation management. It asks to maximize the phylogenetic diversity (PD) of a set of taxa given a fixed budget, where each taxon is associated with a cost of conservation and a probability of extinction. NAP has received renewed interest with the rise in availability of genetic sequence data, allowing PD to be used as a practical measure of biodiversity. However, only simplified instances of the problem, where one or more parameters are fixed as constants, have as yet been addressed in the literature. We present NAPX, the first algorithm for the general version of NAP that returns a 1 − ε approximation of the optimal solution. It runs in $O\left(\frac{n B^2 h^2 \log^2 n}{\log^2(1 - \epsilon)}\right)$ time, where n is the number of species, B is the total budget, and h is the height of the input tree. We also provide improved bounds for its expected running time.

Glenn Hickey, Paz Carmi, Anil Maheshwari, Norbert Zeh
Minimum Common String Partition Parameterized

Minimum Common String Partition (MCSP) and related problems are of interest in, e.g., comparative genomics, DNA fingerprint assembly, and ortholog assignment. Given two strings with equal symbol content, the problem is to partition one string into k blocks, with k as small as possible, and to permute them so as to obtain the other string. MCSP is NP-hard, and only approximation algorithms are known. Here we show that MCSP is fixed-parameter tractable in suitable parameters, so that practical instances can be efficiently solved to optimality.

Peter Damaschke
Hardness and Approximability of the Inverse Scope Problem

For a given metabolic network, we address the problem of determining the minimum cardinality set of substrate compounds necessary for synthesizing a set of target metabolites, called the inverse scope problem. We define three variants of the inverse scope problem whose solutions may indicate minimal nutritional requirements that must be met to ensure sustenance of an organism, with or without some side products. Here, we show that the inverse scope problems are NP-hard on general graphs and directed acyclic graphs (DAGs). Moreover, we show that the general inverse scope problem cannot be approximated within n^{1/2 − ε} for any constant ε > 0 unless P = NP. Our results have direct implications for identifying the biosynthetic capabilities of a given organism and for designing biochemical experiments.

Zoran Nikoloski, Sergio Grimbs, Joachim Selbig, Oliver Ebenhöh
Rapid Neighbour-Joining

The neighbour-joining method reconstructs phylogenies by iteratively joining pairs of nodes until a single node remains. The criterion for which pair of nodes to merge is based on both the distance between the pair and the average distance to the rest of the nodes. In this paper, we present a new search strategy for the optimisation criterion used for selecting the next pair to merge, and we show empirically that the new search strategy is superior to other state-of-the-art neighbour-joining implementations.

Martin Simonsen, Thomas Mailund, Christian N. S. Pedersen
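For context, the classical neighbour-joining selection criterion that such search strategies optimize can be sketched as follows. The distance matrix is a made-up example, and this naive quadratic scan is exactly the step that faster search strategies improve upon.

```python
# Minimal sketch of the classical neighbour-joining Q-criterion
# (not the paper's accelerated search). Distance matrix is invented.
def q_value(D, i, j, totals, n):
    # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    return (n - 2) * D[i][j] - totals[i] - totals[j]

def best_pair(D):
    # Naive O(n^2) scan for the pair minimizing the Q-criterion.
    n = len(D)
    totals = [sum(row) for row in D]
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: q_value(D, p[0], p[1], totals, n))

# Toy symmetric distance matrix over four taxa.
D = [[0, 5, 9, 9],
     [5, 0, 10, 10],
     [9, 10, 0, 8],
     [9, 10, 8, 0]]
print(best_pair(D))  # the pair joined in the next iteration
```

A full neighbour-joining run would repeatedly apply this selection, merge the chosen pair into a new node, and update the distance matrix until one node remains.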
Efficiently Computing Arbitrarily-Sized Robinson-Foulds Distance Matrices

In this paper, we introduce the HashRF(p, q) algorithm for computing RF matrices of large binary evolutionary tree collections. The novelty of our algorithm is that it can be used to compute arbitrarily-sized (p × q) RF matrices without running into physical memory limitations. We explore the performance of our HashRF(p, q) approach on two collections from a Bayesian analysis: 20,000 trees of 150 taxa and 33,306 trees of 567 taxa. When computing the all-to-all RF matrix, HashRF(p, q) is up to 200 times faster than PAUP* and around 40% faster than HashRF, one of the fastest all-to-all RF algorithms. We show an application of our approach by clustering large RF matrices to improve the resolution rate of consensus trees, a popular approach used by biologists to summarize the results of their phylogenetic analysis. Thus, our HashRF(p, q) algorithm provides scientists with a fast and efficient alternative for understanding the evolutionary relationships among a set of trees.

Seung-Jin Sul, Grant Brammer, Tiffani L. Williams
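For reference, the Robinson-Foulds distance itself can be computed naively as the size of the symmetric difference of the two trees' non-trivial bipartitions. This toy sketch (invented trees, no hashing) only illustrates the quantity that HashRF-style algorithms compute at scale.

```python
# Naive Robinson-Foulds distance over nested-tuple trees (toy sketch,
# not HashRF). Leaf names and tree shapes below are invented.
def bipartitions(tree, all_leaves):
    parts = set()
    def leaves(t):
        if isinstance(t, str):
            return frozenset([t])
        s = frozenset().union(*(leaves(c) for c in t))
        # Record non-trivial bipartitions, canonicalized by choosing
        # the lexicographically smaller side.
        if 1 < len(s) < len(all_leaves) - 1:
            parts.add(min(s, frozenset(all_leaves) - s, key=sorted))
        return s
    leaves(tree)
    return parts

def rf_distance(t1, t2, leaf_set):
    b1, b2 = bipartitions(t1, leaf_set), bipartitions(t2, leaf_set)
    return len(b1 ^ b2)  # symmetric difference of bipartition sets

leaf_set = {"A", "B", "C", "D", "E"}
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2, leaf_set))
```

An all-to-all RF matrix over thousands of trees repeats this pairwise computation quadratically many times, which is why hashing bipartitions and partitioning the matrix, as in HashRF(p, q), matters.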
Efficient Genome Wide Tagging by Reduction to SAT

Whole genome association has recently demonstrated some remarkable successes in identifying loci involved in disease. Designing these studies involves selecting a subset of known single nucleotide polymorphisms (SNPs) or tag SNPs to be genotyped. The problem of choosing tag SNPs is an active area of research and is usually formulated such that the goal is to select the smallest number of tag SNPs which “cover” the remaining SNPs, where “cover” is defined by some statistical criterion. Since the standard formulation of the tag SNP selection problem is NP-hard, most algorithms for selecting tag SNPs are either heuristics which do not guarantee selection of the minimal set of tag SNPs or exhaustive algorithms which are computationally impractical. In this paper, we present a set of methods which guarantee discovering the minimal set of tag SNPs, yet in practice are much faster than traditional exhaustive algorithms. We demonstrate that our methods can be applied to discover minimal tag sets for the entire human genome. Our method converts the instance of the tag SNP selection problem to an instance of the satisfiability problem, encoding the instance into conjunctive normal form (CNF). We take advantage of the local structure inherent in human variation, as well as progress in knowledge compilation, and convert our CNF encoding into a representation known as DNNF, from which solutions to our original problem can be easily enumerated. We demonstrate our methods by constructing the optimal tag set for the whole genome and show that we significantly outperform previous exhaustive search-based methods. We also present optimal solutions for the problem of selecting multi-marker tags, in which some SNPs are “covered” by a pair of tag SNPs. Multi-marker tags can significantly decrease the number of tags we need to select; however, discovering the minimal number of multi-marker tags is much more difficult. We evaluate our methods and perform benchmark comparisons to other methods by choosing tag sets using the HapMap data.

Arthur Choi, Noah Zaitlen, Buhm Han, Knot Pipatsrisawat, Adnan Darwiche, Eleazar Eskin
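To illustrate the flavor of the encoding (though not the paper's actual DNNF-based compilation), here is a toy set-cover formulation with one CNF covering clause per SNP; the coverage sets are invented, and a brute-force search stands in for the knowledge-compilation step.

```python
# Toy sketch of tag SNP selection as set cover with CNF covering
# clauses (not the paper's DNNF solver). Coverage sets are invented.
from itertools import combinations

# covers[t] = set of SNPs that tag t covers (each tag covers itself).
covers = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
snps = {1, 2, 3, 4}

# One CNF clause per SNP: a disjunction of the tags covering it.
cnf = [sorted(t for t, c in covers.items() if s in c)
       for s in sorted(snps)]

def min_tag_set(cnf, tags):
    # Brute force in place of DNNF enumeration: smallest tag subset
    # satisfying every covering clause.
    for k in range(1, len(tags) + 1):
        for subset in combinations(sorted(tags), k):
            if all(any(t in subset for t in clause) for clause in cnf):
                return set(subset)

print(cnf)
print(min_tag_set(cnf, set(covers)))
```

The paper's contribution is precisely avoiding this exponential enumeration: compiling the CNF into DNNF lets minimal solutions be read off efficiently.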
Computing the Minimal Tiling Path from a Physical Map by Integer Linear Programming

We study the problem of selecting the minimal tiling path (MTP) from a set of clones arranged in a physical map. We formulate the constraints of the MTP problem in a graph theoretical framework, and we derive an optimization problem that is solved via integer linear programming. Experimental results show that when we compare our algorithm to the commonly used software FPC, the MTP produced by our method covers a higher portion of the genome, even using a smaller number of MTP clones. These results suggest that employing the MTP produced by our method instead of FPC’s in a clone-by-clone sequencing project would reduce the sequencing cost by about 12%.

Serdar Bozdag, Timothy J Close, Stefano Lonardi
An Efficient Lagrangian Relaxation for the Contact Map Overlap Problem

Among the measures for quantifying the similarity between protein 3-D structures, contact map overlap (CMO) maximization has received sustained attention during the past decade. Despite this large involvement, the known algorithms offer modest performance and are not applicable to large-scale comparison.

This paper offers a clear advance in this respect. We present a new integer programming model for CMO and propose an exact branch-and-bound algorithm with bounds obtained by a novel Lagrangian relaxation. The efficiency of the approach is demonstrated on a popular small benchmark (the Skolnick set, 40 domains). On this set our algorithm significantly outperforms the best existing exact algorithms, and many hard CMO instances have been solved for the first time. To further assess our approach, we constructed a large-scale set of 300 protein domains. Computing the similarity measure for each of the 44,850 pairs, we obtained a classification in excellent agreement with SCOP.

Rumen Andonov, Nicola Yanev, Noël Malod-Dognin
A Faster Algorithm for RNA Co-folding

The current pairwise RNA (secondary) structural alignment algorithms are based on Sankoff’s dynamic programming algorithm from 1985. Sankoff’s algorithm requires O(N^6) time and O(N^4) space, where N denotes the length of the compared sequences, and thus its applicability is very limited. The current literature offers many heuristics for speeding up Sankoff’s alignment process, some making restrictive assumptions on the length or the shape of the RNA substructures. We show how to speed up Sankoff’s algorithm in practice via non-heuristic methods, without compromising optimality. Our analysis shows that the expected time complexity of the new algorithm is O(N^4 ζ(N)), where ζ(N) converges to O(N), assuming a standard polymer folding model supported by experimental analysis. Hence our algorithm speeds up Sankoff’s algorithm by a linear factor on average. In simulations, our algorithm speeds up computation by a factor of 3-12 for sequences of length 25-250.

Availability:

Code and data sets are available upon request.

Michal Ziv-Ukelson, Irit Gat-Viks, Ydo Wexler, Ron Shamir
An Automated Combination of Kernels for Predicting Protein Subcellular Localization

Protein subcellular localization is a crucial ingredient to many important inferences about cellular processes, including prediction of protein function and protein interactions. While many predictive computational tools have been proposed, they tend to have complicated architectures and require many design decisions from the developer.

Here we utilize the multiclass support vector machine (m-SVM) method to directly solve protein subcellular localization without resorting to the common approach of splitting the problem into several binary classification problems. We further propose a general class of protein sequence kernels which considers all motifs, including motifs with gaps. Instead of heuristically selecting one or a few kernels from this family, we utilize a recent extension of SVMs that optimizes over multiple kernels simultaneously. This way, we automatically search over families of possible amino acid motifs.

We compare our automated approach to three other predictors on four different datasets, and show that we perform better than the current state of the art. Further, our method provides some insights as to which sequence motifs are most useful for determining subcellular localization, which are in agreement with biological reasoning. Data files, kernel matrices and open source software are available at http://www.fml.mpg.de/raetsch/projects/protsubloc.

Cheng Soon Ong, Alexander Zien
Fast Target Set Reduction for Large-Scale Protein Function Prediction: A Multi-class Multi-label Machine Learning Approach

Large-scale sequencing projects have led to a vast amount of protein sequences, which have to be assigned to functional categories. Currently, profile hidden Markov models and kernel-based machine learning methods provide the most accurate results for protein classification. However, the prediction of new sequences with these approaches is computationally expensive. We present an approach for fast scoring of protein sequences by means of feature-based protein sequence representation and multi-class multi-label machine learning techniques. Using the Pfam database, we show that our method provides high computational efficiency and that the approach is well suited for pre-filtering of large sequence sets.

Thomas Lingner, Peter Meinicke
Multiple Instance Learning Allows MHC Class II Epitope Predictions Across Alleles

Human adaptive immune response relies on the recognition of short peptides through proteins of the major histocompatibility complex (MHC). MHC class II molecules are responsible for the recognition of antigens external to a cell. Understanding their specificity is an important step in the design of peptide-based vaccines. The high degree of polymorphism in MHC class II makes the prediction of peptides that bind (and then usually cause an immune response) a challenging task. Typically, these predictions rely on machine learning methods, thus a sufficient amount of data points is required. Due to the scarcity of data, currently there are reliable prediction models only for about 7% of all known alleles available.

We show how to transform the problem of MHC class II binding peptide prediction into a well-studied machine learning problem called multiple instance learning. For alleles with sufficient data, we show how to build a well-performing predictor using standard kernels for multiple instance learning. Furthermore, we introduce a new method for training a classifier for an allele without requiring binding peptide data for that target allele. Instead, we use binding peptide data from other alleles and similarities between the structures of the MHC class II alleles to guide the learning process. This allows, for the first time, the construction of predictors for about two thirds of all known MHC class II alleles. The average performance of these predictors on 14 test alleles is 0.71, measured as area under the ROC curve.

Availability:

The methods are integrated into the EpiToolKit framework, for which a webserver is available at http://www.epitoolkit.org/mhciimulti

Nico Pfeifer, Oliver Kohlbacher
An Algorithm for Orienting Graphs Based on Cause-Effect Pairs and Its Applications to Orienting Protein Networks

We consider a graph orientation problem arising in the study of biological networks. Given an undirected graph and a list of ordered source-target pairs, the goal is to orient the graph so that a maximum number of pairs will admit a directed path from the source to the target. We show that the problem is NP-hard and hard to approximate to within a constant ratio. We then study restrictions of the problem to various graph classes, and provide an O(log n) approximation algorithm for the general case. We show that this algorithm achieves very tight approximation ratios in practice and is able to infer edge directions with high accuracy on both simulated and real network data.

Alexander Medvedovsky, Vineet Bafna, Uri Zwick, Roded Sharan
Enumerating Precursor Sets of Target Metabolites in a Metabolic Network

We present the first exact method based on the topology of a metabolic network to find minimal sets of metabolites (called precursors) sufficient to produce a set of target metabolites. In contrast with previous proposals, our model takes into account self-regenerating metabolites involved in cycles, which may be used to generate target metabolites from potential precursors. We analyse the complexity of the problem and we propose an algorithm to enumerate all minimal precursor sets for a set of target metabolites. The algorithm can be applied to identify a minimal medium necessary for a cell to ensure some metabolic functions. It can be used also to check inconsistencies caused by misannotations in a metabolic network. We present two illustrations of these applications.

Ludovic Cottret, Paulo Vieira Milreu, Vicente Acuña, Alberto Marchetti-Spaccamela, Fábio Viduani Martinez, Marie-France Sagot, Leen Stougie
Boosting the Performance of Inference Algorithms for Transcriptional Regulatory Networks Using a Phylogenetic Approach

Inferring transcriptional regulatory networks from gene-expression data remains a challenging problem, in part because of the noisy nature of the data and the lack of strong network models. Time-series expression data have shown promise, and recent work by Babu on the evolution of regulatory networks in E. coli and S. cerevisiae opened another avenue of investigation. In this paper we take the evolutionary approach one step further, by developing ML-based refinement algorithms that take advantage of established phylogenetic relationships among a group of related organisms, and of a simple evolutionary model for regulatory networks, to improve the inference of these networks for these organisms from expression data gathered under similar conditions.

We use simulations with different methods for generating gene-expression data, different phylogenies, and different evolutionary rates, and use different network inference algorithms, to study the performance of our algorithmic boosters. The results of simulations (including various tests to exclude confounding factors) demonstrate clear and significant improvements (in both specificity and sensitivity) on the performance of current inference algorithms. Thus gene-expression studies across a range of related organisms could yield significantly more accurate regulatory networks than single-organism studies.

Xiuwei Zhang, Bernard M. E. Moret
Fast Bayesian Haplotype Inference Via Context Tree Weighting

We present a new Bayesian method for inferring haplotypes for unphased genotypes. The method can be viewed as a unification of some ideas of variable-order Markov chain modelling and ensemble learning that so far have been implemented only separately in some of the state-of-the-art methods. Specifically, we make use of the Context Tree Weighting algorithm to efficiently compute the posterior probability of any given haplotype assignment; we employ a simulated annealing scheme to rapidly find several local optima of the posterior; and we sketch a full Bayesian analogue, in which a weighted sample of haplotype assignments is drawn to summarize the posterior distribution. We also show that one can minimize in linear time the average switch distance, a popular measure of phasing accuracy, to a given (weighted) sample of haplotype assignments. We demonstrate empirically that the presented method typically performs as well as the leading fast haplotype inference methods, and sometimes better. The methods are freely available in a computer program, BACH (Bayesian Context-based Haplotyping).

Pasi Rastas, Jussi Kollin, Mikko Koivisto
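The switch distance mentioned in this abstract can be illustrated with a small sketch. The phasings below are made-up examples, and this computes only the pairwise distance between two phasings, not the paper's linear-time minimization over a weighted sample.

```python
# Toy sketch of the switch distance between two phasings of the same
# genotype: the number of heterozygous positions where the phase must
# be flipped to turn one haplotype pair into the other.
def switch_distance(phase1, phase2):
    # phase1/phase2: haplotype pairs as (hap_a, hap_b) strings of 0/1.
    (a1, b1), (a2, b2) = phase1, phase2
    # Orientation at each heterozygous site: does haplotype a of the
    # first phasing agree with haplotype a of the second?
    orient = [a1[i] == a2[i] for i in range(len(a1)) if a1[i] != b1[i]]
    # A switch occurs wherever the orientation flips between
    # consecutive heterozygous sites.
    return sum(orient[i] != orient[i + 1] for i in range(len(orient) - 1))

p1 = ("0101", "1010")  # all four sites heterozygous
p2 = ("0110", "1001")  # phase flips after the second site
print(switch_distance(p1, p2))
```

Averaging this quantity over a weighted sample of haplotype assignments gives the objective that the paper shows can be minimized in linear time.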
Genotype Sequence Segmentation: Handling Constraints and Noise

Recombination plays an important role in shaping the genetic variations present in current-day populations. We consider populations evolved from a small number of founders, where each individual’s genomic sequence is composed of segments from the founders. We study the problem of segmenting the genotype sequences into the minimum number of segments attributable to the founder sequences. The minimum segmentation can be used for inferring the relationship among sequences to identify the genetic basis of traits, which is important for disease association studies. We propose two dynamic programming algorithms which can solve the minimum segmentation problem in polynomial time. Our algorithms incorporate biological constraints to greatly reduce the computation, and guarantee that only minimum segmentation solutions with comparable numbers of segments on both haplotypes of the genotype sequence are computed. Our algorithms can also work on noisy data including genotyping errors, point mutations, gene conversions, and missing values.

Qi Zhang, Wei Wang, Leonard McMillan, Jan Prins, Fernando Pardo-Manuel de Villena, David Threadgill
Constructing Phylogenetic Supernetworks from Quartets

In phylogenetics it is common practice to summarize collections of partial phylogenetic trees in the form of supertrees. Recently it has been proposed to construct phylogenetic supernetworks as an alternative to supertrees, as these allow the representation of conflicting information in the trees, information that may not be representable in a single tree. Here we introduce SuperQ, a new method for constructing such supernetworks. It works by breaking the input trees into quartet trees, and stitching together the resulting set to form a network. The stitching process is performed using an adaptation of the QNet method for phylogenetic network reconstruction. In addition to presenting the new method, we illustrate the applicability of SuperQ to three data sets and discuss future directions for testing and development.

Stefan Grünewald, Andreas Spillner, Kristoffer Forslund, Vincent Moulton
Summarizing Multiple Gene Trees Using Cluster Networks

The result of a multiple gene tree analysis is usually a number of different tree topologies that are each supported by a significant proportion of the genes. We introduce the concept of a cluster network that can be used to combine such trees into a single rooted network, which can be drawn either as a cladogram or phylogram. In contrast to split networks, which can grow exponentially in the size of the input, cluster networks grow only quadratically. A cluster network is easily computed using a modification of the tree-popping algorithm, which we call network-popping. The approach has been implemented as part of the Dendroscope tree-drawing program and its application is illustrated using data and results from three recent studies on large numbers of gene trees.

Daniel H. Huson, Regula Rupp
Fast and Adaptive Variable Order Markov Chain Construction

Variable order Markov chains (VOMCs) are a flexible class of models that extend the well-known Markov chains. They have been applied to a variety of problems in computational biology, e.g. protein family classification. A linear time and space construction algorithm was published in 2000 by Apostolico and Bejerano. However, neither a report of the actual running time nor an implementation has been published since. In this paper we use the lazy suffix tree and the enhanced suffix array to improve upon the algorithm of Apostolico and Bejerano. We introduce new software that is orders of magnitude faster than current tools for building VOMCs, and is suitable for large-scale sequence analysis.

Marcel H. Schulz, David Weese, Tobias Rausch, Andreas Döring, Knut Reinert, Martin Vingron
Computing Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

Heuristic sequence alignment and database search algorithms, such as PatternHunter and BLAST, are based on the initial discovery of so-called alignment seeds of well-conserved alignment patterns, which are subsequently extended to full local alignments. In recent years, the theory of classical seeds (matching contiguous q-grams) has been extended to spaced seeds, which allow mismatches within a seed, and subsequently to indel seeds, which allow gaps in the underlying alignment.

Different seeds within a given class of seeds are usually compared by their sensitivity, that is, the probability of matching an alignment generated from a particular probabilistic alignment model.

We present a flexible, exact, unifying framework called probabilistic arithmetic automaton for seed sensitivity computation that includes all previous results on spaced and indel seeds. In addition, we can easily incorporate sets of arbitrary seeds. Instead of only computing the probability of at least one hit (the standard definition of sensitivity), we can optionally provide the entire distribution of overlapping or non-overlapping seed hits, which yields a different characterization of a seed. A symbolic representation allows fast computation for any set of parameters.

Inke Herms, Sven Rahmann
The Relation between Indel Length and Functional Divergence: A Formal Study

Although insertions and deletions (indels) are a common type of evolutionary sequence variation, their origins and their functional consequences have not been comprehensively understood. There is evidence that, on one hand, classical alignment procedures only roughly reflect the evolutionary processes and, on the other hand, that indels cause structural changes in the proteins’ surfaces.

We first demonstrate how to identify alignment gaps that have been introduced by evolution to a statistically significant degree, by means of a novel, sound statistical framework based on pair hidden Markov models (HMMs). Second, we examine paralogous protein pairs in E. coli, obtained by computation of classical global alignments. Distinguishing between indel and non-indel pairs according to our novel statistics revealed that, despite having the same sequence identity, indel pairs are significantly less functionally similar than non-indel pairs, as measured by recently suggested GO-based functional distances. This suggests that indels cause more severe functional changes than other types of sequence variation, and that indel statistics should additionally be taken into account to assess functional similarity between paralogous protein pairs.

Raheleh Salari, Alexander Schönhuth, Fereydoun Hormozdiari, Artem Cherkasov, S. Cenk Sahinalp
Detecting Repeat Families in Incompletely Sequenced Genomes

Repeats form a major class of genomic sequence, with implications for functional genomics and for practical problems. Their detection and analysis pose a number of challenges in genomic sequence analysis, especially if the genome is not completely sequenced. The most abundant and evolutionarily active repeats occur in the form of families of long, similar sequences. We present a novel method for repeat family detection and characterization in cases where the target genome sequence is not completely known. To this end, we first construct the sequence graph, a compacted version of a sparse de Bruijn graph. By appropriately analyzing the structure of this graph and of its connected components after local modifications, we devise two algorithms for repeat family detection. The applicability of the methods is shown for both simulated and real genomic data sets.
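As a toy illustration of the underlying idea (not the authors' sequence-graph construction), k-mers occurring more than once in the available sequence data correspond to high-multiplicity nodes of a de Bruijn-style graph and are natural repeat candidates. The function names below are hypothetical:

```python
from collections import Counter

def kmer_multiplicities(seq, k):
    """Count every k-mer occurrence; the distinct k-mers would form the
    nodes of a (sparse) de Bruijn-style graph over the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeat_kmers(seq, k):
    """K-mers seen more than once: candidate members of repeat families.
    In graph terms, these are the nodes traversed by multiple walks."""
    return {km for km, c in kmer_multiplicities(seq, k).items() if c > 1}
```

A real method must additionally group such k-mers into families via the graph's connected components, which is where the structural analysis described in the abstract comes in.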

José Augusto Amgarten Quitzau, Jens Stoye
Novel Phylogenetic Network Inference by Combining Maximum Likelihood and Hidden Markov Models (Extended Abstract)

Horizontal Gene Transfer (HGT) is the event of transferring genetic material from one lineage in the evolutionary tree to a different lineage. HGT plays a major role in bacterial genome diversification and is a significant mechanism by which bacteria develop resistance to antibiotics. Although the prevailing assumption is of complete HGT, cases of partial HGT (also named chimeric HGT), where only part of a gene is horizontally transferred, have also been reported, albeit less frequently.

In this work we suggest a new probabilistic model for analyzing and modeling phylogenetic networks, the NET-HMM. This new model captures the biologically realistic assumption that neighboring sites of DNA or amino acid sequences are not independent, which increases the accuracy of the inference. The model describes the phylogenetic network as a hidden Markov model (HMM), where each hidden state corresponds to one of the network's trees. One advantage of the NET-HMM is its ability to infer partial as well as complete HGT. We describe the properties of the NET-HMM, devise efficient algorithms for solving a set of problems related to it, and implement them in software. We also provide a novel complementary significance test for evaluating the fit of the model (NET-HMM) to a given data set.

Using the NET-HMM we are able to answer interesting biological questions, such as inferring the lengths of partial HGTs and the affected nucleotides in the genomic sequences, as well as the exact locations of HGT events along the tree branches. These advantages are demonstrated through the analysis of synthetic inputs and two different biological inputs.

Sagi Snir, Tamir Tuller
A Local Move Set for Protein Folding in Triangular Lattice Models

The HP model is one of the most popular discretized models for the protein folding problem, i.e., for computationally predicting the three-dimensional structure of a protein from its amino acid sequence. This model considers the interactions between hydrophobic amino acids to be the driving force in the folding process. Thus, it distinguishes between polar and hydrophobic amino acids only and asks for an embedding of the amino acid sequence into a rectangular grid lattice which maximizes the number of neighboring pairs (contacts) of hydrophobic amino acids in the lattice.
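The HP objective described above is easy to state in code. The minimal sketch below (the name `hp_contacts` is hypothetical, and it uses the plain 2D square lattice rather than the triangular lattice studied in the paper) counts the hydrophobic contacts of a given chain embedding:

```python
def hp_contacts(sequence, coords):
    """Count H-H contacts of an embedding on the 2D square lattice:
    pairs of hydrophobic ('H') residues that are non-consecutive along
    the chain but occupy neighboring lattice sites."""
    assert len(sequence) == len(coords)
    contacts = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # i+2 skips chain neighbors
            if sequence[i] == sequence[j] == 'H':
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:  # lattice-adjacent
                    contacts += 1
    return contacts
```

The folding problem is to choose a self-avoiding walk `coords` that maximizes this count; a triangular lattice simply changes which coordinate pairs count as adjacent.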

In this paper, we consider an HP-like model which uses a more appropriate grid structure, namely the 2D triangular grid and the face-centered cubic lattice in 3D. We consider a local-search approach for finding an optimal embedding. For defining the local-search neighborhood, we design a move set, the so-called pull moves, and prove its reversibility and completeness. We then use these moves in a tabu search algorithm which is experimentally shown to reach optimum energy configurations and to improve the current best results for several sequences in 2D and 3D.

Hans-Joachim Böckenhauer, Abu Zafer M. Dayem Ullah, Leonidas Kapsokalivas, Kathleen Steinhöfel
Protein Decoy Generation Using Branch and Bound with Efficient Bounding

We propose a new discrete protein structure model (using a modified face-centered cubic lattice). A novel branch and bound algorithm for finding global minimum structures in this model is suggested. The objective energy function is very simple, as it depends only on the predicted half-sphere exposure numbers of Cα-atoms. Bounding and branching also exploit predicted secondary structures and the expected radius of gyration. The algorithm is fast and is able to generate the decoy set in less than 48 hours on all proteins tested.

Despite the simplicity of the model and the energy function, many of the lowest-energy structures, using exact measures, are near the native structures (in terms of RMSD). As expected, when using predicted measures the fraction of good decoys decreases, but in all cases tested we obtained structures within 6 Å RMSD in a set of low-energy decoys. To the best of our knowledge, this is the first de novo branch and bound algorithm for protein decoy generation that depends only on such one-dimensional predictable measures. Another important advantage of the branch and bound approach is that the algorithm searches the entire conformational space. In contrast to search heuristics such as Monte Carlo simulation or tabu search, the problem of escaping local minima is indirectly solved by the branch and bound algorithm when good lower bounds can be obtained.

Martin Paluszewski, Pawel Winter
Backmatter
Metadata
Title
Algorithms in Bioinformatics
Edited by
Keith A. Crandall
Jens Lagergren
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-87361-7
Print ISBN
978-3-540-87360-0
DOI
https://doi.org/10.1007/978-3-540-87361-7
