
2012 | Book

Algorithms in Bioinformatics

12th International Workshop, WABI 2012, Ljubljana, Slovenia, September 10-12, 2012. Proceedings

Edited by: Ben Raphael, Jijun Tang

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this Book

This book constitutes the refereed proceedings of the 12th International Workshop on Algorithms in Bioinformatics, WABI 2012, held in Ljubljana, Slovenia, in September 2012. WABI 2012 is one of six workshops that, together with the European Symposium on Algorithms (ESA), constitute the annual ALGO meeting; it focuses on algorithmic advances in bioinformatics, computational biology, and systems biology, with a particular emphasis on discrete algorithms and machine-learning methods that address important problems in molecular biology. The 35 full papers presented were carefully reviewed and selected from 92 submissions. They include algorithms for a variety of biological problems, including phylogeny, DNA and RNA sequencing and analysis, and protein structure.

Table of Contents

Frontmatter
Preserving Inversion Phylogeny Reconstruction

Tractability results are rare in the comparison of gene orders for more than two genomes. Here we present a linear-time algorithm for the small parsimony problem (inferring ancestral genomes given a phylogeny on an arbitrary number of genomes) in the case where gene orders are permutations that evolve by inversions not breaking common gene intervals, and these intervals are organised in a linear structure. We present two examples where this allows us to reconstruct the ancestral gene orders in phylogenies of several γ-Proteobacteria species and Burkholderia strains, respectively. We prove in addition that the large parsimony problem (where the phylogeny is output) remains NP-complete.

Matthias Bernt, Kun-Mao Chao, Jyun-Wei Kao, Martin Middendorf, Eric Tannier
Fast Phylogenetic Tree Reconstruction Using Locality-Sensitive Hashing

We present the first sub-quadratic time algorithm that with high probability correctly reconstructs phylogenetic trees for short sequences generated by a Markov model of evolution. Due to the rapid expansion of sequence databases, such very fast algorithms are necessary. Other fast heuristics have been developed for building trees from large alignments [18,1], but they lack theoretical performance guarantees. Our new algorithm runs in $O(n^{1+\gamma(g)} \log^2 n)$ time, where $\gamma$ is an increasing function of an upper bound on the branch lengths in the phylogeny, the upper bound $g$ must be below $1/2-\sqrt{1/8} \approx 0.15$, and $\gamma(g) < 1$ for all $g$. For phylogenies with very short branches, the running time of our algorithm is near-linear. For example, if all branches have mutation probability less than 0.02, the running time of our algorithm is roughly $O(n^{1.2} \log^2 n)$. Our preliminary experiments show that many large phylogenies can be reconstructed more accurately than allowed by current methods, in comparable running times.

Daniel G. Brown, Jakub Truszkowski
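
The abstract above does not spell out its hashing scheme. As a hedged illustration of the general locality-sensitive hashing idea it builds on (bucketing sequences by the characters at a few randomly sampled positions, so that similar sequences tend to collide), here is a minimal Python sketch; the function name, parameters, and toy data are illustrative, not from the paper.

```python
import random
from collections import defaultdict

def lsh_buckets(seqs, k=4, seed=0):
    """Bucket equal-length sequences by the characters at k randomly
    sampled positions; similar sequences tend to land in the same bucket."""
    rng = random.Random(seed)
    n = len(next(iter(seqs.values())))
    idx = sorted(rng.sample(range(n), k))   # positions shared by all sequences
    buckets = defaultdict(list)
    for name, s in seqs.items():
        key = "".join(s[i] for i in idx)    # the LSH "signature"
        buckets[key].append(name)
    return dict(buckets)
```

Sequences falling in the same bucket are candidate neighbors; a real reconstruction algorithm would repeat this with many independent position samples.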
Efficient Computation of Popular Phylogenetic Tree Measures

Given a phylogenetic tree $\mathcal{T}$ of $n$ nodes, and a sample $R$ of its tips (leaf nodes), a very common problem in ecological and evolutionary research is to evaluate a distance measure for the elements in $R$. Two of the most common measures of this kind are the Mean Pairwise Distance ($\mathrm{MPD}$) and the Phylogenetic Diversity ($\mathrm{PD}$). In many applications, it is often necessary to compute the expectation and standard deviation of one of these measures over all subsets of tips of $\mathcal{T}$ that have a certain size. Unfortunately, existing methods to calculate the expectation and deviation of these measures are inexact and inefficient.

We present analytical expressions that lead to efficient algorithms for computing the expectation and the standard deviation of the MPD and the PD. More specifically, our main contributions are:

1. We present efficient algorithms for computing the expectation and the standard deviation of the MPD exactly, in $\Theta(n)$ time.

2. We provide a $\Theta(n)$ time algorithm for computing approximately the expectation of the PD, and an $O(n^2)$ time algorithm for computing approximately the standard deviation of the PD. We also describe the major computational obstacles that hinder the exact calculation of these quantities.

We also describe $O(n)$ time algorithms for evaluating the MPD and PD given a single sample of tips. Having implemented all the presented algorithms, we assess their efficiency experimentally, using as a point of reference a standard software package for processing phylogenetic trees.

Constantinos Tsirogiannis, Brody Sandel, Dimitris Cheliotis
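
For readers unfamiliar with the measures involved, the MPD of a sample of tips can be computed naively in quadratic time as below (the paper's contribution is computing its expectation and deviation in linear time). The toy tree, its representation, and all names here are illustrative only.

```python
from itertools import combinations

# Toy rooted tree: node -> (parent, length of branch to parent).
parent = {"A": ("X", 1.0), "B": ("X", 1.0),
          "C": ("Y", 2.0), "X": ("Y", 1.0), "Y": (None, 0.0)}

def path_to_root(v):
    """Distances from v to each of its ancestors (including itself)."""
    dist, d = {}, 0.0
    while v is not None:
        dist[v] = d
        p, length = parent[v]
        d += length
        v = p
    return dist

def leaf_distance(u, v):
    """Path distance between two tips, routed through their LCA."""
    du, dv = path_to_root(u), path_to_root(v)
    return min(du[a] + dv[a] for a in du if a in dv)

def mpd(tips):
    """Mean Pairwise Distance over a sample of tips."""
    pairs = list(combinations(tips, 2))
    return sum(leaf_distance(u, v) for u, v in pairs) / len(pairs)
```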
SibJoin: A Fast Heuristic for Half-Sibling Reconstruction

Kinship inference is the task of identifying genealogically related individuals. Questions of kinship are important for determining mating structures, particularly in endangered populations. Although many solutions exist for reconstructing full-sibling relationships, few exist for half-siblings. We present SibJoin, a clustering heuristic based on Mendelian genetics, which is reasonably accurate and thousands of times faster than existing algorithms. We also identify issues with partition distance, the traditional method for assessing the quality of estimated sibship partitionings. We prefer an information-theoretic alternative called variation of information, which takes into account the degree to which misplaced individuals harm sibship structures.

Daniel G. Brown, Daniel Dexter
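
The variation of information mentioned above has a standard definition, $VI(X,Y) = H(X) + H(Y) - 2I(X;Y)$, where $H$ is entropy and $I$ mutual information over the two partitions. A small Python sketch (the label-list representation and names are my own, not the paper's):

```python
from math import log
from collections import Counter

def variation_of_information(p, q):
    """VI between two clusterings, given as label lists over the same items:
    VI = H(P) + H(Q) - 2 I(P;Q). Zero iff the partitions are identical."""
    n = len(p)
    px, qy = Counter(p), Counter(q)
    joint = Counter(zip(p, q))
    # Sum of the two marginal entropies.
    h = -sum((c / n) * log(c / n) for c in px.values())
    h -= sum((c / n) * log(c / n) for c in qy.values())
    # Mutual information from the joint distribution.
    mi = sum((nxy / n) * log(nxy * n / (px[x] * qy[y]))
             for (x, y), nxy in joint.items())
    return h - 2 * mi
```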
Reconstructing the Evolution of Molecular Interaction Networks under the DMC and Link Dynamics Models

Molecular interaction networks have emerged as a powerful data source for answering a plethora of biological questions ranging from how cells make decisions to how species evolve. The availability of such data from multiple organisms allows for their analysis from an evolutionary perspective. Indeed, work has emerged recently on network alignment, ancestral network reconstruction, and phylogenetic inference based on networks.

In this paper, we address two central issues in the area of evolutionary analysis of molecular interaction networks, namely (1) correcting genetic distances derived from observed differences between networks, and (2) reconstructing ancestral networks from extant ones. We address both issues computationally under the link dynamics and duplication-mutation with complementarity (DMC) evolutionary models. We demonstrate the utility and accuracy of our methods on biological and simulated data.

Yun Zhu, Luay Nakhleh
Estimating Population Size via Line Graph Reconstruction

We propose a novel graph-theoretic method to estimate haplotype population size from genotype data. The method considers only the potential sharing of haplotypes between individuals, and is based on transforming the graph of potential haplotype sharing into a line graph using a minimum number of edge and vertex deletions. We show that the problems are NP-complete and provide exact integer programming solutions for them. We test our approach using extensive simulations of multiple population evolution and genotype sampling scenarios. Our computational experiments show that when most of the potential sharings are true sharings the problem can be solved very fast and the estimated size is very close to the true size; when many of the potential sharings do not stem from true haplotype sharing, our method gives reasonable lower bounds on the underlying number of haplotypes. In comparison, a naive approach of phasing the input genotypes provides trivial upper bounds of twice the number of genotypes.

Bjarni V. Halldórsson, Dima Blokh, Roded Sharan
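
The line graph transformation at the heart of the method is standard: each edge of the sharing graph becomes a node, and two such nodes are adjacent iff the original edges share an endpoint. A minimal sketch (the representation is illustrative; the paper's minimum edge/vertex-deletion ILP is not reproduced here):

```python
from itertools import combinations

def line_graph(edges):
    """Line graph of an undirected graph: one node per input edge,
    two nodes adjacent iff the corresponding edges share an endpoint."""
    nodes = [frozenset(e) for e in edges]
    adj = {tuple(sorted(n)): set() for n in nodes}
    for e, f in combinations(nodes, 2):
        if e & f:  # the two edges share a vertex
            a, b = tuple(sorted(e)), tuple(sorted(f))
            adj[a].add(b)
            adj[b].add(a)
    return adj
```

For example, the line graph of the path 1-2-3-4 is itself a path on the three edges.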
Extracting Conflict-Free Information from Multi-labeled Trees

A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious. We define the information content of a MUL-tree $T$ as the set of all conflict-free quartet topologies implied by $T$, and define the maximally reduced form of $T$ as the smallest tree that can be obtained from $T$ by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content have the same maximally reduced form. This introduces an equivalence relation among MUL-trees, with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form, and evaluate its performance on empirical datasets in terms of both the quality of the reduced tree and the degree of data reduction achieved.

Akshay Deepak, David Fernández-Baca, Michelle M. McMahon
Reducing Problems in Unrooted Tree Compatibility to Restricted Triangulations of Intersection Graphs

The compatibility problem is the problem of determining whether a set of unrooted trees is compatible, i.e., whether there is a supertree that represents all of the trees in the set. This fundamental problem in phylogenetics is NP-complete but fixed-parameter tractable in the number of trees. Recently, Vakati and Fernández-Baca showed how to efficiently reduce the compatibility problem to determining if a specific type of constrained triangulation exists for a non-chordal graph derived from the input trees, mirroring a classic result by Buneman for the closely related Perfect-Phylogeny problem. In this paper, we show a different way of efficiently reducing the compatibility problem to that of determining if another type of constrained triangulation exists for a new non-chordal intersection graph. Beyond its conceptual contribution, such reductions are desirable because of the extensive and continuing literature on graph triangulations, which has been exploited to create algorithms that are efficient in practice for a variety of Perfect-Phylogeny problems. Our reduction allows us to frame the compatibility problem as a minimal triangulation problem (in particular, as a chordal graph sandwich problem) and to frame a maximization variant of the compatibility problem as a minimal triangulation problem.

Rob Gysel, Kristian Stevens, Dan Gusfield
An Optimal Reconciliation Algorithm for Gene Trees with Polytomies

Reconciliation is a method widely used to infer the evolutionary relationship between the members of a gene family. It consists of comparing a gene tree with a species tree, and interpreting the incongruence between the two trees as evidence of duplication and loss. In the case of binary rooted trees, linear-time algorithms have been developed for the duplication, loss, and mutation (duplication + loss) costs. However, a strict prerequisite to reconciliation is to have a gene tree free from error, as a few misplaced edges may lead to a completely different result in terms of the number and position of inferred duplications and losses. How should the weak edges be handled? One reasonable answer is to transform the binary gene tree into a non-binary tree by removing each weak edge and collapsing its two incident vertices into one. The created polytomies are “apparent” as they do not reflect a true simultaneous divergence of many copies from a common ancestor, but rather a lack of resolution. In this paper, we consider the problem of reconciling a non-binary rooted gene tree $G$ with a binary rooted species tree $S$, where polytomies of $G$ are assumed to be apparent. We give a linear-time algorithm that infers a reconciliation of minimum mutation cost between a binary refinement of a polytomy and $S$, improving on the best known result, which is cubic. This implies a straightforward generalization to a gene tree $G$ with nodes of arbitrary degree that runs in $O(|S||G|)$ time, which is shown to be an optimal algorithm.

Manuel Lafond, Krister M. Swenson, Nadia El-Mabrouk
Accounting for Gene Tree Uncertainties Improves Gene Trees and Reconciliation Inference

We propose a reconciliation heuristic accounting for gene duplications, losses, and horizontal transfers that specifically takes into account the uncertainties in the gene tree. Rearrangements are tried for gene tree edges that are weakly supported, and are accepted whenever they improve the reconciliation cost. We prove useful properties of the dynamic programming matrix used to compute reconciliations, which allows us to speed up the tree-space exploration when rearrangements are generated by Nearest Neighbor Interchange (NNI) edit operations. Experimental results on simulated and real data confirm that running times are greatly reduced when this optimization is used, in comparison to the naïve rearrangement procedure. Results also show that gene trees modified by such NNI rearrangements are closer to the correct (simulated) trees and lead to more correct event predictions on average. The program is available at http://www.atgc-montpellier.fr/Mowgli/

Thi Hau Nguyen, Jean-Philippe Doyon, Stéphanie Pointet, Anne-Muriel Arigon Chifolleau, Vincent Ranwez, Vincent Berry
RNA Tree Comparisons via Unrooted Unordered Alignments

We generalize some current approaches for RNA tree alignment, which are traditionally confined to ordered rooted mappings, to also consider unordered unrooted mappings. We define the Homeomorphic Subtree Alignment problem, and present a new algorithm which applies to several modes, including global or local, ordered or unordered, and rooted or unrooted tree alignments. Our algorithm generalizes previous algorithms that either solved the problem in an asymmetric manner, or were restricted to the rooted and/or ordered cases. Focusing here on the most general unrooted unordered case, we show that our algorithm has an $O(n_T n_S \min(d_T, d_S))$ time complexity, where $n_T$ and $n_S$ are the numbers of nodes and $d_T$ and $d_S$ are the maximum node degrees in the input trees $T$ and $S$, respectively. This maintains (and slightly improves) the time complexity of previous, less general algorithms for the problem.

Supplemental materials, source code, and a web interface for our tool can be found at http://www.cs.bgu.ac.il/~negevcb/FRUUT.

Nimrod Milo, Shay Zakov, Erez Katzenelson, Eitan Bachmat, Yefim Dinitz, Michal Ziv-Ukelson
Tree Decomposition and Parameterized Algorithms for RNA Structure-Sequence Alignment Including Tertiary Interactions and Pseudoknots
(Extended Abstract)

We present a general setting for structure-sequence comparison in a large class of RNA structures that unifies and generalizes a number of recent works on specific families of structures. Our approach is based on a tree decomposition of structures, and gives rise to a general parameterized algorithm with complexity in $\mathcal{O}(N\cdot m^t)$, where $N$ (resp. $m$) is the structure (resp. sequence) length, and the exponent $t$ depends on the family of structures. For each family considered by previous approaches, our contribution specializes into an algorithm whose complexity either matches or outperforms previous solutions.

Philippe Rinaudo, Yann Ponty, Dominique Barth, Alain Denise
δ-TRIMAX: Extracting Triclusters and Analysing Coregulation in Time Series Gene Expression Data

In an attempt to analyse coexpression in a time-series microarray gene expression dataset, we introduce a novel, fast triclustering algorithm, δ-TRIMAX, that aims to find groups of genes that are coexpressed over a subset of samples across a subset of time points. We define a novel mean-squared residue score for such a 3D dataset. The algorithm first uses a greedy approach to find triclusters whose mean-squared residue score is below a threshold δ by deleting nodes from the dataset, and then adds some nodes back, keeping the mean-squared residue score of the resultant tricluster below δ. The goal of our algorithm is thus to find large and coherent triclusters in the 3D gene expression dataset. Additionally, we define an affirmation score to measure the performance of our triclustering algorithm on an artificial dataset. To show the biological significance of the triclusters, we conducted GO enrichment analysis. We also performed enrichment analysis of transcription factor binding sites to establish coregulation of a group of coexpressed genes.

Anirban Bhar, Martin Haubrock, Anirban Mukhopadhyay, Ujjwal Maulik, Sanghamitra Bandyopadhyay, Edgar Wingender
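
The paper defines its own mean-squared residue score, which is not reproduced here. As a hedged illustration, one common 3D extension of the Cheng-Church residue (entry minus its gene, sample, and time-point slice means, plus twice the grand mean) can be computed like this; the representation and all names are illustrative:

```python
def msr(data):
    """Mean-squared residue of a 3D tricluster (genes x samples x times),
    a common 3D extension of the Cheng-Church residue; zero for data that
    is perfectly additive in the three indices."""
    G, S, T = len(data), len(data[0]), len(data[0][0])
    m_all = sum(data[i][j][k] for i in range(G)
                for j in range(S) for k in range(T)) / (G * S * T)
    m_g = [sum(data[i][j][k] for j in range(S) for k in range(T)) / (S * T)
           for i in range(G)]
    m_s = [sum(data[i][j][k] for i in range(G) for k in range(T)) / (G * T)
           for j in range(S)]
    m_t = [sum(data[i][j][k] for i in range(G) for j in range(S)) / (G * S)
           for k in range(T)]
    total = 0.0
    for i in range(G):
        for j in range(S):
            for k in range(T):
                r = data[i][j][k] - m_g[i] - m_s[j] - m_t[k] + 2 * m_all
                total += r * r
    return total / (G * S * T)
```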
CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population

The recently developed RNA-Seq technology provides a high-throughput and reasonably accurate way to analyze the transcriptomic landscape of a tissue. Unfortunately, from a computational perspective, identification and quantification of a gene’s isoforms from RNA-Seq data remains a non-trivial problem. We propose CLIIQ, a novel computational method for identification and quantification of expressed isoforms from multiple samples in a population. Motivated by ideas from the compressed sensing literature, CLIIQ is based on an integer linear programming formulation for identifying and quantifying “the most parsimonious” set of isoforms. We show through simulations that, on a single sample, CLIIQ provides better results in isoform identification and quantification than alternative popular tools. More importantly, CLIIQ has an option to jointly analyze multiple samples, which significantly outperforms other tools in both isoform identification and quantification.

Yen-Yi Lin, Phuong Dao, Faraz Hach, Marzieh Bakhshi, Fan Mo, Anna Lapuk, Colin Collins, S. Cenk Sahinalp
Improved Lower Bounds on the Compatibility of Quartets, Triplets, and Multi-state Characters

We study a long-standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: there exists a function $f(r)$ such that, for any set $C$ of $r$-state characters, $C$ is compatible if and only if every subset of $f(r)$ characters of $C$ is compatible. We show that for every $r \ge 2$, there exists an incompatible set $C$ of $\lfloor\frac{r}{2}\rfloor\cdot\lceil\frac{r}{2}\rceil + 1$ $r$-state characters such that every proper subset of $C$ is compatible. Thus, $f(r) \ge \lfloor\frac{r}{2}\rfloor\cdot\lceil\frac{r}{2}\rceil + 1$ for every $r \ge 2$. This improves the previous lower bound of $f(r) \ge r$ given by Meacham (1983), and generalizes the construction showing that $f(4) \ge 5$ given by Habib and To (2011). We prove our result via a result on quartet compatibility that may be of independent interest: for every integer $n \ge 4$, there exists an incompatible set $Q$ of $\lfloor\frac{n-2}{2}\rfloor\cdot\lceil\frac{n-2}{2}\rceil + 1$ quartets over $n$ labels such that every proper subset of $Q$ is compatible. We contrast this with a result on the compatibility of triplets: for every $n \ge 3$, if $R$ is an incompatible set of more than $n-1$ triplets over $n$ labels, then some proper subset of $R$ is incompatible. We show this bound is tight by exhibiting, for every $n \ge 3$, a set $R$ of $n-1$ triplets over $n$ taxa such that $R$ is incompatible, but every proper subset of $R$ is compatible.

Brad Shutters, Sudheer Vakati, David Fernández-Baca
Succinct Multibit Tree: Compact Representation of Multibit Trees by Using Succinct Data Structures in Chemical Fingerprint Searches

Similarity searches in databases of chemical fingerprints are a fundamental task in discovering novel drug-like molecules. The multibit tree is a data structure that enables fast similarity searches of chemical fingerprints (Kristensen et al., WABI’09). A standard pointer-based representation of multibit trees consumes a large amount of memory to index large-scale fingerprint databases. To make matters worse, the original fingerprint databases need to be kept in memory to filter out false positives. A succinct data structure is compact and enables fast operations. Many succinct data structures have been proposed thus far, and have been applied to fields such as full-text indexing and genome mapping. We present compact representations of both multibit trees and fingerprint databases by applying these data structures. Experiments revealed that memory usage in our representations was much smaller than that of the standard pointer-based representation. Moreover, our representations enabled us to efficiently perform PubChem-scale similarity searches.

Yasuo Tabei
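
The similarity searches in question are typically Tanimoto (Jaccard) searches over fingerprint bit vectors. A naive linear scan, which structures like multibit trees accelerate by pruning candidates via upper bounds, can be sketched as follows (names, the set-of-bits representation, and the threshold are all illustrative):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each given as a set of 'on' bit positions."""
    inter = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - inter
    return inter / union if union else 1.0

def search(db, query, threshold=0.6):
    """Naive linear-scan similarity search over a fingerprint database;
    a multibit tree prunes most of this scan without changing the answer."""
    return [name for name, fp in db.items()
            if tanimoto(fp, query) >= threshold]
```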
Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes

Popular sequence alignment tools such as BWA convert a reference genome to an indexing data structure based on the Burrows-Wheeler Transform (BWT), from which matches to individual query sequences can be rapidly determined. However, the utility of also indexing the query sequences themselves remains relatively unexplored.

Here we show that an all-against-all comparison of two sequence collections can be computed from the BWT of each collection with the BWTs held entirely in external memory, i.e. on disk and not in RAM. As an application of this technique, we show that BWTs of transcriptomic and genomic reads can be compared to obtain reference-free predictions of splice junctions that have high overlap with results from more standard reference-based methods.

Code to construct and compare the BWT of large genomic data sets is available at http://beetl.github.com/BEETL/ as part of the BEETL library.

Anthony J. Cox, Tobias Jakobi, Giovanna Rosone, Ole B. Schulz-Trieglaff
Succinct de Bruijn Graphs

We propose a new succinct de Bruijn graph representation. If the de Bruijn graph of $k$-mers in a DNA sequence of length $N$ has $m$ edges, it can be represented in $4m + o(m)$ bits. This is much smaller than existing representations. The numbers of outgoing and incoming edges of a node are computed in constant time, and the outgoing and incoming edge with a given label are found in constant time and $\mathcal{O}(k)$ time, respectively. The data structure is constructed in $\mathcal{O}(Nk \log m/\log\log m)$ time using no additional space.

Alexander Bowe, Taku Onodera, Kunihiko Sadakane, Tetsuo Shibuya
Space-Efficient and Exact de Bruijn Graph Representation Based on a Bloom Filter

The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥ 30 GB).

We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. An assembly program implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.

Rayan Chikhi, Guillaume Rizk
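
The core idea, storing the k-mer set in a Bloom filter and walking the graph by querying the four possible successors of a node, can be sketched as follows. This is an illustrative toy, not Minia's implementation; in particular, the additional structure that removes critical false positives is omitted, and the hash choice and sizes are arbitrary.

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _hashes(self, item):
        for i in range(self.k):
            h = int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
            yield h % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(item))

def neighbors(bloom, kmer):
    """Candidate successors of a de Bruijn graph node; false positives
    would have to be removed by an additional exact structure."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in bloom]
```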
From de Bruijn Graphs to Rectangle Graphs for Genome Assembly

Jigsaw puzzles were originally constructed by painting a picture on a rectangular piece of wood and cutting it into smaller pieces with a jigsaw. The Jigsaw Puzzle Problem is to find an arrangement of these pieces that fills up the rectangle in such a way that neighboring pieces have “matching” boundaries with respect to color and texture. While the general Jigsaw Puzzle Problem is NP-complete [6], we discuss a simpler version (called the Rectangle Puzzle Problem) and study the rectangle graphs, recently introduced by Bankevich et al., 2012 [3], for assembling such puzzles. We establish the connection between the Rectangle Puzzle Problem and the problem of assembling genomes from read-pairs, and further extend the analysis in [3] to real challenges encountered in applications of rectangle graphs to genome assembly. We demonstrate that addressing these challenges results in an assembler, SPAdes+, that improves on existing assembly algorithms in the case of bacterial genomes (including the particularly difficult case of genome assembly from single cells).

SPAdes+ is freely available from http://bioinf.spbau.ru/spades.

Nikolay Vyahhi, Alex Pyshkin, Son Pham, Pavel A. Pevzner
MORPH-PRO: A Novel Algorithm and Web Server for Protein Morphing

Proteins are known to be dynamic in nature, changing from one conformation to another while performing vital cellular tasks. It is important to understand these movements in order to better understand protein function. At the same time, experimental techniques provide us with only single snapshots of the whole ensemble of available conformations. Computational protein morphing provides a visualization of a protein structure transitioning from one conformation to another by producing a series of intermediate conformations. We present a novel, efficient morphing algorithm, Morph-Pro, based on linear interpolation. We also show that, apart from visualization, morphing can be used to provide plausible intermediate structures. We test intermediate structures constructed by our algorithm for a protein kinase and evaluate these structures in a virtual docking experiment. The structures are shown to dock with higher scores to known ligands than structures solved using X-ray crystallography.

Natalie E. Castellana, Andrey Lushnikov, Piotr Rotkiewicz, Natasha Sefcovic, Pavel A. Pevzner, Adam Godzik, Kira Vyatkina
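
Linear interpolation between two conformations, the starting point of such a morphing algorithm, is straightforward to sketch. This toy version only interpolates coordinates; a real morphing method such as the one described above must additionally restore plausible local geometry (bond lengths, clashes) at every intermediate step, which is omitted here.

```python
def morph(start, end, steps):
    """Linearly interpolate each atom coordinate between two conformations,
    given as equal-length lists of (x, y, z) tuples. Returns steps+1 frames,
    from the start conformation to the end conformation."""
    frames = []
    for s in range(steps + 1):
        t = s / steps
        frames.append([tuple(a + t * (b - a) for a, b in zip(p, q))
                       for p, q in zip(start, end)])
    return frames
```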
How Accurately Can We Model Protein Structures with Dihedral Angles?

Previous studies show that each type of bond length and angle fits a Gaussian distribution well, with a small standard deviation, on high-resolution protein structure data. The mean values of these Gaussian distributions have been widely used as ideal bond lengths and angles in bioinformatics. However, we are not aware of any work that evaluates how accurately protein structures can be modeled with dihedral angles and ideal bond lengths and angles.

In this paper, we first introduce the protein structure idealization problem. We then develop a fast $O(nm/\epsilon)$ dynamic programming algorithm to find an approximately optimal idealized protein backbone structure according to our scoring function. Consequently, we demonstrate that idealized backbone structures always exist with small changes and significantly better free energy. We also apply our algorithm to refine protein pseudo-structures determined in NMR experiments.

Xuefeng Cui, Shuai Cheng Li, Dongbo Bu, Babak Alipanahi Ramandi, Ming Li
Resolving Spatial Inconsistencies in Chromosome Conformation Data

We introduce a new method for filtering noisy 3C interactions that selects subsets of interactions that obey metric constraints of various strictness. We demonstrate that, although the problem is computationally hard, near-optimal results are often attainable in practice using well-designed heuristics and approximation algorithms. Further, we show that, compared with a standard technique, this metric filtering approach leads to (a) subgraphs with higher total statistical significance, (b) lower embedding error, (c) lower sensitivity to initial conditions of the embedding algorithm, and (d) structures with better agreement with light microscopy measurements.

Geet Duggal, Rob Patro, Emre Sefer, Hao Wang, Darya Filippova, Samir Khuller, Carl Kingsford
MS-DPR: An Algorithm for Computing Statistical Significance of Spectral Identifications of Non-linear Peptides

While non-linear peptide natural products such as Vancomycin and Daptomycin are among the most effective antibiotics, computational techniques for sequencing such peptides are still in their infancy. Previous methods for sequencing peptide natural products are based on Nuclear Magnetic Resonance spectroscopy, and require large amounts (milligrams) of purified materials. Recently, the development of mass spectrometry based methods has enabled accurate sequencing of non-linear peptidic natural products using picograms of material, but the question of evaluating the statistical significance of Peptide Spectrum Matches (PSMs) for these peptides remains open. Moreover, it is unclear how to decide whether a given spectrum is produced by a linear, cyclic, or branch-cyclic peptide. Surprisingly, all previous mass spectrometry studies overlooked the fact that a very similar problem was successfully addressed in particle physics in 1951. In this paper we develop a method for estimating the statistical significance of PSMs defined by non-linear peptides, which makes it possible to identify whether a peptide is linear, cyclic, or branch-cyclic, an important step toward the identification of peptidic natural products.

Hosein Mohimani, Sangtae Kim, Pavel A. Pevzner
FinIS: Improved in silico Finishing Using an Exact Quadratic Programming Formulation

With the increased democratization of sequencing, the reliance of sequence assembly programs on heuristics is at odds with the need for black-box assembly solutions that can be used reliably by non-specialists. In this work, we present a formal definition of in silico assembly validation and finishing, and explore the feasibility of an exact solution to this problem using quadratic programming (FinIS). Based on results for several real and simulated datasets, we demonstrate that FinIS validates the correctness of a larger fraction of the assembly than existing ad hoc tools. Using a test for unique optimal solutions, we show that FinIS improves on both precision and recall for the correctness of assembled sequences, compared to competing programs. Source code and executables for FinIS are freely available at http://sourceforge.net/projects/finis/.

Song Gao, Denis Bertrand, Niranjan Nagarajan
Lightweight LCP Construction for Next-Generation Sequencing Datasets

The advent of “next-generation” DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix array (LCP) of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP of a string have been described in the literature, but require the presence in RAM of large data structures. This prevents such methods from being feasible for NGS datasets.

In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP and BWT of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole genome sequencing experiments.

Markus J. Bauer, Anthony J. Cox, Giovanna Rosone, Marinella Sciortino
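
For reference, the BWT and LCP array that the paper computes at scale can be defined naively as below (quadratic-time rotation sorting and suffix comparison; the paper's point is precisely that at NGS scale this must instead be done with lightweight sequential scans). Names and the `$` sentinel convention are illustrative.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' is the sentinel)."""
    t = text + "$"
    rots = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rots)

def lcp_array(text):
    """LCP array: longest common prefix of each pair of adjacent suffixes
    in sorted order (naive O(n^2) construction for illustration)."""
    t = text + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    def lcp(a, b):
        n = 0
        while a + n < len(t) and b + n < len(t) and t[a + n] == t[b + n]:
            n += 1
        return n
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(t))]
```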
Sign Assignment Problems on Protein Networks

In a maximum sign assignment problem one is given an undirected graph and a set of signed source-target vertex pairs. The goal is to assign signs to the graph’s edges so that a maximum number of pairs admit a source-to-target path whose aggregate sign (product of its edge signs) equals the pair’s sign. This problem arises in the annotation of physical interaction networks with activation/repression signs. It is known to be NP-complete and most previous approaches to tackle it were limited to considering very short paths in the network. Here we provide a sign assignment algorithm that solves the problem to optimality by reformulating it as an integer program. We apply our algorithm to sign physical interactions in yeast and measure our performance using edges whose activation/repression signs are known. We find that our algorithm achieves high accuracy (89%), outperforming a state-of-the-art method by a significant margin.

Shay Houri, Roded Sharan
Sparse Learning Based Linear Coherent Bi-clustering

Clustering algorithms are often limited by the assumption that each data point belongs to a single class, and furthermore that all features of a data point are relevant to class determination. Such assumptions are inappropriate in applications such as gene clustering, where, given expression profile data, genes may exhibit similar behaviors only under some, but not all, conditions, and genes may participate in more than one functional process and hence belong to multiple groups. Identifying genes that have similar expression patterns in a common subset of conditions is a central problem in gene expression microarray analysis. To overcome the limitations of standard clustering methods for this purpose, bi-clustering has often been proposed as an alternative approach, in which one seeks groups of observations that exhibit similar patterns over a subset of the features. In this paper, we propose SLLB, a novel sparse learning based model for identifying linear-coherent bi-clusters in gene expression data, strictly generalizing the type of bi-cluster structure considered by other methods. Our algorithm builds on recent sparse learning techniques that have gained significant attention in the machine learning research community. Experiments on both synthetic data and real gene expression data demonstrate that the model is significantly more effective than current bi-clustering algorithms for these problems. The parameter selection problem and the model's usefulness in other machine learning clustering applications are also discussed. The on-line appendix for this paper can be found at http://www.cs.ualberta.ca/~ys3/SLLB.
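As a concrete reading of "linear coherent": in such a bi-cluster, every selected gene's expression over the selected conditions is (approximately) an affine function a·x + b of a common pattern. A small checker for this property (our own illustrative formulation, not the SLLB model itself) might be:

```python
def is_linear_coherent(block, tol=1e-9):
    """Check whether every row of `block` is an affine function
    a * base + b of the first row, up to tolerance `tol`."""
    base = block[0]
    n = len(base)
    for row in block[1:]:
        # least-squares fit of a and b for this row against the base
        sx, sy = sum(base), sum(row)
        sxx = sum(x * x for x in base)
        sxy = sum(x * y for x, y in zip(base, row))
        det = n * sxx - sx * sx
        if abs(det) < tol:  # constant base row: row must be constant too
            if max(row) - min(row) > tol:
                return False
            continue
        a = (n * sxy - sx * sy) / det
        b = (sy - a * sx) / n
        if any(abs(a * x + b - y) > tol for x, y in zip(base, row)):
            return False
    return True
```

For example, the rows (1, 2, 3), (3, 5, 7), (0, 0, 0) form a linear-coherent block (the second row is 2x + 1, the third is 0x + 0), whereas (1, 4, 9) is not an affine function of (1, 2, 3).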

Yi Shi, Xiaoping Liao, Xinhua Zhang, Guohui Lin, Dale Schuurmans
A Simplified View of DCJ-Indel Distance

The introduction of the double cut and join (DCJ) operation and the derivation of its associated distance caused a flurry of research into the study of multichromosomal rearrangements. However, little of this work has incorporated indels (i.e., insertions and deletions) into the calculation of genomic distance functions; a notable exception is Braga et al. [1], who provided a linear-time algorithm for computing the DCJ-indel distance. Although this algorithm runs in linear time, its derivation is lengthy and depends on a large number of possible cases. In this paper, we provide a simplified indel model that solves the problem of DCJ-indel sorting in linear time directly from the classical breakpoint graph, an approach that also allows us to describe the solution space of DCJ-indel sorting, thus resolving an existing open problem.

Phillip E. C. Compeau
DCJ-indel Distance with Distinct Operation Costs

The double-cut and join (DCJ) is a genomic operation that generalizes the typical mutations to which genomes are subject. The distance between two genomes, in terms of number of DCJ operations, can be computed in linear time. More powerful is the DCJ-indel model, which handles genomes with unequal contents, allowing, besides the DCJ operations, the insertion and/or deletion of pieces of DNA – named indel operations. It has been shown that the DCJ-indel distance can also be computed in linear time, assuming that the same cost is assigned to any DCJ or indel operation. In the present work we consider a new DCJ-indel distance in which the indel cost is distinct from and upper bounded by the DCJ cost. Considering that the DCJ cost is equal to 1, we set the indel cost equal to a positive constant w ≤ 1 and show that the distance can still be computed in linear time. This new distance generalizes the previous DCJ-indel distance considered in the literature (which uses the same cost for both types of operations).

Poly H. da Silva, Marília D. V. Braga, Raphael Machado, Simone Dantas
Hidden Breakpoints in Genome Alignments

During the course of evolution, an organism’s genome can undergo changes that affect the large-scale structure of the genome. These changes include gene gain, loss, duplication, chromosome fusion, fission, and rearrangement. When gene gain and loss occur in addition to other types of rearrangement, breakpoints of rearrangement can exist that are detectable only by comparison of three or more genomes. An arbitrarily large number of these “hidden” breakpoints can exist among genomes that exhibit no rearrangements in pairwise comparisons.

We present an extension of the multichromosomal breakpoint median problem to genomes that have undergone gene gain and loss. We then demonstrate that the median distance among three genomes can be used to calculate a lower bound on the number of hidden breakpoints present. We provide an implementation of this calculation including the median distance, along with some practical improvements on the time complexity of the underlying algorithm.

We apply our approach to measure the abundance of hidden breakpoints in simulated data sets under a wide range of evolutionary scenarios. We demonstrate that in simulations the hidden breakpoint counts depend strongly on relative rates of inversion and gene gain/loss. Finally we apply current multiple genome aligners to the simulated genomes, and show that all aligners introduce a high degree of error in hidden breakpoint counts, and that this error grows with evolutionary distance in the simulation. Our results suggest that hidden breakpoint error may be pervasive in genome alignments.
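The pairwise notion underlying the median and hidden-breakpoint machinery is the classical breakpoint count: adjacencies (consecutive, orientation-aware gene pairs, plus telomere caps) present in one genome but absent from the other. A small sketch for single linear chromosomes with equal gene content (our simplification; the paper's setting is multichromosomal with gene gain/loss):

```python
def breakpoints(g1, g2):
    """Count adjacencies of signed gene order g1 absent from g2.
    Genes are nonzero signed integers; 0 caps the chromosome ends.
    An adjacency (a, b) is orientation-symmetric with (-b, -a)."""
    def adjacencies(g):
        ext = [0] + list(g) + [0]
        adj = set()
        for a, b in zip(ext, ext[1:]):
            adj.add((a, b))
            adj.add((-b, -a))
        return adj
    a2 = adjacencies(g2)
    ext = [0] + list(g1) + [0]
    return sum((a, b) not in a2 for a, b in zip(ext, ext[1:]))
```

Inverting a single gene, e.g. comparing 1 2 3 with 1 −2 3, creates two breakpoints, one on each side of the inverted gene.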

Birte Kehr, Knut Reinert, Aaron E. Darling
A Probabilistic Approach to Accurate Abundance-Based Binning of Metagenomic Reads

An important problem in metagenomic analysis is to determine and quantify the species (or genomes) present in a metagenomic sample. The identification of phylogenetically related groups of sequence reads in a metagenomic dataset is often referred to as binning. Similarity-based binning methods rely on reference databases and are unable to classify reads from unknown organisms. Composition-based methods exploit compositional patterns that are preserved in sufficiently long fragments, but are not suitable for binning very short next-generation sequencing (NGS) reads. Recently, several new metagenomic binning algorithms have been developed that can handle NGS reads and do not rely on reference databases. However, all of them have difficulty handling samples containing low-abundance species. We propose a new method to accurately estimate the abundance levels of species based on a novel probabilistic model for counting l-mer frequencies in a metagenomic dataset that takes into account the frequencies of erroneous l-mers and repeated l-mers. An expectation maximization (EM) algorithm is used to learn the parameters of the model. Our algorithm automatically determines the number of abundance groups in a dataset and bins the reads into these groups. We show that our method outperforms the most recent abundance-based binning method, AbundanceBin, on both simulated and real datasets. We also show that the improved abundance-based binning method can be incorporated into the recent tool TOSS (which separates genomes with similar abundance levels and employs AbundanceBin as a preprocessing step to handle different abundance levels) to enhance its performance. We test the improved TOSS on simulated datasets and show that it significantly outperforms the original TOSS on datasets containing low-abundance genomes. Finally, we compare this approach against the very recent metagenomic binning tools MetaCluster 4.0 and MetaCluster 5.0 on simulated data and demonstrate that it usually achieves better sensitivity and breaks fewer genomes.
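The model's input is the l-mer frequency spectrum of the read set. A minimal sketch of producing that spectrum (only the raw counting step; the paper's mixture model and EM fitting are not shown) could be:

```python
from collections import Counter

def lmer_spectrum(reads, l):
    """Count occurrences of every length-l substring (l-mer) across a
    read set, and derive the frequency spectrum: how many distinct
    l-mers occur exactly k times.  Erroneous l-mers typically pile up
    at low k, repeated l-mers at high k."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - l + 1):
            counts[r[i:i + l]] += 1
    spectrum = Counter(counts.values())
    return counts, spectrum
```

For the toy reads ACGT and CGTA with l = 3, the l-mer CGT occurs twice and the other two l-mers once each, so the spectrum records two l-mers of count 1 and one of count 2.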

Olga Tanaseichuk, James Borneman, Tao Jiang
Tandem Halving Problems by DCJ

We address the problem of reconstructing a non-duplicated ancestor of a partially duplicated genome, in a model where the duplicated content arises from several tandem duplications over the course of evolution and the only allowed rearrangement operations are DCJs. As a starting point, we consider a variant of the Genome Halving Problem, aiming at reconstructing a tandem duplicated genome instead of the traditional perfectly duplicated genome. We provide a distance computable in $\mathcal{O}(n)$ time and a scenario in $\mathcal{O}(n^2)$ time. In an attempt to enhance our model, we consider several problems related to multiple tandem reconstruction. Unfortunately, we show that although the problem of reconstructing a single tandem can be solved in polynomial time, it is already NP-hard for two tandems.

Antoine Thomas, Aïda Ouangraoua, Jean-Stéphane Varré
A Practical Approximation Algorithm for Solving Massive Instances of Hybridization Number

Reticulate events play an important role in determining evolutionary relationships. The problem of computing the minimum number of such events to explain discordance between two phylogenetic trees is a hard computational problem. In practice, exact solvers struggle to solve instances with reticulation number larger than 40. For such instances, one has to resort to heuristics and approximation algorithms. Here we present the algorithm CycleKiller, which is the first approximation algorithm that can produce solutions verifiably close to optimality for instances with hundreds or even thousands of reticulations. Theoretically, the algorithm is an exponential-time 2-approximation (or 4-approximation in its fastest mode). However, using simulations we demonstrate that in practice the algorithm runs quickly for large and difficult instances, producing solutions within one percent of optimality. An implementation of this algorithm, which extends the theoretical work of [14], has been made publicly available.

Leo van Iersel, Steven Kelk, Nela Lekić, Celine Scornavacca
Distributed String Mining for High-Throughput Sequencing Data

The goal of frequency constrained string mining is to extract substrings that discriminate between two (or more) datasets. Known solutions to the problem range from an optimal-time algorithm to various time–space tradeoffs. However, all of the existing algorithms were designed to run sequentially and require that the whole input fit in main memory. Due to these limitations, the existing algorithms are practical only up to a few gigabytes of input. We introduce a distributed algorithm that offers a novel time–space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from, e.g., Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.
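To pin down the task itself: a substring is extracted when its document frequency is high in one dataset and low in the other. A brute-force formulation (exponentially wasteful next to the index-based methods the paper builds on; parameter names are ours) might read:

```python
def discriminative_substrings(pos, neg, min_pos, max_neg):
    """Frequency-constrained string mining, brute force: return the
    substrings occurring in at least `min_pos` strings of `pos` and
    in at most `max_neg` strings of `neg`."""
    def doc_freq(s, docs):
        return sum(s in d for d in docs)
    candidates = {s[i:j] for s in pos
                  for i in range(len(s))
                  for j in range(i + 1, len(s) + 1)}
    return {s for s in candidates
            if doc_freq(s, pos) >= min_pos and doc_freq(s, neg) <= max_neg}
```

For pos = {abba, abc} and neg = {bba}, requiring frequency 2 in pos and 0 in neg leaves only the substring "ab", since "a" and "b" also occur in the negative set.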

Niko Välimäki, Simon J. Puglisi
Backmatter
Metadata
Title
Algorithms in Bioinformatics
Edited by
Ben Raphael
Jijun Tang
Copyright year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-33122-0
Print ISBN
978-3-642-33121-3
DOI
https://doi.org/10.1007/978-3-642-33122-0
