Skip to main content

Über dieses Buch

This book constitutes the proceedings of the 6th InternationalConference on Algorithms for Computational Biology, AlCoB 2019, held in Berkeley, CA, USA, in May 2019.

The 15 full papers presented together with 1 invited paper were carefully reviewed and selected from 30 submissions. They are organized in the following topical sections: Biological networks and graph algorithms; genome rearrangement, assembly and classification; sequence analysis, phylogenetics and other biological processes.



Invited Talk


New Divide-and-Conquer Techniques for Large-Scale Phylogenetic Estimation

Over the last years, the availability of genomic sequence data from thousands of different species has led to hopes that a phylogenetic tree of all life might be achievable. Yet, the most accurate methods for estimating phylogenies are heuristics for NP-hard optimization problems, many of which are too computationally intensive to use on large datasets. Divide-and-conquer approaches have been proposed to address scalability to large datasets that divide the species into subsets, construct trees on subsets, and then merge the trees together. Prior approaches have divided species sets into overlapping subsets and used supertree methods to merge the subset trees, but limitations in supertree methods suggest this kind of divide-and-conquer approach is unlikely to provide scalability to ultra-large datasets. Recently, a new approach has been developed that divides the species dataset into disjoint subsets, computes trees on subsets, and then combines the subset trees using auxiliary information (e.g., a distance matrix). Here, we describe these strategies and their theoretical properties, present open problems, and discuss opportunities for impact in large-scale phylogenetic estimation using these and similar approaches.
Tandy Warnow

Biological Networks and Graph Algorithms


New Polynomial-Time Algorithm Around the Scaffolding Problem

We describe in this paper an approximation algorithm for the scaffolding problem, which is part of genome inference in bioinformatics. The aim of the problem is to find a maximum weighted collection of disjoint alternating cycles and paths covering a particular graph called scaffold graph. The problem is known to be NP-complete, and we describe further result concerning a special class of graphs aiming to be close to real instances. The described algorithm is the first polynomial-time approximation algorithm designed for this problem on non-complete graphs.
Tom Davot, Annie Chateau, Rodolphe Giroudeau, Mathias Weller

Enumerating Dominant Pathways in Biological Networks by Information Flow Analysis

Cells perceive and respond to their microenvironment as a part of their functioning via networks of processes resulting from molecular interactions. The complexity of such networks has been the subject of studies that address their various aspects. Some of these include static methods that focus on graph representations and their consequent properties, while others take a dynamical systems approach based on simulations. Here, we address the problem of identifying dominant pathways in biological networks that are represented as activation and repression edges. For this purpose, we propose a hybrid method that combines static graph properties with a dynamic quantification of information flow that results from stochastic simulations. We first illustrate our method on a simple example, and then apply it to the Escherichia coli transcription network consisting of 4639 regulatory edges.
Ozan Kahramanoğulları

Comparing Different Graphlet Measures for Evaluating Network Model Fits to BioGRID PPI Networks

The network structure of protein-protein interaction (PPI) networks has been studied for over a decade. Many theoretical models have been proposed to model PPI networks, but continuing noise and incompleteness in these networks make conclusions difficult. Graphlet-based measures are believed to be among the strongest, most discerning and sensitive network comparison tools available. Several graphlet-based measures have been proposed to measure topological agreement between networks and models, with little work done to compare the measures themselves. The last modeling attempt was 4 years ago; it is time for an update. Using Sept. 2018 BioGRID, we fit eight theoretical models to nine BioGRID networks using four different graphlet-based measures. We find the following: (1) Graph Kernel is the best measure based on ROC and AUPR curves; (2) most graphlet measures disagree on the ordering of the data-model fits, although most agree on the top two (STICKY and Hyperbolic Geometric) and bottom two (ER and GEO) models, in direct contradiction to the 4-years-ago conclusion that GEO models are best; (3) the STICKY model is overall the best fit for these PPI networks but the Hyperbolic Geometric model is a better fit than STICKY on 4 species; and (4) even the best models provide p-values for BioGRID that are many orders of magnitude smaller than 1, thus failing any reasonable hypothesis test. We conclude that in spite of STICKY being the best fit, all BioGRID networks fail all hypothesis tests against all existing models, using all existing graphlet-based measures. Further work is needed to discover whether the data or the models are at fault.
Sridevi Maharaj, Zarin Ohiba, Wayne Hayes

Graph-Theoretic Partitioning of RNAs and Classification of Pseudoknots

Dual graphs have been applied to model RNA secondary structures with pseudoknots, or intertwined base pairs. In a previous work, a linear-time algorithm was introduced to partition dual graphs into maximal topological components called blocks and determine whether each block contains a pseudoknot or not. This characterization allowed us to efficiently isolate smaller RNA fragments and classify them as pseudoknotted or pseudoknot-free regions, while keeping these sub-structures intact. In this paper we extend the partitioning algorithm by classifying a pseudoknot as either recursive or non-recursive. A pseudoknot is recursive if it contains independent regions or fragments. Each of these regions can be also identified by the modified algorithm, continuing with our current research in the development of a library of building blocks for RNA design by fragment assembly. Partitioning and classification of RNAs using dual graphs provide a systematic way for study of RNA structure and prediction.
Louis Petingi, Tamar Schlick

PathRacer: Racing Profile HMM Paths on Assembly Graph

Recently large databases containing profile Hidden Markov Models (pHMMs) emerged. These pHMMs may represent the sequences of antibiotic resistance genes, or allelic variations amongst highly conserved housekeeping genes used for strain typing, etc. The typical application of such a database includes the alignment of contigs to pHMM hoping that the sequence of gene of interest is located within the single contig. Such a condition is often violated for metagenomes preventing the effective use of such databases.
We present PathRacer—a novel standalone tool that aligns profile HMM directly to the assembly graph (performing the codon translation on fly for amino acid pHMMs). The tool provides the set of most probable paths traversed by a HMM through the whole assembly graph, regardless whether the sequence of interested is encoded on the single contig or scattered across the set of edges, therefore significantly improving the recovery of sequences of interest even from fragmented metagenome assemblies.
Alexander Shlemov, Anton Korobeynikov

Genome Rearrangement, Assembly and Classification


A Uniform Theory of Adequate Subgraphs for the Genome Median, Halving, and Aliquoting Problems

One of the key computational problems in comparative genomics is the reconstruction of genomes of ancestral species based on genomes of extant species. Since most dramatic changes in genomic architectures are caused by genome rearrangements, this problem is often posed as minimization of the number of genome rearrangements between extant and ancestral genomes. The basic case of three given genomes is known as the genome median problem. Whole genome duplications (WGDs) represent yet another type of dramatic evolutionary events and inspire the reconstruction of pre-duplicated ancestral genomes, referred to as the genome halving problem. Generalization of WGDs to whole genome multiplication events leads to the genome aliquoting problem.
In the present study, we generalize the adequate subgraphs approach previously proposed for the genome median problem to the genome halving and aliquoting problems. Our study lays a theoretical foundation for practical algorithms for the reconstruction of pre-duplicated ancestral genomes.
Pavel Avdeyev, Maria Atamanova, Max A. Alekseyev

Lightweight Metagenomic Classification via eBWT

The development of Next Generation Sequencing has had a major impact on the study of genetic sequences, and in particular, on the advancement of metagenomics, whose aim is to identify the microorganisms that are present in a sample collected directly from the environment. In this paper, we describe a new lightweight alignment-free and assembly-free framework for metagenomic classification that compares each unknown sequence in the sample to a collection of known genomes. We take advantage of the combinatorial properties of an extension of the Burrows-Wheeler transform, and we sequentially scan the required data structures, so that we can analyze unknown sequences of large collections using little internal memory. For the best of our knowledge, this is the first approach that is assembly- and alignment-free, and is not based on k-mers. We show that our experiments confirm the effectiveness of our approach and the high accuracy even in negative control samples. Indeed we only classify 1 short read on 5,726,358 random shuffle reads. Finally, the results are comparable with those achieved by read-mapping classifiers and by k-mer based classifiers.
Veronica Guerrini, Giovanna Rosone

MULKSG: MULtiple K Simultaneous Graph Assembly

This work shows how to parallelize multi K de Bruijn graph genome assembly simultaneously, removing the bottleneck of iterative multi K assembly. The expected execution time on a single node with 40 cores is variable, with the average execution time for the entire pipeline over 16 datasets tested being 1613 s for SPAdes vs. 1581 s for MULKSG, with the MULKSG graph creation and traversal averaging 15% faster than SPAdes. We implement a multi-node implementation for the graph creation and traversal portions of the assembly, showing the speedups in Fig. 4. We show that when implemented correctly with correction phases performed per graph in parallel, the expected outcome is very close to the original method, in some cases having less errors while keeping the same NGA50 and genome coverage %. We show this works in practice, implementing with the popular genome assembler SPAdes. Further, this algorithmic change gets rid of the single node sequential bottleneck on multi K genome assembly, allowing for the use of parallel error correction, graph building, graph correction, and graph traversal. We implement a parallel version of the assembly and show the statistics are the same as when run on a single node. The code is open source and can be found at https://​github.​com/​cwright7101/​mulksg.
Christopher Wright, Sriram Krishnamoorty, Milind Kulkarni

Counting Sorting Scenarios and Intermediate Genomes for the Rank Distance

An important problem in genome comparison is the genome sorting problem, that is, the problem of finding a sequence of basic operations that transforms one genome into another whose length (possibly weighted) equals the distance between them. These sequences are called optimal sorting scenarios. However, there is usually a large number of such scenarios, and a naïve algorithm is very likely to be biased towards a specific type of scenario, impairing its usefulness in real-world applications. One way to go beyond the traditional sorting algorithms is to explore all possible solutions, looking at all the optimal sorting scenarios instead of just an arbitrary one. Another related approach is to analyze all the intermediate genomes, that is, all the genomes that can occur in an optimal sorting scenario. In this paper, we show how to count the number of optimal sorting scenarios and the number of intermediate genomes between any two given genomes, under the rank distance.
João Paulo Pereira Zanetti, Leonid Chindelevitch, João Meidanis

Generalizations of the Genomic Rank Distance to Indels

The rank distance model, introduced by Zanetti et al. in 2016, represents genome rearrangements in multi-chromosomal genomes looking at them as matrices. So far, this model only supported comparisons between genomes with the same gene content. We seek to generalize it, allowing for genomes with different gene content. In this paper, we approach such generalization from two different angles, both using the same representation of genomes, and leading to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications.
João Paulo Pereira Zanetti, Leonid Chindelevitch, João Meidanis

Sequence Analysis, Phylogenetics and Other Biological Processes


Using INC Within Divide-and-Conquer Phylogeny Estimation

In a recent paper (Zhang, Rao, and Warnow, Algorithms for Molecular Biology 2019), the INC (incremental tree building) algorithm was presented and proven to be absolute fast converging under standard sequence evolution models. A variant of INC which allows a set of disjoint constraint trees to be provided and then uses INC to merge the constraint trees was also presented (i.e., Constrained INC). We report on a study evaluating INC on a range of simulated datasets, and show that it has very poor accuracy in comparison to standard methods. We also explore the design space for divide-and-conquer strategies for phylogeny estimation that use Constrained INC, and show modifications that provide improved accuracy. In particular, we present INC-ML, a divide-and-conquer approach to maximum likelihood (ML) estimation that comes close to the leading ML heuristics in terms of accuracy, and is more accurate than the current best distance-based methods.
Thien Le, Aaron Sy, Erin K. Molloy, Qiuyi (Richard) Zhang, Satish Rao, Tandy Warnow

Predicting Methylation from Sequence and Gene Expression Using Deep Learning with Attention

DNA methylation has been extensively linked to alterations in gene expression, playing a key role in the manifestation of multiple diseases, especially cancer. Hence, the sequence determinants of methylation and the relationship between methylation and expression are of great interest from a molecular biology perspective. Several models have been suggested to support the prediction of methylation status. These models, however, have two main limitations: (a) they are limited to specific CpG loci; and (b) they are not easily interpretable. We address these limitations using deep learning with attention. We produce a general model that predicts DNA methylation for a given sample in any CpG position based solely on the sample’s gene expression profile and the sequence surrounding the CpG. Depending on gene-CpG proximity, our model attains a Spearman correlation of up to 0.84 for thousands of CpG sites on two separate test sets of CpG positions and subjects (cancer and healthy samples). Importantly, our approach, especially the use of attention, offers a novel framework with which to extract valuable insights from gene expression data when combined with sequence information. We demonstrate this by linking several motifs and genes to methylation activity, including Nodal and Hand1. The code and trained weights are available at: https://​github.​com/​YakhiniGroup/​Methylation.
Alona Levy-Jurgenson, Xavier Tekpli, Vessela N. Kristensen, Zohar Yakhini

A Mathematical Model for Enhancer Activation Kinetics During Cell Differentiation

Cell differentiation and development are for a great part steered by cell type specific enhancers. Transcription factor (TF) binding to an enhancer together with DNA looping result in transcription initiation. In addition to binding motifs for TFs, enhancer regions typically contain specific histone modifications. This information has been used to detect enhancer regions and classify them into different subgroups. However, it is poorly understood how TF binding and histone modifications are causally connected and what kind of molecular dynamics steer the activation process.
Contrary to previous studies, we do not treat the activation events as static epigenetic marks but consider the enhancer activation as a dynamic process. We develop a mathematical model to describe the dynamic mechanisms between TF binding and histone modifications known to characterize an active enhancer. We estimate model parameters from time-course data and infer the causal relationships between TF binding and different histone modifications. We benchmark the performance of this framework using simulated data and survey the ability of our method to identify the correct model structures for a variety of system dynamics, noise levels and the number of measurement time points.
Kari Nousiainen, Jukka Intosalmi, Harri Lähdesmäki

Transcript Abundance Estimation and the Laminar Packing Problem

The expectation-maximization (EM) algorithm, or a streaming version of it, is widely used to resolve ambiguity during transcript abundance estimation from RNA-seq reads. The streaming algorithm is fast and memory efficient but its accuracy can depend on the order of the reads, which can be stabilized if a tree can be constructed to capture the ambiguity structure. Motivated by this, the laminar packing problem is introduced, which is proved to be NP-hard. Hardness of approximation and approximability results are also provided. Finally, an integer linear programming (ILP) formulation and a greedy approach are applied to real data from the human transcriptome to demonstrate that large instances can be solved in practice.
Atif Rahman, Lior Pachter

Efficient Algorithms for Finding Edit-Distance Based Motifs

Motif mining is a classical data mining problem which aims to extract relevant information and discover knowledge from voluminous datasets in a variety of domains. Specifically, for the temporal data containing real numbers, it is formulated as time series motif mining (TSMM) problem. If the input is alphabetical and edit-distance is considered, this is called Edit-distance Motif Search (EMS). In EMS, the problem of interest is to find a pattern of length l which occurs with an edit-distance of at most d in each of the input sequences.
There exists some algorithms proposed in the literature to solve EMS problem. However, in terms of challenging instances and large datasets, they are still not efficient. In this paper, EMS3, a motif mining algorithm, that advances the state-of-the-art EMS solvers by exploiting the idea of projection is proposed. Solid theoretical analyses and extensive experiments on commonly used benchmark datasets show that EMS3 is efficient and outperforms the existing state-of-the-art algorithm (EMS2).
Peng Xiao, Xingyu Cai, Sanguthevar Rajasekaran


Weitere Informationen

Premium Partner