Skip to main content
Top

2020 | Book

Algorithms for Computational Biology

7th International Conference, AlCoB 2020, Missoula, MT, USA, April 13–15, 2020, Proceedings

insite
SEARCH

About this book

This book constitutes the proceedings of the 7th International Conference on Algorithms for Computational Biology, AlCoB 2020, held in Missoula, MT, USA in April 2020.
The 15 full papers included in this volume were carefully reviewed and selected from 24 submissions. They were organized in topical sections on genomics, phylogenetics, and RNA-Seq and other biological processes.

Table of Contents

Frontmatter

Genomics

Frontmatter
Parallel Generalized Suffix Tree Construction for Genomic Data
Abstract
After a decade of digitization and technological advancements, we have an abundance of usable genomic data, which provide unique insights into our well-being. However, such datasets are large in volume, and retrieving meaningful information from them is often challenging. Hence, different indexing techniques and data structures have been proposed to handle such a massive scale of data. We utilize one such technique: Generalized Suffix Tree (GST). In this paper, we introduce an efficient parallel generalized suffix tree construction algorithm that is scalable for arbitrary genomic datasets. Our construction mechanism employs shared and distributed memory architecture collectively while not posing any fixed, prior memory requirement as it uses external memory (disks). Our experimental results show that our proposed architecture offers around 4-times speedup with respect to the sequential algorithm with only 16 parallel processors. The experiments on different datasets and parameters also exhibit the scalability of the execution time. In addition, we utilize different string queries and demonstrate their execution time on such tree structure, illustrating the efficacy and usability of GST for genomic data.
Md Momin Al Aziz, Parimala Thulasiraman, Noman Mohammed
A 3.5-Approximation Algorithm for Sorting by Intergenic Transpositions
Abstract
Genome Rearrangements affect large stretches of genomes during evolution. One of the most studied genome rearrangement is the transposition, which occurs when a sequence of genes is moved to another position inside the genome. Mathematical models have been used to estimate the evolutionary distance between two different genomes based on genome rearrangements. However, many of these models have focused only on the (order of the) genes of a genome, disregarding other important elements in it. Recently, researchers have shown that considering existing regions between each pair of genes, called intergenic regions, can enhance the distance estimation in realistic data. In this work, we study the transposition distance between two genomes, but we also consider intergenic regions, a problem we name Sorting Permutations by Intergenic Transpositions (SbIT). We show that this problem is NP-hard and propose a 3.5-approximation algorithm for it.
Andre Rodrigues Oliveira, Géraldine Jean, Guillaume Fertin, Klairton Lima Brito, Ulisses Dias, Zanoni Dias
Heuristics for Reversal Distance Between Genomes with Duplicated Genes
Abstract
In comparative genomics, one goal is to find similarities between genomes of different organisms. Comparisons using genome features like genes, gene order, and regulatory sequences are carried out with this purpose in mind.
Genome rearrangements are mutational events that affect large extensions of the genome. They are responsible for creating extant species with conserved genes in different positions across genomes.
Close species—from an evolutionary point of view—tend to have the same set of genes or share most of them. When we consider gene order to compare two genomes, it is possible to use a parsimony criterion to estimate how close the species are. We are interested in the shortest sequence of genome rearrangements capable of transforming one genome into the other, which is named rearrangement distance.
Reversal is one of the most studied genome rearrangements events. This event acts in a segment of the genome, inverting the position and possibly the orientation of genes in it.
When the genome has no gene repetition, a common approach is to map it as a permutation such that each element represents a conserved block.
When genomes have replicated genes, this mapping is usually performed using strings. The number of replicas depends on the organisms being compared, but in many scenarios, it tends to be small. In this work, we study the reversal distance between genomes with duplicated genes considering that the orientation of genes is unknown. We present three heuristics that use techniques like genetic algorithms and local search. We conduct experiments using a database of simulated genomes and compared our results with other algorithms from the literature.
Gabriel Siqueira, Klairton Lima Brito, Ulisses Dias, Zanoni Dias
Extending Maximal Perfect Haplotype Blocks to the Realm of Pangenomics
Abstract
Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding conserved haplotype blocks, which can be done in linear time. Here, we extend the notion of conserved haplotype blocks to pangenomes, which can store more complex variation than a single reference genome. We define a maximal perfect pangenome haplotype block and give a linear-time, suffix tree based approach to find all such blocks from a set of pangenome haplotypes. We demonstrate the method by applying it to a pangenome built from yeast strains.
Lucia Williams, Brendan Mumey
Gaps and Runs in Syntenic Alignments
Abstract
Gene loss is the obverse of novel gene acquisition by a genome through a variety of evolutionary processes. It serves a number of functional and structural roles, compensating for the energy and material costs of gene complement expansion.
A type of gene loss widespread in the lineages of plant genomes is “fractionation” after whole genome doubling or tripling, where one of a pair or triplet of paralogous genes in parallel syntenic contexts is discarded.
The detailed syntenic mechanisms of gene loss, especially in fractionation, remain controversial.
We focus on the the frequency distribution of gap lengths (number of deleted genes – not nucleotides) within syntenic blocks calculated during the comparison of chromosomes from two genomes. We mathematically characterize s simple model in some detail and show how it is an adequate description neither of the Coffea arabica subgenomes nor its two progenitor genomes.
We find that a mixture of two models, a random, one-gene-at-a-time, model and a geometric-length distributed excision for removing a variable number of genes, fits well.
Zhe Yu, Chunfang Zheng, David Sankoff

Phylogenetics

Frontmatter
Comparing Integer Linear Programming to SAT-Solving for Hard Problems in Computational and Systems Biology
Abstract
It is useful to have general-purpose solution methods that can be applied to a wide range of problems, rather than relying on the development of clever, intricate algorithms for each specific problem. Integer Linear Programming is the most widely-used such general-purpose solution method. It is successful in a wide range of problems. However, there are some problems in computational biology where integer linear programming has had only limited success. In this paper, we explore an alternate, general-purpose solution method: SAT-solving, i.e., constructing Boolean formulas in conjunctive normal form (CNF) that encode a problem instance, and using a SAT-solver to determine if the CNF formula is satisfiable or not. In three hard problems examined, we were very surprised to find the SAT-solving approach was dramatically better than the ILP approach in two problems; and a little slower, but more robust, in the third problem. We also re-examined and confirmed an earlier result on a fourth problem, using current ILP and SAT-solvers. These results should encourage further efforts to exploit SAT-solving in computational biology.
Hannah Brown, Lei Zuo, Dan Gusfield
Combining Networks Using Cherry Picking Sequences
Abstract
Phylogenetic networks are important for the study of evolution. The number of methods to find such networks is increasing, but most such methods can only reconstruct small networks. To find bigger networks, one can attempt to combine small networks. In this paper, we study the Network Hybridization problem, a problem of combining networks into another network with low complexity. We characterize this complexity via a restricted problem, Tree-child Network Hybridization, and we present an FPT algorithm to efficiently solve this restricted problem.
Remie Janssen, Mark Jones, Yukihiro Murakami
Linear Time Algorithm for Tree-Child Network Containment
Abstract
Phylogenetic networks are used to represent evolutionary scenarios in biology and linguistics. To find the most probable scenario, it may be necessary to compare candidate networks, to distinguish different networks, and to see when one network is embedded in another. Here, we consider the Network Containment problem, which asks whether a given network is contained in another network. We give a linear-time algorithm to this problem for the class of tree-child networks using the recently introduced tree-child sequences by Linz and Semple. We implement this algorithm in Python and show that the linear-time theoretical bound on the input size is achievable in practice.
Remie Janssen, Yukihiro Murakami
PathOGiST: A Novel Method for Clustering Pathogen Isolates by Combining Multiple Genotyping Signals
Abstract
In this paper we study the problem of clustering bacterial isolates into epidemiologically related groups from next-generation sequencing data. Existing methods for this problem mainly use a single genotyping signal, and either use a distance-based method with a pre-specified number of clusters, or a phylogenetic tree-based method with a pre-specified threshold. We propose PathOGiST, an algorithmic framework for clustering bacterial isolates by leveraging multiple genotypic signals and calibrated thresholds. PathOGiST uses different genotypic signals, clusters the isolates based on these individual signals with correlation clustering, and combines the clusterings based on the individual signals through consensus clustering. We implemented and tested PathOGiST on three different bacterial pathogens - Escherichia coli, Yersinia pseudotuberculosis, and Mycobacterium tuberculosis - and we conclude by discussing further avenues to explore.
Mohsen Katebi, Pedro Feijao, Julius Booth, Mehrdad Mansouri, Sean La, Alex Sweeten, Reza Miraskarshahi, Matthew Nguyen, Johnathan Wong, William Hsiao, Cedric Chauve, Leonid Chindelevitch
TreeSolve: Rapid Error-Correction of Microbial Gene Trees
Abstract
Gene tree reconstruction is an important problem in phylogenetics. However, gene sequences often lack sufficient information to confidently distinguish between competing gene tree topologies. To overcome this limitation, the best gene tree reconstruction methods use a known species tree topology to guide the reconstruction of the gene tree. While such species-tree-aware gene tree reconstruction methods have been repeatedly shown to result in vastly more accurate gene trees, the most accurate of these methods often have prohibitively high computational costs.
In this work, we introduce a highly computationally efficient and robust species-tree-aware method, named TreeSolve, for microbial gene tree reconstruction. TreeSolve works by collapsing weakly supported edges of the input gene tree, resulting in a non-binary gene tree, and then using new algorithms and techniques to optimally resolve the non-binary gene trees with respect to the given species tree in an appropriately and dynamically constrained search space. Using thousands of real and simulated gene trees, we demonstrate that TreeSolve significantly outperforms the best existing species-tree-aware methods for microbes in terms of accuracy, speed, or both. Crucially, TreeSolve also implicitly keeps track of multiple optimal gene tree reconstructions and can compute either a single best estimate of the gene tree or multiple distinct estimates. As we demonstrate, aggregating over multiple gene tree candidates helps distinguish between correct and incorrect parts of an error-corrected gene tree. Thus, TreeSolve not only enables rapid gene tree error-correction for large gene trees without compromising on accuracy, but also enables accounting of inference uncertainty.
Misagh Kordi, Mukul S. Bansal

RNA-Seq and Other Biological Processes

Frontmatter
Time Series Adjustment Enhancement of Hierarchical Modeling of Arabidopsis Thaliana Gene Interactions
Abstract
Network models of gene interactions, using time course gene transcript abundance data, are computationally created using a genetic algorithm designed to incorporate hierarchical Bayesian methods with time series adjustments. The posterior probabilities of interaction between pairs of genes are based on likelihoods of directed acyclic graphs. This algorithm is applied to transcript abundance data collected from Arabidopsis thaliana genes. This study extends the underlying statistical and mathematical theory of the Norris-Patton likelihood by including time series adjustments.
Edward E. Allen, John Farrell, Alexandria F. Harkey, David J. John, Gloria Muday, James L. Norris, Bo Wu
BESTox: A Convolutional Neural Network Regression Model Based on Binary-Encoded SMILES for Acute Oral Toxicity Prediction of Chemical Compounds
Abstract
Compound toxicity prediction is a very challenging and critical task in the drug discovery and design field. Traditionally, cell or animal-based experiments are required to confirm the acute oral toxicity of chemical compounds. However, these methods are often restricted by availability of experimental facilities, long experimentation time, and high cost. In this paper, we propose a novel convolutional neural network regression model, named BESTox, to predict the acute oral toxicity (\(LD_{50}\)) of chemical compounds. This model learns the compositional and chemical properties of compounds from their two-dimensional binary matrices. Each matrix encodes the occurrences of certain atom types, number of bonded hydrogens, atom charge, valence, ring, degree, aromaticity, chirality, and hybridization along the SMILES string of a given compound. In a benchmark experiment using a dataset of 7413 observations (train/test 5931/1482), BESTox achieved a squared correlation coefficient (\(R^2\)) of 0.619, root-mean-squared error (RMSE) of 0.603, and mean absolute error (MAE) of 0.433. Despite of the use of a shallow model architecture and simple molecular descriptors, our method performs comparably against two recently published models.
Jiarui Chen, Hong-Hin Cheong, Shirley Weng In Siu
Stratified Test Alleviates Batch Effects in Single-Cell Data
Abstract
Analyzing single-cell sequencing data across batches is challenging. We find that the Van Elteren test, a stratified version of Wilcoxon rank-sum test, elegantly mitigates the problem. We also modified the common language effect size to supplement this test, further improving its utility. On both simulated and real patient data we show the ability of Van Elteren test to control for false positives and false negatives. The effect size also estimates the differences between cell types more accurately.
Shaoheng Liang, Qingnan Liang, Rui Chen, Ken Chen
A Topological Data Analysis Approach on Predicting Phenotypes from Gene Expression Data
Abstract
The goal of this study was to investigate if gene expression measured from RNA sequencing contains enough signal to separate healthy and afflicted individuals in the context of phenotype prediction. We observed that standard machine learning methods alone performed somewhat poorly on the disease phenotype prediction task; therefore we devised an approach augmenting machine learning with topological data analysis.
We describe a framework for predicting phenotype values by utilizing gene expression data transformed into sample-specific topological signatures by employing feature subsampling and persistent homology. The topological data analysis approach developed in this work yielded improved results on Parkinson’s disease phenotype prediction when measured against standard machine learning methods.
This study confirms that gene expression can be a useful indicator of the presence or absence of a condition, and the subtle signal contained in this high dimensional data reveals itself when considering the intricate topological connections between expressed genes.
Sayan Mandal, Aldo Guzmán-Sáenz, Niina Haiminen, Saugata Basu, Laxmi Parida
BOAssembler: A Bayesian Optimization Framework to Improve RNA-Seq Assembly Performance
Abstract
High throughput sequencing of RNA (RNA-Seq) can provide us with millions of short fragments of RNA transcripts from a sample. How to better recover the original RNA transcripts from those fragments (RNA-Seq assembly) is still a difficult task. For example, RNA-Seq assembly tools typically require hyper-parameter tuning to achieve good performance for particular datasets. This kind of tuning is usually unintuitive and time-consuming. Consequently, users often resort to default parameters, which do not guarantee consistent good performance for various datasets.
Results: Here we propose BOAssembler, a framework that enables end-to-end automatic tuning of RNA-Seq assemblers, based on Bayesian Optimization principles. Experiments show this data-driven approach is effective to improve the overall assembly performance. The approach would be helpful for downstream (e.g. gene, protein, cell) analysis, and more broadly, for future bioinformatics benchmark studies.
Shunfu Mao, Yihan Jiang, Edwin Basil Mathew, Sreeram Kannan
Backmatter
Metadata
Title
Algorithms for Computational Biology
Editors
Carlos Martín-Vide
Miguel A. Vega-Rodríguez
Travis Wheeler
Copyright Year
2020
Electronic ISBN
978-3-030-42266-0
Print ISBN
978-3-030-42265-3
DOI
https://doi.org/10.1007/978-3-030-42266-0

Premium Partner