
About this Book

This book constitutes the proceedings of the 23rd Annual Conference on Research in Computational Molecular Biology, RECOMB 2019, held in Washington, DC, USA, in April 2019.

The 17 extended and 20 short abstracts presented were carefully reviewed and selected from 175 submissions. The short abstracts are included in the back matter of the volume. The papers report on original research in all areas of computational molecular biology and bioinformatics.



An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.
In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes—patterns of color occurrence—present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e. samples or references) grows into the thousands.
We apply this encoding in the context of two different applications: the implicit cdbg used in a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved more than \(11\times \) better compression compared to RRR.
Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, Rob Patro
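The space saving behind this encoding can be illustrated with a toy sketch: represent each color class as a set of sample ids and store most classes only as their small difference from a similar reference class. This is not the authors' implementation; in particular, the brute-force neighbor search below merely stands in for the paper's de Bruijn graph search, and all names are illustrative.

```python
def delta_encode(color_classes, refs):
    """Encode each color class (a frozenset of sample ids) as a
    (reference index, symmetric difference) pair. When classes are
    highly correlated, the stored differences are tiny compared to
    the full bit-vectors."""
    encoded = []
    for cc in color_classes:
        # Pick the reference with the smallest symmetric difference.
        # (The paper finds close classes via de Bruijn graph search;
        # brute force is used here only for illustration.)
        best = min(range(len(refs)), key=lambda i: len(refs[i] ^ cc))
        encoded.append((best, refs[best] ^ cc))
    return encoded

def delta_decode(encoded, refs):
    """Invert delta_encode exactly: XOR-ing the difference back in
    recovers the original class, so the encoding is lossless."""
    return [refs[i] ^ diff for i, diff in encoded]
```

Round-tripping through `delta_decode` recovers the classes exactly, which mirrors the "exact representation" claim in the title.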

Identifying Clinical Terms in Free-Text Notes Using Ontology-Guided Machine Learning

Objective: Automatic recognition of medical concepts in unstructured text is an important component of many clinical and research applications, and its accuracy has a large impact on electronic health record analysis. The mining of such terms is complicated by the broad use of synonyms and non-standard terms in medical documents. Here we present a machine learning model for concept recognition in large unstructured text which optimizes the use of ontological structures and can identify previously unobserved synonyms for concepts in the ontology.
Materials and Methods: We present a neural dictionary model which can be used to predict whether a phrase is synonymous with a concept in a reference ontology. Our model, called Neural Concept Recognizer (NCR), uses a convolutional neural network together with the taxonomy structure to encode input phrases, then ranks medical concepts based on similarity in that space. It also utilizes the biomedical ontology structure to optimize the embedding of various terms and has fewer training constraints than previous methods. We train our model on two biomedical ontologies, the Human Phenotype Ontology (HPO) and SNOMED-CT.
Results: We tested our model trained on HPO on two different datasets: 288 annotated PubMed abstracts and 39 clinical reports. We also tested our model trained on SNOMED-CT on 2000 MIMIC-III ICU discharge summaries. The results of our experiments show the high accuracy of our model, as well as the value of utilizing the taxonomy structure of the ontology in concept recognition.
Conclusion: Most popular medical concept recognizers rely on rule-based models, which cannot generalize well to unseen synonyms. Also, most machine learning methods typically require large corpora of annotated text that cover all classes of concepts, which can be extremely difficult to obtain for biomedical ontologies. Without relying on large-scale labeled training data or requiring any custom training, our model can efficiently generalize to new synonyms and performs as well as or better than state-of-the-art methods custom built for specific ontologies.
Aryan Arbabi, David R. Adams, Sanja Fidler, Michael Brudno

ModHMM: A Modular Supra-Bayesian Genome Segmentation Method

Genome segmentation methods are powerful tools to obtain cell type or tissue specific genome-wide annotations and are frequently used to discover regulatory elements. However, traditional segmentation methods show low predictive accuracy and their data-driven annotations have some undesirable properties. As an alternative, we developed ModHMM, a highly modular genome segmentation method. Inspired by the supra-Bayesian approach, it incorporates predictions from a set of classifiers, which makes it possible to compute genome segmentations using state-of-the-art methodology. We demonstrate the method on ENCODE data and show that it outperforms traditional segmentation methods not only in terms of predictive performance, but also in qualitative aspects. Therefore, ModHMM is a valuable alternative to study the epigenetic and regulatory landscape across and within cell types or tissues. The software is freely available at https://github.com/pbenner/modhmm.
Philipp Benner, Martin Vingron

Learning Robust Multi-label Sample Specific Distances for Identifying HIV-1 Drug Resistance

Acquired immunodeficiency syndrome (AIDS) is a syndrome caused by the human immunodeficiency virus (HIV). During the progression of AIDS, a patient's immune system is weakened, which increases the patient's susceptibility to infections and diseases. Although antiretroviral drugs can effectively suppress HIV, the virus mutates very quickly and can become resistant to treatment. In addition, the virus can also become resistant to other treatments not currently being used through mutations, which is known in the clinical research community as cross-resistance. Since a single HIV strain can be resistant to multiple drugs, this problem is naturally represented as a multi-label classification problem. Given this multi-label structure, traditional single-label classification methods usually fail to effectively identify the drug resistances that may develop after a particular virus mutation. In this paper, we propose a novel multi-label Robust Sample Specific Distance (RSSD) method to identify multi-label HIV drug resistance. Our method is novel in that it can illustrate the relative strength of the drug resistance of a reverse transcriptase sequence against a given drug nucleoside analogue and learn the distance metrics for all the drug resistances. To learn the proposed RSSDs, we formulate a learning objective that maximizes the ratio of the summations of a number of \(\ell _1\)-norm distances, which is difficult to solve in general. To solve this optimization problem, we derive an efficient, non-greedy, iterative algorithm with rigorously proved convergence. Our new method has been verified on a public HIV-1 drug resistance data set with over 600 RT sequences and five nucleoside analogues. We compared our method against other state-of-the-art multi-label classification methods and the experimental results have demonstrated the effectiveness of our proposed method.
Lodewijk Brand, Xue Yang, Kai Liu, Saad Elbeleidy, Hua Wang, Hao Zhang

MethCP: Differentially Methylated Region Detection with Change Point Models

Whole-genome bisulfite sequencing (WGBS) provides a precise measure of methylation across the genome, yet presents a challenge in identifying regions that are differentially methylated (DMRs) between different conditions. Many methods have been developed, which focus primarily on the setting of two-group comparison. We develop MethCP, a DMR-detection method for WGBS data, which is applicable to a wide range of experimental designs beyond two-group comparisons, such as time-course data. MethCP identifies DMRs based on change point detection, which naturally segments the genome and provides region-level differential analysis. For simple two-group comparison, we show that our method outperforms existing methods in accurately detecting the complete DM region on a simulated dataset and an Arabidopsis dataset. Moreover, we show that MethCP is capable of detecting wide regions with small effect sizes, which are common in some settings but poorly detected by existing techniques. We also demonstrate the use of MethCP for time-course data on another dataset following methylation throughout seed germination in Arabidopsis.
Availability: The package MethCP has been submitted to Bioconductor, and is currently available at https://github.com/boyinggong/MethCP.
Boying Gong, Elizabeth Purdom
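The core idea of change-point-based DMR detection, segmenting a per-site difference signal where its mean shifts, can be sketched with generic binary segmentation. This is a textbook stand-in under a unit-variance assumption, not MethCP's actual statistic aggregation; the function names and threshold are illustrative.

```python
import math

def best_split(x):
    """Return (index, score) of the split maximizing a two-sample
    mean-difference score, scaled by sqrt of the harmonic sample
    size (assumes roughly unit variance; a toy stand-in for the
    per-site statistics MethCP aggregates)."""
    n, total = len(x), sum(x)
    best_i, best_s, left = None, 0.0, 0.0
    for i in range(1, n):
        left += x[i - 1]
        m1, m2 = left / i, (total - left) / (n - i)
        s = abs(m1 - m2) * math.sqrt(i * (n - i) / n)
        if s > best_s:
            best_i, best_s = i, s
    return best_i, best_s

def segment(x, threshold=3.0):
    """Binary segmentation: split recursively while the split score
    exceeds the threshold; returns sorted change-point indices,
    which delimit candidate regions for differential analysis."""
    i, s = best_split(x)
    if i is None or s < threshold:
        return []
    left = segment(x[:i], threshold)
    right = [i + j for j in segment(x[i:], threshold)]
    return left + [i] + right
```

On a methylation-difference signal that jumps from 0 to 1 halfway through, `segment` returns the single boundary index, i.e. the region-level view that per-site tests lack.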

On the Complexity of Sequence to Graph Alignment

The availability of extensive genetic data across multiple individuals and populations is driving the growing importance of graph-based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence to graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence to graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in query alone, graph alone, or in both. We prove that when changes are permitted in graphs either standalone or in conjunction with changes in the query, the sequence to graph alignment problem is \(\mathcal {NP}\)-complete under both Hamming and edit distance models for alphabets of size \({\ge }2\). For the case where only changes to the sequence are permitted, we present an \(O(|V|+m|E|)\) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.
Chirag Jain, Haowen Zhang, Yu Gao, Srinivas Aluru
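The tractable direction (changes permitted in the query only) can be illustrated with a minimal sketch of the Hamming-distance variant: the minimum number of substitutions needed to spell a query along some path of a node-labeled graph, computed in O(m|E|) time per query. This is an illustrative simplification, not the authors' algorithm, and it omits the edit-distance and affine-gap cases the paper also covers.

```python
def graph_hamming_align(labels, edges, query):
    """Minimum substitutions to align `query` along some path in a
    node-labeled graph (Hamming model: changes to the query only,
    no indels). labels[v] is the character of node v; edges is a
    list of directed (u, v) pairs. Cycles are fine, because each
    DP layer consumes exactly one query character."""
    if not query:
        return 0
    INF = float("inf")
    n = len(labels)
    # cost of matching query[0] at each possible start node
    d = [0 if labels[v] == query[0] else 1 for v in range(n)]
    for ch in query[1:]:
        nd = [INF] * n
        for u, v in edges:  # relax every edge once per character
            cost = d[u] + (0 if labels[v] == ch else 1)
            if cost < nd[v]:
                nd[v] = cost
        d = nd
    return min(d)
```

Each of the m-1 layers scans all |E| edges, matching the O(m|E|) term of the paper's bound; handling insertions and deletions on cyclic graphs within the same bound is the harder part the paper addresses.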

Minimization-Aware Recursive \(K^{*}\) (\({ MARK}^{*}\)): A Novel, Provable Algorithm that Accelerates Ensemble-Based Protein Design and Provably Approximates the Energy Landscape

Protein design algorithms that model continuous sidechain flexibility and conformational ensembles better approximate the in vitro and in vivo behavior of proteins. The previous state of the art, iMinDEE-\(A^*\)-\(K^*\), computes provable \(\varepsilon \)-approximations to partition functions of protein states (e.g., bound vs. unbound) by computing provable, admissible pairwise-minimized energy lower bounds on protein conformations and using the \(A^*\) enumeration algorithm to return a gap-free list of lowest-energy conformations. iMinDEE-A\(^*\)-\(K^*\) runs in time sublinear in the number of conformations, but can be trapped in loosely-bounded, low-energy conformational wells containing many conformations with highly similar energies. That is, iMinDEE-\(A^*\)-\(K^*\) is unable to exploit the correlation between protein conformation and energy: similar conformations often have similar energy. We introduce two new concepts that exploit this correlation: Minimization-Aware Enumeration and Recursive \(K^{*}\). We combine these two insights into a novel algorithm, Minimization-Aware Recursive \(K^{*}\) (\({ MARK}^{*}\)), that tightens bounds not on single conformations, but instead on distinct regions of the conformation space. We compare the performance of iMinDEE-\(A^*\)-\(K^*\) vs. \({ MARK}^{*}\) by running the \(BBK^*\) algorithm, which provably returns sequences in order of decreasing \(K^{*}\) score, using either iMinDEE-\(A^*\)-\(K^*\) or \({ MARK}^{*}\) to approximate partition functions. We show on 200 design problems that \({ MARK}^{*}\) not only enumerates and minimizes vastly fewer conformations than the previous state of the art, but also runs up to two orders of magnitude faster. Finally, we show that \({ MARK}^{*}\) not only efficiently approximates the partition function, but also provably approximates the energy landscape. To our knowledge, \({ MARK}^{*}\) is the first algorithm to do so. 
We use \({ MARK}^{*}\) to analyze the change in energy landscape of the bound and unbound states of the HIV-1 capsid protein C-terminal domain in complex with camelid V\(_{\mathrm{{H}}}\)H, and measure the change in conformational entropy induced by binding. Thus, \({ MARK}^{*}\) both accelerates existing designs and offers new capabilities not possible with previous algorithms.
Jonathan D. Jou, Graham T. Holt, Anna U. Lowegard, Bruce R. Donald

Sparse Binary Relation Representations for Genome Graph Annotation

High-throughput DNA sequencing data is accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. While there has been good progress towards representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this work, we present a new compression approach, Multi-BRWT, which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world datasets.
Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, André Kahles

How Many Subpopulations Is Too Many? Exponential Lower Bounds for Inferring Population Histories

Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, like the seminal work of Li and Durbin (Nature, 2011), attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure—the history of multiple subpopulations that merge, split and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations.
Younhun Kim, Frederic Koehler, Ankur Moitra, Elchanan Mossel, Govind Ramnarayan

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the two main components of the FM-index: first, a rank data structure over the Burrows-Wheeler Transform (\({\mathsf{BWT}}\)) of the string, which allows us to find the interval in the string's suffix array (\({\mathsf{SA}}\)) containing pointers to the starting positions of occurrences of a given pattern; second, a sample of the \({\mathsf{SA}}\) that, when used with the rank data structure, allows us to access the \({\mathsf{SA}}\). The rank data structure can be kept small even for large genomic databases by run-length compressing the \({\mathsf{BWT}}\), but until recently there was no known way to keep the \({\mathsf{SA}}\) sample small without greatly slowing down access to the \({\mathsf{SA}}\). Now that Gagie et al. (SODA 2018) have defined an \({\mathsf{SA}}\) sample that takes about the same space as the run-length compressed \({\mathsf{BWT}}\), we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the \({\mathsf{BWT}}\) of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s \({\mathsf{SA}}\) sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the \({\mathsf{SA}}\) sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over Bowtie with respect to both memory and time.
Availability: The implementations of our methods can be found at https://gitlab.com/manzai/Big-BWT (BWT and SA sample construction) and at https://github.com/alshai/r-index (indexing).
Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini
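The two ingredients discussed above can be made concrete with a toy sketch: a naive rotation-sort BWT (the paper's whole point is how to avoid such naive construction at scale) followed by run-length compression, whose run count r is the quantity that the run-length FM-index and Gagie et al.'s SA sample scale with.

```python
def bwt(s):
    """Burrows-Wheeler Transform via sorted rotations. This O(n^2 log n)
    toy is only for illustration; the paper builds the BWT of huge
    genomic collections far more efficiently. '$' is a unique end
    marker that sorts before all other characters."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def run_length_encode(t):
    """Run-length compress a string into (char, run length) pairs.
    The r-index stores O(r) words, where r = len(run_length_encode(bwt(s)));
    for repetitive collections r is far smaller than len(s)."""
    runs = []
    for c in t:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]
```

For a highly repetitive input like `"AB" * 50`, the 101-character BWT collapses to just a handful of runs, which is why run-length compression is so effective on genomic databases of near-identical genomes.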

Tumor Copy Number Deconvolution Integrating Bulk and Single-Cell Sequencing Data

Characterizing intratumor heterogeneity (ITH) is crucial to understanding cancer development, but it is hampered by limits of available data sources. Bulk DNA sequencing is the most common technology to assess ITH, but mixes many genetically distinct cells in each sample, which must then be computationally deconvolved. Single-cell sequencing (SCS) is a promising alternative, but its limitations, e.g., high noise, difficulty scaling to large populations, technical artifacts, and large data sets, have so far made it impractical for studying cohorts of sufficient size to identify statistically robust features of tumor evolution. We have developed strategies for deconvolution and tumor phylogenetics combining limited amounts of bulk and single-cell data to gain some advantages of single-cell resolution with much lower cost, with specific focus on deconvolving genomic copy number data. We developed a mixed membership model for clonal deconvolution via non-negative matrix factorization (NMF), which balances deconvolution quality against similarity to single-cell samples and is solved with an efficient coordinate descent algorithm. We then improve on that algorithm by integrating deconvolution with clonal phylogeny inference, using a mixed integer linear programming (MILP) model to incorporate a minimum evolution phylogenetic tree cost into the problem objective. We demonstrate the effectiveness of these methods on semi-simulated data of known ground truth, showing improved deconvolution accuracy relative to bulk data alone.
Haoyun Lei, Bochuan Lyu, E. Michael Gertz, Alejandro A. Schäffer, Xulian Shi, Kui Wu, Guibo Li, Liqin Xu, Yong Hou, Michael Dean, Russell Schwartz
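The NMF backbone of this deconvolution can be sketched with standard Lee-Seung multiplicative updates: factor a bulk copy-number matrix X into nonnegative clonal profiles W and mixture fractions H. This generic sketch omits the paper's actual contributions, namely the single-cell similarity penalty, the coordinate descent solver, and the MILP phylogeny integration.

```python
import numpy as np

def nmf(X, k, iters=1000, seed=0):
    """Factor a nonnegative matrix X (sites x samples) as W @ H with
    W (sites x k clones) and H (k x samples), minimizing the
    Frobenius reconstruction error via Lee-Seung multiplicative
    updates. Multiplying by nonnegative ratios keeps W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n)) + 0.1
    eps = 1e-9  # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

On exactly low-rank data this recovers a near-perfect factorization; the paper's point is that bulk data alone leaves the factorization underdetermined, which the single-cell and phylogeny terms regularize.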

OMGS: Optical Map-Based Genome Scaffolding

Due to the current limitations of sequencing technologies, de novo genome assembly is typically carried out in two stages, namely contig (sequence) assembly and scaffolding. While scaffolding is computationally easier than sequence assembly, the scaffolding problem can be challenging due to the high repetitive content of eukaryotic genomes, possible mis-joins in assembled contigs and inaccuracies in the linkage information. Genome scaffolding tools either use paired-end/mate-pair/linked/Hi-C reads or genome-wide maps (optical, physical or genetic) as linkage information. Optical maps (in particular Bionano Genomics maps) have been extensively used in many recent large-scale genome assembly projects (e.g., goat, apple, barley, maize, quinoa, sea bass, among others). However, the most commonly used scaffolding tools have a serious limitation: they can only deal with one optical map at a time, forcing users to alternate or iterate over multiple maps. In this paper, we introduce a novel scaffolding algorithm called OMGS that for the first time can take advantage of multiple optical maps. OMGS solves several optimization problems to generate scaffolds with optimal contiguity and correctness. Extensive experimental results demonstrate that our tool outperforms existing methods when multiple optical maps are available, and produces comparable scaffolds using a single optical map. OMGS can be obtained from https://github.com/ucrbioinfo/OMGS.
Weihua Pan, Tao Jiang, Stefano Lonardi

Fast Approximation of Frequent k-mers and Applications to Metagenomics

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. While several methods have been designed for the exact or approximate solution of this problem, they all require processing the entire dataset, which can be extremely expensive for high-throughput sequencing datasets. While in some applications it is crucial to estimate all k-mers and their abundances, in other situations reporting only frequent k-mers, which appear with relatively high frequency in a dataset, may suffice. This is the case, for example, in the computation of k-mer abundance-based distances among datasets of reads, commonly used in metagenomic analyses.
In this work, we develop, analyze, and test a sampling-based approach, called SAKEIMA, to approximate the frequent k-mers and their frequencies in a high-throughput sequencing dataset while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the VC dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA can rigorously approximate frequent k-mers by processing only a fraction of a dataset and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing datasets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mer abundances, providing significant speed-ups in the analysis of large sequencing datasets.
Leonardo Pellegrina, Cinzia Pizzi, Fabio Vandin
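The basic idea, estimating k-mer frequencies from a sample of positions rather than a full count, can be sketched as below. This is a deliberate simplification: SAKEIMA's actual scheme samples blocks of k-mer occurrences and uses a VC-dimension argument to bound the sample size for (epsilon, delta) guarantees, whereas this sketch just samples positions uniformly with a user-chosen rate.

```python
import random
from collections import Counter

def sample_kmer_freqs(reads, k, sample_frac, seed=0):
    """Estimate k-mer relative frequencies by counting only a random
    fraction of k-mer positions, so the work scales with the sample
    rather than the full dataset."""
    rnd = random.Random(seed)
    counts, total = Counter(), 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if rnd.random() < sample_frac:
                counts[read[i:i + k]] += 1
                total += 1
    if total == 0:
        return {}
    return {km: c / total for km, c in counts.items()}

def frequent_kmers(freqs, theta):
    """Report k-mers whose estimated frequency is at least theta
    (frequent k-mers suffice, e.g., for abundance-based distances)."""
    return {km for km, f in freqs.items() if f >= theta}
```

Processing half the positions here already pins down the frequent k-mers; the paper's contribution is proving how small the sample can be while still guaranteeing approximation quality.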

De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, in terms of both overall accuracy and scalability to large datasets. Our tool is available at https://github.com/ksahlin/isONclust.
Kristoffer Sahlin, Paul Medvedev
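The greedy skeleton of such a clusterer can be sketched in a few lines: process reads in order of decreasing quality and assign each to the first cluster whose representative it sufficiently resembles, else found a new cluster. All specifics below are illustrative stand-ins: read length is used as a crude quality proxy, and plain k-mer set overlap replaces isONclust's quality-value-weighted minimizer matching.

```python
def greedy_cluster(reads, k=5, min_shared=0.5):
    """Greedy single-pass clustering: each read joins the first
    cluster whose representative shares at least `min_shared` of the
    read's k-mers; otherwise it founds a new cluster whose k-mer set
    becomes the representative. One pass keeps the method scalable."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    reps, clusters = [], []
    # longest (proxy for highest-quality) reads become representatives
    for r in sorted(reads, key=len, reverse=True):
        ks = kmers(r)
        for i, rep in enumerate(reps):
            if ks and len(ks & rep) / len(ks) >= min_shared:
                clusters[i].append(r)
                break
        else:
            reps.append(ks)
            clusters.append([r])
    return clusters
```

Because error-prone reads break exact k-mer matches, the real algorithm weights each k-mer by its probability of being correct, which is where the quality values come in.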

A Sticky Multinomial Mixture Model of Strand-Coordinated Mutational Processes in Cancer

The characterization of mutational processes in terms of their signatures of activity relies, for the most part, on the assumption that mutations in a given cancer genome are independent of one another. Recently, it was discovered that certain segments of mutations, termed processive groups, occur on the same DNA strand and are generated by a single process or signature. Here we provide a first probabilistic model of mutational signatures that accounts for their observed stickiness and strand-coordination. The model conditions on the observed strand for each mutation, and allows the same signature to generate a run of mutations. We show that this model provides a more accurate description of the properties of mutagenic processes than independent-mutation models or strand-oblivious models, achieving substantially higher likelihood on held-out data. We apply this model to characterize the processivity of mutagenic processes across multiple types of cancer in terms of replication and transcriptional strand-coordination.
Itay Sason, Damian Wojtowicz, Welles Robinson, Mark D. M. Leiserson, Teresa M. Przytycka, Roded Sharan

Disentangled Representations of Cellular Identity

We introduce a disentangled representation for cellular identity that constructs a latent cellular state from a linear combination of condition specific basis vectors that are then decoded into gene expression levels. The basis vectors are learned with a deep autoencoder model from single-cell RNA-seq data. Linear arithmetic in the disentangled representation successfully predicts nonlinear gene expression interactions between biological pathways in unobserved treatment conditions. We are able to recover the mean gene expression profiles of unobserved conditions with an average Pearson r = 0.73, which outperforms two linear baselines, one with an average r = 0.43 and another with an average r = 0.19. Disentangled representations hold the promise to provide new explanatory power for the interaction of biological pathways and the prediction of effects of unobserved conditions for applications such as combinatorial therapy and cellular reprogramming. Our work is motivated by recent advances in deep generative models that have enabled synthesis of images and natural language with desired properties from interpolation in a “latent representation” of the data.
Ziheng Wang, Grace H. T. Yeo, Richard Sherwood, David Gifford
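The representation described above can be illustrated with a toy: a cell's latent state is a sum of condition-specific basis vectors, and a decoder maps that state to expression levels. Everything here is a stand-in: the decoder is a fixed random linear map with a ReLU, whereas the paper learns both the basis vectors and a deep nonlinear decoder, and it is that nonlinearity which lets linear latent arithmetic capture pathway interactions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, latent_dim = 20, 4

# condition-specific basis vectors (learned in the paper, random here)
basis = {"ctrl": rng.normal(size=latent_dim),
         "drugA": rng.normal(size=latent_dim),
         "drugB": rng.normal(size=latent_dim)}
W = rng.normal(size=(n_genes, latent_dim))  # toy decoder weights

def decode(z):
    """Map a latent state to nonnegative expression levels
    (ReLU stands in for the paper's deep decoder)."""
    return np.maximum(W @ z, 0.0)

# predict an unobserved combined condition by latent arithmetic
z_combo = basis["ctrl"] + basis["drugA"] + basis["drugB"]
predicted = decode(z_combo)
```

The interesting result in the paper is that this kind of latent addition, pushed through a learned decoder, predicts expression in held-out treatment combinations (mean Pearson r = 0.73) far better than linear baselines.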

RENET: A Deep Learning Approach for Extracting Gene-Disease Associations from Literature

Over one million new biomedical articles are published every year. Efficient and accurate text-mining tools are urgently needed to automatically extract knowledge from these articles to support research and genetic testing. In particular, the extraction of gene-disease associations is among the most studied tasks. However, existing text-mining tools for extracting gene-disease associations have limited capacity, as each sentence is considered separately. Our experiments show that the best existing tools, such as BeFree and DTMiner, achieve a precision of 48% and recall rate of 78% at most. In this study, we designed and implemented a deep learning approach, named RENET, which considers the correlation between the sentences in an article to extract gene-disease associations. Our method has significantly improved the precision and recall rate to 85.2% and 81.8%, respectively. The source code of RENET is available at https://bitbucket.org/alexwuhkucs/gda-extraction/src/master/.
Ye Wu, Ruibang Luo, Henry C. M. Leung, Hing-Fung Ting, Tak-Wah Lam

