
2018 | Book

Research in Computational Molecular Biology

22nd Annual International Conference, RECOMB 2018, Paris, France, April 21-24, 2018, Proceedings


About this book

This book constitutes the proceedings of the 22nd Annual Conference on Research in Computational Molecular Biology, RECOMB 2018, held in Paris, France, in April 2018.
The 16 extended and 22 short abstracts presented were carefully reviewed and selected from 193 submissions. They report on original research in all areas of computational molecular biology and bioinformatics; the short abstracts are included in the back matter of the volume.

Table of Contents

Frontmatter
Long Reads Enable Accurate Estimates of Complexity of Metagenomes
Abstract
Although reduced microbiome diversity has been linked to various diseases, estimating the diversity of bacterial communities (the number and the total length of distinct genomes within a metagenome) remains an open problem in microbial ecology. We describe the first analysis of microbial diversity using long reads without any assumption on the frequencies of genomes within a metagenome (parametric methods) and without requiring a large database that covers the total diversity (non-parametric methods). Long read technologies provide new insights into the diversity of metagenomes by interrogating rare species that remained below the radar of previous approaches based on short reads. We present a novel approach for estimating the diversity of metagenomes based on joint analysis of short and long reads and benchmark it on various datasets. We estimate that the genomes comprising a human gut metagenome have a total length varying from 1.3 to 3.5 billion nucleotides, with the genomes responsible for \(50\%\) of total abundance having a total length of only 40 to 60 million nucleotides. In contrast, the genomes comprising an aquifer sediment metagenome have a total length more than two orders of magnitude larger (\({\approx }840\) billion nucleotides).
Anton Bankevich, Pavel Pevzner
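As a toy illustration of the kind of summary statistic quoted above (not the authors' estimation method), the sketch below computes the total length of the most abundant genomes that together account for half of a metagenome's abundance, assuming genome lengths and relative abundances have already been estimated; all data are made up.

def length_at_abundance_quantile(genomes, quantile=0.5):
    """genomes: list of (genome_length, relative_abundance) pairs."""
    total = sum(a for _, a in genomes)
    covered, length = 0.0, 0
    # Greedily take genomes from most to least abundant.
    for glen, abund in sorted(genomes, key=lambda g: -g[1]):
        length += glen
        covered += abund
        if covered >= quantile * total:
            break
    return length

# Toy metagenome: (genome length in bp, relative abundance).
toy = [(4_500_000, 40.0), (2_000_000, 30.0), (5_000_000, 5.0), (3_000_000, 1.0)]
print(length_at_abundance_quantile(toy))  # 4500000: one genome covers 50%
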
Chromatyping: Reconstructing Nucleosome Profiles from NOMe Sequencing Data
Abstract
Measuring nucleosome positioning in cells is crucial for the analysis of epigenetic gene regulation. Reconstruction of nucleosome profiles of individual cells or subpopulations of cells remains challenging because most genome-wide assays measure nucleosome positioning and DNA accessibility for thousands of cells using bulk sequencing. Here we use characteristics of the NOMe-sequencing assay to derive a new approach, called ChromaClique, for deconvolution of different nucleosome profiles (chromatypes) from cell subpopulations of one NOMe-seq measurement. ChromaClique uses a maximal clique enumeration algorithm on a newly defined NOMe read graph that is able to group reads according to their nucleosome profiles. We show that the edge probabilities of that graph can be efficiently computed using Hidden Markov Models. We demonstrate using simulated data that ChromaClique is more accurate than a related method and scales favorably, allowing genome-wide analyses of chromatypes in cell subpopulations. Software is available at https://github.com/shounak1990/ChromaClique under MIT license.
Shounak Chakraborty, Stefan Canzar, Tobias Marschall, Marcel H. Schulz
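A minimal sketch of the clique-grouping step described above, assuming the pairwise edge probabilities have already been computed (ChromaClique derives them with Hidden Markov Models); the reads, probabilities, and the 0.9 threshold below are hypothetical.

import networkx as nx

def chromatype_groups(edge_probs, threshold=0.9):
    """edge_probs maps (read_i, read_j) to the probability that both reads
    stem from the same nucleosome profile."""
    g = nx.Graph()
    for (u, v), p in edge_probs.items():
        if p >= threshold:  # keep only confident 'same-profile' edges
            g.add_edge(u, v)
    # Bron-Kerbosch enumeration of maximal cliques = candidate chromatypes.
    return list(nx.find_cliques(g))

probs = {("r1", "r2"): 0.97, ("r2", "r3"): 0.95, ("r1", "r3"): 0.92,
         ("r3", "r4"): 0.40, ("r4", "r5"): 0.99}
print(chromatype_groups(probs))  # e.g. [['r1', 'r2', 'r3'], ['r4', 'r5']]
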
GTED: Graph Traversal Edit Distance
Abstract
Many problems in applied machine learning deal with graphs (also called networks), in domains including social networks, security, web data mining, protein function prediction, and genome informatics. The kernel paradigm beautifully decouples the learning algorithm from the underlying geometric space, which renders graph kernels important for the aforementioned applications.
In this paper, we give a new graph kernel which we call graph traversal edit distance (GTED). We introduce the GTED problem and give the first polynomial time algorithm for it. Informally, the graph traversal edit distance is the minimum edit distance between two strings formed by the edge labels of respective Eulerian traversals of the two graphs. GTED is motivated by, and provides the first mathematical formalism for, sequence co-assembly and de novo variation detection in bioinformatics.
We demonstrate that GTED admits a polynomial time algorithm using a linear program in the graph product space that is guaranteed to yield an integer solution. To the best of our knowledge, this is the first approach to this problem. We also give a linear programming relaxation algorithm for a lower bound on GTED. We use GTED as a graph kernel and evaluate it by computing the accuracy of an SVM classifier on a few datasets from the literature. Our results suggest that our kernel outperforms many of the common graph kernels on the tested datasets. As a second set of experiments, we successfully cluster viral genomes using GTED on their assembly graphs obtained from de novo assembly of next generation sequencing reads. Our GTED implementation can be downloaded from http://chitsazlab.org/software/gted/.
Ali Ebrahimpour Boroojeny, Akash Shrestha, Ali Sharifi-Zarchi, Suzanne Renick Gallagher, S. Cenk Sahinalp, Hamidreza Chitsaz
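To make the informal definition concrete, here is a brute-force illustration of GTED on tiny graphs (exponential time, not the paper's polynomial-time linear program): enumerate the edge-label strings of all Eulerian circuits of both graphs from given start nodes and take the minimum pairwise edit distance.

from itertools import product

def eulerian_strings(edges, start):
    """edges: list of (u, v, label) arcs. Returns the label strings of all
    Eulerian circuits that start at the given node."""
    out = []
    def walk(node, used, labels):
        if all(used):
            out.append("".join(labels))
            return
        for i, (u, v, lab) in enumerate(edges):
            if not used[i] and u == node:
                used[i] = True
                walk(v, used, labels + [lab])
                used[i] = False
    walk(start, [False] * len(edges), [])
    return out

def edit_distance(a, b):
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def gted_bruteforce(g1, s1, g2, s2):
    return min(edit_distance(a, b)
               for a, b in product(eulerian_strings(g1, s1),
                                   eulerian_strings(g2, s2)))

# Two tiny 2-node graphs whose circuits spell "AT" and "AA" respectively.
g1 = [(0, 1, "A"), (1, 0, "T")]
g2 = [(0, 1, "A"), (1, 0, "A")]
print(gted_bruteforce(g1, 0, g2, 0))  # 1
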
Statistical Inference of Peroxisome Dynamics
Abstract
The regulation of organelle abundance sustains critical biological processes, such as metabolism and energy production. Biochemical models mathematically express these temporal changes in terms of reactions, and their rates. The rate parameters are critical components of the models, and must be experimentally inferred. However, the existing methods for rate inference are limited, and not directly applicable to organelle dynamics.
This manuscript introduces a novel approach that integrates modeling, inference and experimentation, and incorporates biological replicates, to accurately infer the rates. The approach relies on a biochemical model in the form of a stochastic differential equation, and on a parallel implementation of inference with a particle filter. It also relies on a novel microscopy workflow that monitors organelles over long periods of time in cell culture. Evaluations on simulated datasets demonstrated the advantages of this approach in terms of increased accuracy and shortened computation time. An application to imaging of peroxisomes determined that fission, rather than de novo generation, is predominant in maintaining the organelle level under basal conditions. This biological insight serves as a starting point for a system view of organelle regulation in cells.
Cyril Galitzine, Pierre M. Jean Beltran, Ileana M. Cristea, Olga Vitek
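A bootstrap particle filter is easy to sketch for a simplified version of this setting. The model below (de novo generation at rate a, fission at rate b per organelle, Gaussian observation noise) and all numbers are illustrative assumptions, not the paper's stochastic differential equation model.

import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(a, b, observations, dt=1.0, n_particles=500, obs_sd=2.0):
    """Particle-filter estimate (up to an additive constant) of the
    log-likelihood of rates (a, b) given noisy counts at intervals dt."""
    particles = np.full(n_particles, observations[0], dtype=float)
    loglik = 0.0
    for y in observations[1:]:
        # Propagate: Poisson number of de novo + fission events per interval.
        rates = np.clip(a + b * particles, 0.0, None)
        particles = particles + rng.poisson(rates * dt)
        # Weight by a Gaussian observation model around the latent count.
        w = np.exp(-0.5 * ((y - particles) / obs_sd) ** 2)
        if w.sum() == 0.0:
            return -np.inf
        loglik += np.log(w.mean())
        # Bootstrap resampling proportionally to the weights.
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        particles = particles[idx]
    return loglik

counts = [20, 24, 29, 33, 40, 46]  # hypothetical image-derived counts
print(log_likelihood(0.5, 0.15, counts))
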
Loss-Function Learning for Digital Tissue Deconvolution
Abstract
The gene expression profile of a tissue averages the expression profiles of all cells in this tissue. Digital tissue deconvolution (DTD) addresses the following inverse problem: Given the expression profile y of a tissue, what is the cellular composition c of that tissue? If X is a matrix whose columns are reference profiles of individual cell types, the composition c can be computed by minimizing \(\mathcal {L}(y-Xc)\) for a given loss function \(\mathcal {L}\). Current methods use predefined all-purpose loss functions. They successfully quantify the dominating cells of a tissue, while often falling short in detecting small cell populations.
Here we use training data to learn the loss function \(\mathcal {L}\) along with the composition c. This allows us to adapt to application-specific requirements such as focusing on small cell populations or distinguishing phenotypically similar cell populations. Our method quantifies large cell fractions as accurately as existing methods and significantly improves the detection of small cell populations and the distinction of similar cell types.
Franziska Görtler, Stefan Solbrig, Tilo Wettig, Peter J. Oefner, Rainer Spang, Michael Altenbuchinger
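Since the composition is obtained by minimizing \(\mathcal {L}(y-Xc)\), a weighted quadratic loss makes the idea concrete: with per-gene weights g (learned from training data in the paper; simply given below), minimizing \(\sum _i g_i(y_i-(Xc)_i)^2\) over \(c\ge 0\) reduces to non-negative least squares on rescaled inputs. This is a minimal sketch with made-up numbers, not the authors' implementation.

import numpy as np
from scipy.optimize import nnls

def deconvolve(X, y, g):
    """X: genes x cell-types reference matrix; y: tissue profile;
    g: non-negative per-gene loss weights."""
    w = np.sqrt(g)
    c, _ = nnls(w[:, None] * X, w * y)  # weighted non-negative least squares
    return c / c.sum()                  # normalize to cell-type fractions

# Toy example: 4 genes, 2 cell types, tissue = 70/30 mixture.
X = np.array([[5.0, 0.1], [4.0, 0.3], [0.2, 6.0], [0.1, 3.0]])
y = X @ np.array([0.7, 0.3])
print(deconvolve(X, y, g=np.ones(4)))  # ~[0.7, 0.3]
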
Inference of Population Structure from Ancient DNA
Abstract
Methods for inferring population structure from genetic information traditionally assume samples are contemporary. Yet, the increasing availability of ancient DNA sequences begs revision of this paradigm. We present Dystruct (Dynamic Structure), a framework and toolbox for inference of shared ancestry from data that include ancient DNA. By explicitly modeling population history and genetic drift as a time-series, Dystruct more accurately and realistically discovers shared ancestry from ancient and contemporary samples. Formally, we use a normal approximation of drift, which allows a novel, efficient algorithm for optimizing model parameters using stochastic variational inference. We show that Dystruct outperforms the state of the art when individuals are sampled over time, as is common in ancient DNA datasets. We further demonstrate the utility of our method on a dataset of 92 ancient samples alongside 1,941 modern ones genotyped at 222,755 loci. Our model tends to present modern samples as the mixtures of ancestral populations they really are, rather than the artifactual converse of presenting ancestral samples as mixtures of contemporary groups.
Tyler A. Joseph, Itsik Pe’er
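The normal approximation of drift mentioned above is simple to simulate: per generation, an allele frequency f moves approximately as \(\mathcal {N}(f,\, f(1-f)/(2N))\) for effective population size N. A short illustrative simulation with made-up parameters (Dystruct's actual time-series model is richer):

import numpy as np

rng = np.random.default_rng(1)

def drift_trajectory(f0, pop_size, generations):
    f, traj = f0, [f0]
    for _ in range(generations):
        var = f * (1 - f) / (2 * pop_size)
        # Normal approximation of one generation of Wright-Fisher drift.
        f = float(np.clip(rng.normal(f, np.sqrt(var)), 0.0, 1.0))
        traj.append(f)
    return traj

print(drift_trajectory(f0=0.5, pop_size=1000, generations=5))
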
Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-linear Chaining Extended
Abstract
Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics. Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read and subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combine the anchors. It is a well-known approach for the case of two sequences as inputs. Here we extend the approach so that one of the inputs can be a directed acyclic graph (DAG), e.g. a splicing graph in transcriptomics or a variant graph in pan-genomics.
This extension to DAGs turns out to have a tight connection to the minimum path cover problem, which asks for a minimum-cardinality set of paths that together cover all the nodes of a DAG. We study the case when the size k of a minimum path cover is small, which is often the case in practice. First, we propose an algorithm for finding a minimum path cover of a DAG \(G=(V,E)\) in \(O(k|E|\log |V|)\) time, improving all known time-bounds when k is small and the DAG is not too dense. Second, we introduce a general technique for extending dynamic programming (DP) algorithms from sequences to DAGs. This is enabled by our minimum path cover algorithm, and works by mimicking the DP algorithm for sequences on each path of the minimum path cover. This technique generally produces algorithms that are slower than their counterparts on sequences only by a factor k. It can be applied, for example, to the classical longest increasing subsequence and longest common subsequence problems, extended to labeled DAGs. Finally, we apply this technique to the co-linear chaining problem, which is a generalization of both of these problems. We also implemented the new co-linear chaining approach. Experiments on splicing graphs show that the new method is also efficient in practice.
Anna Kuosmanen, Topi Paavilainen, Travis Gagie, Rayan Chikhi, Alexandru Tomescu, Veli Mäkinen
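For contrast with the paper's \(O(k|E|\log |V|)\) algorithm, the classical reduction of minimum path cover to bipartite matching on the transitive closure fits in a few lines. This baseline is far slower but makes the problem concrete; the example DAG is made up.

import networkx as nx

def minimum_path_cover_size(dag):
    tc = nx.transitive_closure_dag(dag)
    # Bipartite graph: a left and a right copy of each node, one edge per
    # reachable pair; MPC size = |V| - size of a maximum matching.
    b = nx.Graph()
    left = [("L", v) for v in dag.nodes]
    b.add_nodes_from(left, bipartite=0)
    b.add_nodes_from((("R", v) for v in dag.nodes), bipartite=1)
    for u, v in tc.edges:
        b.add_edge(("L", u), ("R", v))
    matching = nx.bipartite.hopcroft_karp_matching(b, top_nodes=left)
    return dag.number_of_nodes() - len(matching) // 2

g = nx.DiGraph([(1, 2), (1, 3), (2, 4), (3, 4)])
print(minimum_path_cover_size(g))  # 2, e.g. paths 1-2-4 and 1-3-4
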
Modeling Dependence in Evolutionary Inference for Proteins
Abstract
Protein structure alignment is a classic problem of computational biology, and is widely used to identify structural and functional similarity and to infer homology among proteins. Previously, a statistical model for protein structural evolution was introduced and shown to significantly improve phylogenetic inferences compared to approaches that utilize only amino acid sequence information. Here we extend this model to account for correlated evolutionary drift among neighboring amino acid positions, resulting in a spatio-temporal model of protein structure evolution. The result is a multivariate diffusion process convolved with a spatial birth-death process, which comes with little additional computational cost or analytical complexity compared to the site-independent model (SIM). We demonstrate that this extended, site-dependent model (SDM) yields a significant reduction of bias in estimated evolutionary distances and helps further improve phylogenetic tree reconstruction.
Gary Larson, Jeffrey L. Thorne, Scott Schmidler
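As a rough picture of correlated drift among neighboring positions (an illustrative toy, not the paper's diffusion convolved with a birth-death process), the sketch below evolves site coordinates under an Ornstein-Uhlenbeck process whose noise covariance decays with sequence distance, so adjacent sites drift together:

import numpy as np

rng = np.random.default_rng(3)

def ou_step(x, mu, theta, cov_chol, dt):
    """One Euler-Maruyama step of dX = theta*(mu - X)dt + Sigma^{1/2} dW."""
    noise = cov_chol @ rng.normal(size=x.shape)
    return x + theta * (mu - x) * dt + np.sqrt(dt) * noise

n_sites = 5
idx = np.arange(n_sites)
# Covariance with exponential decay in sequence distance couples neighbors.
cov = 0.1 * np.exp(-np.abs(idx[:, None] - idx[None, :]))
chol = np.linalg.cholesky(cov)
x = np.zeros(n_sites)
for _ in range(100):
    x = ou_step(x, mu=np.zeros(n_sites), theta=0.5, cov_chol=chol, dt=0.1)
print(np.round(x, 2))
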
Constrained De Novo Sequencing of neo-Epitope Peptides Using Tandem Mass Spectrometry
Abstract
Neoepitope peptides are newly formed antigens presented by major histocompatibility complex class I (MHC-I) on cell surfaces. The cells presenting neoepitope peptides are recognized and subsequently killed by cytotoxic T-cells. Immunopeptidomic approaches aim to characterize the peptide repertoire (including neoepitopes) associated with the MHC-I molecules on the surface of tumor cells using proteomic technologies, providing critical information for designing effective immunotherapy strategies. We developed a novel constrained de novo sequencing algorithm to identify neoepitope peptides from tandem mass spectra acquired in immunopeptidomic analyses. Our method assigns prior probabilities to putative peptides according to position specific scoring matrices (PSSMs) representing the sequence preferences recognized by MHC-I molecules. We implemented a dynamic programming algorithm to determine the peptide sequences with an optimal posterior matching score for each given MS/MS spectrum. As in general de novo peptide sequencing, the dynamic programming algorithm allows efficient searching of the entire peptide sequence space. On an LC-MS/MS dataset, we demonstrated that our algorithm detects the neoepitope peptides bound by HLA-C*0501 molecules with performance superior to database search approaches and existing general-purpose de novo peptide sequencing algorithms.
Sujun Li, Alex DeCourcy, Haixu Tang
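A toy version of the constrained dynamic program conveys the idea: peptides of fixed length are scored by a PSSM prior (position-specific log-probabilities) plus a bonus whenever a prefix mass matches an observed peak, with masses rounded to integers. The alphabet, PSSM, peaks, and scores below are all made up; the real algorithm optimizes a full posterior matching score against the spectrum.

from math import inf

AA_MASS = {"G": 57, "A": 71, "S": 87, "V": 99}  # toy integer residue masses

def best_peptide(pssm, peaks, peak_bonus=2.0):
    length = len(pssm)
    dp = {(0, 0): (0.0, "")}                     # key: (position, prefix mass)
    for pos in range(length):
        nxt = {}
        for (p, mass), (score, pep) in dp.items():
            if p != pos:
                continue
            for aa, m in AA_MASS.items():
                s = score + pssm[pos].get(aa, -inf)
                if mass + m in peaks:            # prefix ion seen in spectrum
                    s += peak_bonus
                key = (pos + 1, mass + m)
                if s > nxt.get(key, (-inf, ""))[0]:
                    nxt[key] = (s, pep + aa)
        dp.update(nxt)
    return max((v for (p, _), v in dp.items() if p == length),
               key=lambda t: t[0])

pssm = [{"G": -0.5, "A": -1.0}, {"A": -0.3, "V": -2.0}, {"S": -0.1}]
peaks = {57, 128, 215}                           # prefix masses of "GAS"
print(best_peptide(pssm, peaks))                 # (~5.1, 'GAS')
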
Reverse de Bruijn: Utilizing Reverse Peptide Synthesis to Cover All Amino Acid k-mers
Abstract
Peptide arrays measure the binding intensity of a specific protein to thousands of amino acid peptides. By using peptides that cover all k-mers, a comprehensive picture of the binding spectrum is obtained. Researchers would like to measure binding to the longest k-mer possible, but are constrained by the number of peptides that can fit into a single microarray. A key challenge is designing a minimum number of peptides that cover all k-mers. Here, we suggest a novel idea to reduce the length of the sequence covering all k-mers by utilizing a unique property of the peptide synthesis process. Since the synthesis can start from both ends of the peptide template, it is enough to cover each k-mer or its reverse, and use the same template twice: in forward and in reverse. The computational problem is then to generate a minimum length sequence that, for each k-mer, contains either it or its reverse. We developed an algorithm, ReverseCAKE, to generate such a sequence. ReverseCAKE runs in time linear in the output size and is guaranteed to produce a sequence that is at most \(\varTheta (\sqrt{n}\log {n})\) characters longer than the optimal length n. The saving factor obtained by ReverseCAKE approaches the theoretical lower bound as k increases. In addition, we formulated the problem as an integer linear program and empirically observed that the solutions obtained by ReverseCAKE are near-optimal. Through this work we enable more effective design of peptide microarrays.
Yaron Orenstein
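The relaxed covering condition is easy to state in code. Below, a checker verifies that every k-mer occurs forwards or reversed, and a classic Lyndon-word de Bruijn sequence serves as the baseline that covers every k-mer outright (and is therefore roughly twice as long as the relaxed optimum). This is illustrative scaffolding, not the ReverseCAKE algorithm itself.

from itertools import product

def covers_all_or_reverse(seq, k, alphabet):
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return all(kmer in present or kmer[::-1] in present
               for kmer in map("".join, product(alphabet, repeat=k)))

def de_bruijn(alphabet, k):
    """Classic construction of a cyclic de Bruijn sequence via Lyndon words,
    linearized by appending its first k-1 characters."""
    a = [0] * (k * len(alphabet))
    out = []
    def db(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, len(alphabet)):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    s = "".join(alphabet[i] for i in out)
    return s + s[:k - 1]

seq = de_bruijn("ACGT", 3)
print(len(seq), covers_all_or_reverse(seq, 3, "ACGT"))  # 66 True
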
Circular Networks from Distorted Metrics
Abstract
Trees have long been used as a graphical representation of species relationships. However, complex evolutionary events, such as genetic reassortments or hybrid speciations, which occur commonly in viruses, bacteria and plants, do not fit into this elementary framework. Alternatively, various network representations have been developed. Circular networks are a natural generalization of leaf-labeled trees interpreted as split systems, that is, collections of bipartitions over leaf labels corresponding to current species. Although such networks do not explicitly model specific evolutionary events of interest, their straightforward visualization and fast reconstruction have made them a popular exploratory tool to detect network-like evolution in genetic datasets. Standard reconstruction methods for circular networks, such as Neighbor-Net, rely on an associated metric on the species set. Such a metric is first estimated from DNA sequences, which leads to a key difficulty: distantly related sequences produce statistically unreliable estimates. This is problematic for Neighbor-Net as it is based on the popular tree reconstruction method Neighbor-Joining, whose sensitivity to distance estimation errors is well established theoretically. In the tree case, more robust reconstruction methods have been developed using the notion of a distorted metric, which captures the dependence of the error in the distance estimate through a radius of accuracy. Here we design the first circular network reconstruction method based on distorted metrics. Our method is computationally efficient. Moreover, the analysis of its radius of accuracy highlights the important role played by the maximum incompatibility, a measure of the extent to which the network differs from a tree.
Sebastien Roch, Kun-Chieh Wang
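For readers unfamiliar with the notion, a distorted metric can be stated roughly as follows (a hedged paraphrase of prior work on trees; the paper's exact conditions may differ in detail): an estimate \(\hat{d}\) of the true evolutionary metric d has distortion \(\tau \) within radius r if

\[ d(x,y) < r \ \text{ or } \ \hat{d}(x,y) < r \ \implies \ \bigl|\hat{d}(x,y) - d(x,y)\bigr| < \tau \qquad \text{for all species } x, y, \]

that is, estimated distances are trusted only up to a radius of accuracy, which is precisely the regime in which distance estimates from DNA sequences are statistically reliable.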
A Nested 2-Level Cross-Validation Ensemble Learning Pipeline Suggests a Negative Pressure Against Crosstalk snoRNA-mRNA Interactions in Saccharomyces cerevisiae
Abstract
The growing number of RNA-mediated regulation mechanisms identified in the last decades suggests a widespread impact of RNA-RNA interactions. The efficiency of the regulation relies on highly specific and coordinated interactions, while simultaneously repressing the formation of opportunistic complexes. However, the analysis of RNA interactomes is highly challenging due to the large number of potential partners, discrepancy of the size of RNA families, and the inherent noise in interaction predictions.
We designed a recursive 2-step cross-validation pipeline to capture the specificity of ncRNA-mRNA interactomes. Our method is designed to detect significant losses or gains of specificity between ncRNA-mRNA interaction profiles. Applied to snoRNA-mRNA interactions in Saccharomyces cerevisiae, our results suggest that ncRNA affinities for mRNAs are repressed, and thus that an evolutionary pressure inhibits such interactions.
Antoine Soulé, Jean-Marc Steyaert, Jérôme Waldispühl
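A generic 2-level (nested) cross-validation skeleton of the kind this pipeline builds on can be written with scikit-learn: the inner loop tunes hyperparameters while the outer loop gives an unbiased performance estimate. The classifier, parameter grid, and random data below are placeholders, not the paper's ensemble.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))       # stand-in interaction-profile features
y = rng.integers(0, 2, size=100)     # stand-in labels

# Inner level: hyperparameter selection; outer level: honest evaluation.
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid={"n_estimators": [50, 100]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
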
Context-Specific Nested Effects Models
Abstract
Advances in systems biology have made clear the importance of network models for capturing knowledge about complex relationships in gene regulation, metabolism, and cellular signaling. A common approach to uncovering biological networks involves performing perturbations on elements of the network, such as gene knockdown experiments, and measuring how the perturbation affects some reporter of the process under study. In this paper, we develop context-specific nested effects models (CSNEMs), an approach to inferring such networks that generalizes nested effect models (NEMs). The main contribution of this work is that CSNEMs explicitly model the participation of a gene in multiple contexts, meaning that a gene can appear in multiple places in the network. Biologically, the representation of regulators in multiple contexts may indicate that these regulators have distinct roles in different cellular compartments or cell cycle phases. We present an evaluation of the method on simulated data as well as on data from a study of the sodium chloride stress response in Saccharomyces cerevisiae.
Yuriy Sverchkov, Yi-Hsuan Ho, Audrey Gasch, Mark Craven
Algorithmic Framework for Approximate Matching Under Bounded Edits with Applications to Sequence Analysis
Abstract
We present a novel algorithmic framework for solving approximate sequence matching problems that permit a bounded total number k of mismatches, insertions, and deletions. The core of the framework relies on transforming an approximate matching problem into a corresponding exact matching problem on suitably edited string suffixes, while carefully controlling the required number of such edited suffixes to enable the design of efficient algorithms. For a total input size of n, our framework limits the number of generated edited suffixes to no more than a factor of \(O(\log ^k n)\) of the input size (for any constant k), and restricts the algorithm to linear space usage by overlapping the generation and processing of edited suffixes. Our framework improves the best known upper bound of \(n^2 k^{1.5}/ 2^{\varOmega (\sqrt{{\log n}/{k}})}\) for the classic k-edit longest common substring problem [Abboud, Williams, and Yu; SODA 2015] to yield the first strictly sub-quadratic time algorithm that runs in \(O(n\log ^k n)\) time and O(n) space for any constant k. We present similar subquadratic time and linear space algorithms for (i) computing the alignment-free distance between two genomes based on the k-edit average common substring measure, (ii) mapping reads/read fragments to a reference genome while allowing up to k edits, and (iii) computing all-pair maximal k-edit common substrings (also, suffix/prefix overlaps), which has applications in clustering and assembly. We expect our algorithmic framework to be a broadly applicable theoretical tool that may inspire the design of practical heuristics and software.
Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, Srinivas Aluru
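As a point of reference for the k-mismatch special case (Hamming distance only; the framework above handles full edits and runs subquadratically), the classic quadratic-time baseline slides a window with at most k mismatches along every diagonal of the comparison matrix:

def k_mismatch_lcs(a, b, k):
    best = 0
    for shift in range(-len(b) + 1, len(a)):   # one pass per diagonal
        i0, j0 = max(0, shift), max(0, -shift)
        length = min(len(a) - i0, len(b) - j0)
        left = mism = 0
        for right in range(length):
            mism += a[i0 + right] != b[j0 + right]
            while mism > k:                    # shrink window from the left
                mism -= a[i0 + left] != b[j0 + left]
                left += 1
            best = max(best, right - left + 1)
    return best

print(k_mismatch_lcs("ACGTACGT", "ACCTAGGT", k=1))  # 5
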
Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes
Abstract
Exploring the genetic diversity of microbes within the environment through metagenomic sequencing first requires classifying these reads into taxonomic groups. Current methods compare these sequencing data with existing biased and limited reference databases. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both are especially problematic for the identification of low-abundance microbial species, e.g., detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model, which specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with \({\le }0.02\%\) abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones.
SPARSE and all evaluation scripts are available at https://github.com/zheminzhou/SPARSE.
Zhemin Zhou, Nina Luhmann, Nabil-Fareed Alikhan, Christopher Quince, Mark Achtman
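A minimal expectation-maximization sketch conveys the flavor of probabilistic read assignment (much simpler than SPARSE's model, which additionally penalizes non-specific mappings from unknown sources): given per-read mapping likelihoods against reference clusters, alternate between read responsibilities and cluster abundances. All numbers are made up.

import numpy as np

def em_abundances(lik, n_iter=100):
    """lik[i, j]: likelihood of read i under reference cluster j."""
    pi = np.full(lik.shape[1], 1.0 / lik.shape[1])  # uniform initial abundances
    for _ in range(n_iter):
        resp = lik * pi                             # E-step (unnormalized)
        resp /= resp.sum(axis=1, keepdims=True)
        pi = resp.mean(axis=0)                      # M-step: new abundances
    return pi, resp

lik = np.array([[0.9, 0.01],   # read 1: strongly cluster 1
                [0.8, 0.05],   # read 2: strongly cluster 1
                [0.4, 0.40]])  # read 3: ambiguous
pi, resp = em_abundances(lik)
print(np.round(pi, 2))  # dominant cluster absorbs most of the ambiguous read
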
Backmatter
Metadata
Title
Research in Computational Molecular Biology
Edited by
Benjamin J. Raphael
Copyright Year
2018
Electronic ISBN
978-3-319-89929-9
Print ISBN
978-3-319-89928-2
DOI
https://doi.org/10.1007/978-3-319-89929-9