2020 | Book

Research in Computational Molecular Biology

24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10–13, 2020, Proceedings

Edited by: Russell Schwartz

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the proceedings of the 24th Annual Conference on Research in Computational Molecular Biology, RECOMB 2020, held in Padua, Italy, in May 2020. The 13 regular and 24 short papers presented were carefully reviewed and selected from 206 submissions. The papers report on original research in all areas of computational molecular biology and bioinformatics.

Table of Contents

Frontmatter

Extended Abstracts

Frontmatter

Computing the Rearrangement Distance of Natural Genomes

The computation of genomic distances has been a very active field of computational comparative genomics over the last 25 years. Substantial results include the polynomial-time computability of the inversion distance by Hannenhalli and Pevzner in 1995 and the introduction of the double-cut and join (DCJ) distance by Yancopoulos, Attie and Friedberg in 2005. Both results, however, rely on the assumption that the genomes under comparison contain the same set of unique markers (syntenic genomic regions, sometimes also referred to as genes). In 2015, Shao, Lin and Moret relaxed this condition by allowing for duplicate markers in the analysis. This generalized version of the genomic distance problem is NP-hard, and they gave an ILP solution that is efficient enough to be applied to real-world datasets. A restriction of their approach is that it can be applied only to balanced genomes, which have equal numbers of duplicates of any marker. Therefore it still needs a delicate preprocessing of the input data in which excessive copies of unbalanced markers have to be removed. In this paper we present an algorithm solving the genomic distance problem for natural genomes, in which any marker may occur an arbitrary number of times. Our method is based on a new graph data structure, the multi-relational diagram, that allows an elegant extension of the ILP by Shao, Lin and Moret to count runs of markers that are under- or over-represented in one genome with respect to the other and need to be inserted or deleted, respectively. With this extension, previous restrictions on the genome configurations are lifted, for the first time enabling an uncompromising rearrangement analysis. Any marker sequence can directly be used for the distance calculation. The evaluation of our approach shows that it can be used to analyze genomes with up to a few tens of thousands of markers, which we demonstrate on simulated and real data.

Leonard Bohnenkämper, Marília D. V. Braga, Daniel Doerr, Jens Stoye

Deep Large-Scale Multi-task Learning Network for Gene Expression Inference

Gene expression profiling empowers many biological studies in various fields by comprehensively characterizing cellular status under different experimental conditions. Despite recent advances in high-throughput technologies, profiling the whole-genome set is still challenging and expensive. Because the expression patterns of different genes are highly correlated, this issue can be addressed by a cost-effective approach that measures only a small subset of genes, called landmark genes, as representatives of the entire genome set and estimates the remaining ones, called target genes, with a computational model. Several shallow and deep regression models have been presented in the literature for inferring the expression of target genes. However, the shallow models suffer from underfitting because they lack the capacity to capture the complex nature of gene expression data, and the existing deep models are prone to overfitting because they do not use the interrelations of target genes in the learning framework. To address these challenges, we formulate gene expression inference as a multi-task learning problem and propose a novel deep multi-task learning algorithm that automatically learns the biological interrelations among target genes and uses this information to enhance the prediction. In particular, we employ a multi-layer sub-network with low-dimensional latent variables for learning the interrelations among target genes (i.e. distinct predictive tasks), and impose a seamless and easy-to-implement regularization on deep models. Unlike conventional, complicated multi-task learning methods, which can only deal with tens or hundreds of tasks, our proposed algorithm can effectively learn the interrelations from large-scale ($$\sim$$10,000) tasks on the gene expression inference problem, and does not suffer from cost-prohibitive operations. Experimental results indicate the superiority of our method compared to the existing gene expression inference models and alternative multi-task learning algorithms on two large-scale datasets.

Kamran Ghasedi Dizaji, Wei Chen, Heng Huang

A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets

As the volume of next-generation sequencing data increases, an urgent need for algorithms to efficiently process the data arises. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, in the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hit every sequence of length L, and can thus serve as indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g. $$k > 13$$). To address this bottleneck, we present two algorithmic innovations to significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles $$k > 13$$. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves runtime and memory usage by orders of magnitude over the current best algorithms. We expect our newly-practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.

Barış Ekim, Bonnie Berger, Yaron Orenstein
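
The definition in the abstract above (a set of k-mers that hits every sequence of length L) can be checked by brute force for very small parameters. The following is only an illustrative sketch, not PASHA: the function name `is_universal_hitting_set` and the `alphabet` parameter are assumptions made for this example, and the enumeration is exponential in L.

```python
from itertools import product


def is_universal_hitting_set(kmers, k, L, alphabet="ACGT"):
    """Brute-force check that every string of length L contains a k-mer from `kmers`.

    Hypothetical helper for illustration only: it enumerates all
    len(alphabet)**L strings, so it is usable only for tiny k and L.
    """
    kmers = set(kmers)
    for chars in product(alphabet, repeat=L):
        seq = "".join(chars)
        # A sequence is "hit" if at least one of its k-mers is in the set.
        if not any(seq[i:i + k] in kmers for i in range(L - k + 1)):
            return False
    return True
```

On a two-letter alphabet with k = 2 and L = 3, the set {AA, AB, BB} hits every sequence although it omits BA, whereas {AB, BA} misses the sequence AAA.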

Multiple Competition-Based FDR Control and Its Application to Peptide Detection

Competition-based FDR control has been commonly used for over a decade in the computational mass spectrometry community [5]. Recently, the approach has gained significant popularity in other fields after Barber and Candès laid its theoretical foundation in a more general setting that included the feature selection problem [1]. In both cases, the competition is based on a head-to-head comparison between an observed score and a corresponding decoy/knockoff. We recently demonstrated some advantages of using multiple decoys rather than a single decoy when addressing the problem of assigning peptide sequences to observed mass spectra [17]. In this work, we consider a related problem, detecting peptides based on a collection of mass spectra, and we develop a new framework for competition-based FDR control using multiple null scores. Within this framework, we offer several methods, all of which are based on a novel procedure that rigorously controls the FDR in the finite sample setting. Using real data to study the peptide detection problem, we show that, relative to existing single-decoy methods, our approach can increase the number of discovered peptides by up to 50% at small FDR thresholds.

Kristen Emery, Syamand Hasam, William Stafford Noble, Uri Keich
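
For orientation, the head-to-head competition described in the abstract above can be sketched for the single-decoy case. This is the commonly used baseline with a +1 correction in the FDR estimate, not the multiple-decoy procedure the paper introduces; the function name `tdc_discoveries` and the tie handling are assumptions of this sketch.

```python
def tdc_discoveries(target_scores, decoy_scores, alpha):
    """Single-decoy target-decoy competition (illustrative baseline only).

    Each hypothesis i has one target score and one decoy score. Returns the
    indices of target wins reported at estimated FDR level `alpha`.
    """
    # Head-to-head competition: keep the larger score and record who won.
    winners = []
    for i, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        if t > d:
            winners.append((t, True, i))
        elif d > t:
            winners.append((d, False, i))
        # Exact ties are dropped here; this is a simplification of the sketch.
    winners.sort(key=lambda w: w[0], reverse=True)

    # Find the largest cutoff whose estimated FDR, (decoy wins + 1) / target wins,
    # does not exceed alpha.
    best_cut = 0
    decoy_wins = target_wins = 0
    for rank, (_, is_target, _) in enumerate(winners, start=1):
        if is_target:
            target_wins += 1
        else:
            decoy_wins += 1
        if (decoy_wins + 1) / max(target_wins, 1) <= alpha:
            best_cut = rank
    return [i for _, is_target, i in winners[:best_cut] if is_target]
```

The +1 in the numerator makes the estimate conservative, so with few hypotheses a small threshold may yield no discoveries at all.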

Supervised Adversarial Alignment of Single-Cell RNA-seq Data

Dimensionality reduction is an important first step in the analysis of single-cell RNA-seq (scRNA-seq) data. In addition to enabling the visualization of the profiled cells, such representations are used by many downstream analysis methods ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and labs. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq data. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch-specific rather than cell-type-specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to simultaneously optimize two objectives. The first is the accuracy of cell type assignment and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different datasets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.

Songwei Ge, Haohan Wang, Amir Alavi, Eric Xing, Ziv Bar-Joseph

Bagging MSA Learning: Enhancing Low-Quality PSSM with Deep Learning for Accurate Protein Structure Property Prediction

Accurate predictions of protein structure properties, e.g. secondary structure and solvent accessibility, are essential in analyzing the structure and function of a protein. PSSM (Position-Specific Scoring Matrix) features are widely used in structure property prediction. However, some proteins may have low-quality PSSM features due to insufficient homologous sequences, leading to limited prediction accuracy. To address this limitation, we propose an enhancing scheme for PSSM features. We introduce the “Bagging MSA” method to calculate PSSM features used to train our model, adopt a convolutional network to capture local context features and a bidirectional LSTM for long-term dependencies, and integrate them under an unsupervised framework. Structure property prediction models are then built upon such enhanced PSSM features for more accurate predictions. Empirical evaluation on the CB513, CASP11, and CASP12 datasets indicates that our unsupervised enhancing scheme indeed generates more informative PSSM features for structure property prediction.

Yuzhi Guo, Jiaxiang Wu, Hehuan Ma, Sheng Wang, Junzhou Huang

Open Access

AStarix: Fast and Optimal Sequence-to-Graph Alignment

We present an algorithm for the optimal alignment of sequences to genome graphs. It works by phrasing the edit distance minimization task as finding a shortest path on an implicit alignment graph. To find a shortest path, we instantiate the A$$^\star$$ paradigm with a novel domain-specific heuristic function that accounts for the upcoming subsequence in the query to be aligned, resulting in a provably optimal alignment algorithm called AStarix. Experimental evaluation of AStarix shows that it is 1–2 orders of magnitude faster than state-of-the-art optimal algorithms on the task of aligning Illumina reads to reference genome graphs. Implementations and evaluations are available at https://github.com/eth-sri/astarix.

Pesho Ivanov, Benjamin Bichsel, Harun Mustafa, André Kahles, Gunnar Rätsch, Martin Vechev

Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

Phylogenomics, the estimation of species trees from multi-locus datasets, is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.

Brandon Legried, Erin K. Molloy, Tandy Warnow, Sébastien Roch

RoboCOP: Multivariate State Space Model Integrating Epigenomic Accessibility Data to Elucidate Genome-Wide Chromatin Occupancy

Chromatin is the tightly packaged structure of DNA and protein within the nucleus of a cell. The arrangement of different protein complexes along the DNA modulates and is modulated by gene expression. Measuring the binding locations and level of occupancy of different transcription factors (TFs) and nucleosomes is therefore crucial to understanding gene regulation. Antibody-based methods for assaying chromatin occupancy are capable of identifying the binding sites of specific DNA binding factors, but only one factor at a time. On the other hand, epigenomic accessibility data like ATAC-seq, DNase-seq, and MNase-seq provide insight into the chromatin landscape of all factors bound along the genome, but with minimal insight into the identities of those factors. Here, we present RoboCOP, a multivariate state space model that integrates chromatin information from epigenomic accessibility data with nucleotide sequence to compute genome-wide probabilistic scores of nucleosome and TF occupancy, for hundreds of different factors at once. RoboCOP can be applied to any epigenomic dataset that provides quantitative insight into chromatin accessibility in any organism, but here we apply it to MNase-seq data to elucidate the protein-binding landscape of nucleosomes and 150 TFs across the yeast genome. Using available protein-binding datasets from the literature, we show that our model more accurately predicts the binding of these factors genome-wide.

Sneha Mitra, Jianling Zhong, David M. MacAlpine, Alexander J. Hartemink

Representation of $$k$$-mer Sets Using Spectrum-Preserving String Sets

Given the popularity and elegance of $$k$$-mer based tools, finding a space-efficient way to represent a set of $$k$$-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of $$k$$-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of $$k$$-mers using an order of magnitude less disk space than other lossless compression tools. The second one is an exact static $$k$$-mer membership index, UST-FM, which we show improves index size by 10–44% compared to other state-of-the-art low-memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.

Amatur Rahman, Paul Medvedev
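
As a toy illustration of the idea in the abstract above, the sketch below compares a plain list of k-mers with a shorter string set that has the same k-mer spectrum. It is not UST: the helper names `spectrum` and `weight` are assumptions of this example, size is measured as total character count, and reverse complements are ignored.

```python
def spectrum(strings, k):
    """Set of all k-mers that occur in a collection of strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def weight(strings):
    """Total number of characters; the size measure assumed in this sketch."""
    return sum(len(s) for s in strings)


k = 3
kmers = {"ACG", "CGT", "GTA", "TAC", "GTT"}

# Overlapping k-mers are merged into longer strings (chosen by hand here).
candidate = ["ACGTAC", "GTT"]

same_spectrum = spectrum(candidate, k) == kmers
print(same_spectrum, weight(kmers), weight(candidate))
```

Merging four overlapping k-mers into one string of length 6 is what saves space: the k-mer list costs 15 characters, the candidate set 9.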

NetMix: A Network-Structured Mixture Model for Reduced-Bias Estimation of Altered Subnetworks

A classic problem in computational biology is the identification of altered subnetworks: subnetworks of an interaction network that contain genes/proteins that are differentially expressed, highly mutated, or otherwise aberrant compared to other genes/proteins. Numerous methods have been developed to solve this problem under various assumptions, but the statistical properties of these methods are often unknown. For example, some widely-used methods are reported to output very large subnetworks that are difficult to interpret biologically. In this work, we formulate the identification of altered subnetworks as the problem of estimating the parameters of a class of probability distributions which we call the Altered Subset Distribution (ASD). We derive a connection between a popular method, jActiveModules, and the maximum likelihood estimator (MLE) of the ASD. We show that the MLE is statistically biased, explaining the large subnetworks output by jActiveModules. We introduce NetMix, an algorithm that uses Gaussian mixture models to obtain less biased estimates of the parameters of the ASD. We demonstrate that NetMix outperforms existing methods in identifying altered subnetworks on both simulated and real data, including the identification of differentially expressed genes from both microarray and RNA-seq experiments and the identification of cancer driver genes in somatic mutation data.

Matthew A. Reyna, Uthsav Chitra, Rebecca Elyanow, Benjamin J. Raphael

Stochastic Sampling of Structural Contexts Improves the Scalability and Accuracy of RNA 3D Module Identification

RNA structures possess multiple levels of structural organization. Secondary structures are made of canonical (i.e. Watson-Crick and Wobble) helices, connected by loops whose local conformations are critical determinants of global 3D architectures. Such local 3D structures consist of conserved sets of non-canonical base pairs, called RNA modules. Their prediction from sequence data is thus a milestone toward 3D structure modelling. Unfortunately, the computational efficiency and scope of the current 3D module identification methods are still too limited to benefit from all the knowledge accumulated in module databases. Here, we introduce BayesPairing 2, a new sequence search algorithm that leverages secondary structure tree decomposition to reduce the computational complexity and improve predictions on new sequences. We benchmarked our methods on 75 modules and 6380 RNA sequences, and report accuracies that are comparable to the state of the art, with considerable running time improvements. When identifying 200 modules on a single sequence, BayesPairing 2 is over 100 times faster than its previous version, opening new doors for genome-wide applications.

Roman Sarrazin-Gendron, Hua-Ting Yao, Vladimir Reinharz, Carlos G. Oliver, Yann Ponty, Jérôme Waldispühl

Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length

Universal hitting sets are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between universal hitting sets and minimizers schemes, where minimizers schemes with low density (i.e., efficient schemes) correspond to universal hitting sets of small size. Local schemes are a generalization of minimizers schemes that can be used as a replacement for minimizers schemes, with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a universal hitting set. We give bounds for the remaining path length of the Mykkeltveit universal hitting set. Additionally, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.

Hongyu Zheng, Carl Kingsford, Guillaume Marçais
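
To make the notion of density in the abstract above concrete, here is a minimal sketch of a minimizers scheme with a lexicographic k-mer ordering. It is not the local scheme constructed in the paper; the function names, the choice of ordering, and the leftmost tie-breaking rule are assumptions of this example.

```python
def minimizers(seq, k, w):
    """Start positions of the selected k-mers of a lexicographic minimizers scheme.

    Each window of w consecutive k-mers (w + k - 1 characters) selects its
    smallest k-mer; ties are broken in favour of the leftmost occurrence.
    """
    picked = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        picked.add(min(window)[1])  # tuple comparison: k-mer first, then position
    return sorted(picked)


def density(seq, k, w):
    """Fraction of k-mer positions in `seq` that the scheme selects."""
    return len(minimizers(seq, k, w)) / (len(seq) - k + 1)
```

For the sequence ACGTACGT with k = 2 and w = 3, positions 0, 1 and 4 are selected, so 3 of the 7 k-mers are kept; a lower density means fewer selected positions while every window still contains one.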

Short Papers

Frontmatter

Strain-Aware Assembly of Genomes from Mixed Samples Using Flow Variation Graphs

The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains.

Jasmijn A. Baaijens, Leen Stougie, Alexander Schönhuth

Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments

We present Spectral Jaccard Similarity, a technique that combines min-hashing and spectral methods in order to efficiently estimate pairwise alignment between genomic reads.

Tavor Z. Baharav, Govinda M. Kamath, David N. Tse, Ilan Shomorony
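
The min-hashing ingredient mentioned in the abstract above can be sketched as follows. This shows only a plain min-hash estimate of the Jaccard similarity between the k-mer sets of two reads; the spectral step of the paper is not reproduced, and the function names and the use of seeded BLAKE2b hashes are assumptions of this example.

```python
import hashlib


def kmer_set(seq, k):
    """All k-mers of a read (the read must have length at least k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(kmers, num_hashes):
    """Minimum hash value of the set under each of `num_hashes` hash functions."""
    return [min(_hash(x, seed) for x in kmers) for seed in range(num_hashes)]


def jaccard_estimate(sig_a, sig_b):
    """Fraction of hash functions whose minima agree; estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two reads would be compared with `jaccard_estimate(minhash_signature(kmer_set(a, k), n), minhash_signature(kmer_set(b, k), n))`; a larger number of hash functions `n` reduces the variance of the estimate.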

MosaicFlye: Resolving Long Mosaic Repeats Using Long Reads

Although long-read sequencing technologies opened a new era in genome assembly, the problem of resolving unbridged repeats (i.e., repeats that are not spanned by any reads) such as long segmental duplications in the human genome remains largely unsolved, making it a major obstacle towards achieving the goal of complete genome assemblies.

Anton Bankevich, Pavel Pevzner

Bayesian Non-parametric Clustering of Single-Cell Mutation Profiles

Cancer is an evolutionary process characterized by the accumulation of mutations that drive tumor initiation, progression, and treatment resistance.

Nico Borgsmüller, Jose Bonet, Francesco Marass, Abel Gonzalez-Perez, Nuria Lopez-Bigas, Niko Beerenwinkel

PaccMannRL: Designing Anticancer Drugs From Transcriptomic Data via Reinforcement Learning

The pharmaceutical industry has experienced a significant productivity decline: Less than 0.01% of drug candidates obtain market approval, with an estimated 10–15 years until market release and costs that range between one [2] and three billion dollars per drug [3].

Jannis Born, Matteo Manica, Ali Oskooei, Joris Cadow, María Rodríguez Martínez

CluStrat: A Structure Informed Clustering Strategy for Population Stratification

Genome-wide association studies (GWAS) have been extensively used to estimate the signed effects of trait-associated alleles. One of the key challenges in GWAS is confounding factors, such as population stratification, which can lead to spurious genotype-trait associations. Recent independent studies [1, 8, 10] failed to replicate the strong evidence of previously reported signals of directional selection on height in Europeans in the UK Biobank cohort, and attributed the loss of signal to cryptic relatedness in populations.

Aritra Bose, Myson C. Burch, Agniva Chowdhury, Peristera Paschou, Petros Drineas

PWAS: Proteome-Wide Association Study

Over the last two decades, Genome-Wide Association Study (GWAS) has become a canonical tool for exploratory genetic research, generating countless gene-phenotype associations. Despite its accomplishments, several limitations and drawbacks still hinder its success, including low statistical power and obscurity about the causality of implicated variants. We introduce PWAS (Proteome-Wide Association Study), a new method for detecting protein-coding genes associated with phenotypes through protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein’s function using machine-learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest. By collecting the genetic signal across many variants in light of their rich proteomic context, PWAS can detect subtle patterns that standard GWAS and other methods overlook. It can also capture more complex modes of heritability, including recessive inheritance. Furthermore, the discovered associations are supported by a concrete molecular model, thus reducing the gap to inferring causality. To demonstrate its applicability for a wide range of human traits, we applied PWAS to a cohort derived from the UK Biobank (~330K individuals) and evaluated it on 49 prominent phenotypes. 23% of the significant PWAS associations on that cohort (2,998 of 12,896) were missed by standard GWAS. A comparison of PWAS with existing methods demonstrates its capacity to recover causal protein-coding genes and to highlight new associations with plausible biological mechanisms.

Nadav Brandes, Nathan Linial, Michal Linial

Estimating the Rate of Cell Type Degeneration from Epigenetic Sequencing of Cell-Free DNA

Cells die at different rates as a function of disease state, age, environmental exposure, and behavior [8, 10]. Knowing the rate at which cells die is a fundamental scientific question, with direct translational applicability. A quantifiable indication of cell death could facilitate disease diagnosis and prognosis, prioritize patients for admission into clinical trials, and improve evaluations of treatment efficacy and disease progression [1, 4, 14, 16]. Circulating cell-free DNA (cfDNA) in the bloodstream originates from dying cells and is a promising non-invasive biomarker for cell death.

Christa Caggiano, Barbara Celona, Fleur Garton, Joel Mefford, Brian Black, Catherine Lomen-Hoerth, Andrew Dahl, Noah Zaitlen

Potpourri: An Epistasis Test Prioritization Algorithm via Diverse SNP Selection

Genome-wide association studies (GWAS) have been an important tool for susceptibility gene discovery in genetic disorders, and investigating the interplay among multiple loci has played an important role. Such an interaction between two or more loci is called epistasis, and it has a major role in complex genetic traits.

Gizem Caylak, A. Ercument Cicek

Privacy-Preserving Biomedical Database Queries with Optimal Privacy-Utility Trade-Offs

Sharing data across research groups is an essential driver of biomedical research. In particular, biomedical databases with interactive query-answering systems allow users to retrieve information from the database using restricted types of queries. For example, medical data repositories allow researchers designing clinical studies to query how many patients in the database satisfy certain criteria, a workflow known as cohort discovery. In addition, genomic “beacon” services allow users to query whether or not a given genetic variant is observed in the database, a workflow we refer to as variant lookup. While these systems aim to facilitate the sharing of aggregate biomedical insights without divulging sensitive individual-level data, they can still leak private information about the individuals through the query answers. To address these privacy concerns, existing studies have proposed to perturb query results with a small amount of noise in order to reduce sensitivity to underlying individuals [1, 2]. However, these existing efforts either lack rigorous guarantees of privacy or introduce an excessive amount of noise into the system, limiting their effectiveness in practice.

Hyunghoon Cho, Sean Simmons, Ryan Kim, Bonnie Berger

Iterative Refinement of Cellular Identity from Single-Cell Data Using Online Learning

Recent experimental advances have enabled high-throughput single-cell measurement of gene expression, chromatin accessibility and DNA methylation. We previously employed integrative non-negative matrix factorization (iNMF) to jointly align multiple single-cell datasets ($$X_i$$) and learn interpretable low-dimensional representations using dataset-specific ($$V_i$$) and shared metagene factors ($$W$$) and cell factor loadings ($$H_i$$). We developed an alternating nonnegative least squares (ANLS) algorithm to solve the iNMF optimization problem [2]:

Chao Gao, Joshua D. Welch

A Guided Network Propagation Approach to Identify Disease Genes that Combines Prior and New Information

Summary. A major challenge in biomedical data science is to identify the causal genes underlying complex genetic diseases. Despite the massive influx of genome sequencing data, identifying disease-relevant genes remains difficult as individuals with the same disease may share very few, if any, genetic variants.

Borislav H. Hristov, Bernard Chazelle, Mona Singh

A Scalable Method for Estimating the Regional Polygenicity of Complex Traits

A key question in human genetics is understanding the proportion of SNPs modulating a particular phenotype or the proportion of susceptibility SNPs for a disease, termed polygenicity. Previous studies have observed that complex traits tend to be highly polygenic, opposing the previous belief that only a handful of SNPs contribute to a trait [1–4]. Beyond these genome-wide estimates, the distribution of polygenicity across genomic regions as well as the genomic factors that affect regional polygenicity remain poorly understood.

Ruth Johnson, Kathryn S. Burch, Kangcheng Hou, Mario Paciuc, Bogdan Pasaniuc, Sriram Sankararaman

Efficient and Accurate Inference of Microbial Trajectories from Longitudinal Count Data

The human body is home to trillions of microbial cells that play an essential role in health and disease [2]. The gut microbiome, for instance, is responsible for a variety of normal physiological processes such as the regulation of immune response and breakdown of xenobiotics [3]. Disturbances in gut communities have been associated with several diseases, notably obesity [7] and colitis [8]. Moreover, changes to the vaginal microbiome during pregnancy are associated with risk of preterm birth [4]. Consequently, investigating the human microbiome can provide insight into biological processes and the etiology of disease.

Tyler A. Joseph, Amey P. Pasarkar, Itsik Pe’er

Identifying Causal Variants by Fine Mapping Across Multiple Studies

Genome-Wide Association Studies (GWAS) have successfully identified numerous genetic variants associated with a variety of complex traits in humans. However, most of these associated variants are not causal, and are simply in Linkage Disequilibrium (LD) with the true causal variants. This problem is addressed by statistical “fine mapping” methods, which attempt to prioritize a small subset of variants for further testing while accounting for LD structure [1]. CAVIAR [2] introduced a widely-adopted Bayesian approach that accounted for uncertainty in association statistics using a multivariate normal (MVN) model and allowed for potentially multiple causal SNPs at a locus. There is growing interest in improving fine-mapping by leveraging information from multiple studies. One example of this is trans-ethnic fine mapping, which can significantly improve fine mapping power and resolution by leveraging the distinct LD structures in each population. However, existing methods either assume a single causal SNP at each locus or do not explicitly model heterogeneity, limiting their power.

Nathan LaPierre, Kodi Taraszka, Helen Huang, Rosemary He, Farhad Hormozdiari, Eleazar Eskin

MONN: A Multi-objective Neural Network for Predicting Pairwise Non-covalent Interactions and Binding Affinities Between Compounds and Proteins

Background. Computational approaches for inferring the mechanisms of compound-protein interactions (CPIs) can greatly facilitate drug development. Recently, although a number of deep learning based methods have been proposed to predict binding affinities of CPIs and attempt to capture local interaction sites in compounds and proteins through neural attentions, they still lack a systematic evaluation on the interpretability of the identified local features [1–3]. In this work, we constructed the first benchmark dataset containing the pairwise inter-molecular non-covalent interactions for more than 10,000 compound-protein pairs. Our comprehensive evaluation suggested that current neural attention based approaches have difficulty in automatically capturing the accurate local non-covalent interactions between compounds and proteins.

Shuya Li, Fangping Wan, Hantao Shu, Tao Jiang, Dan Zhao, Jianyang Zeng

Evolutionary Context-Integrated Deep Sequence Modeling for Protein Engineering

Protein engineering seeks to design proteins with improved or novel functions. Compared to rational design and directed evolution approaches, machine learning-guided approaches traverse the fitness landscape more effectively and hold promise for accelerating engineering and reducing experimental cost and effort.

Yunan Luo, Lam Vo, Hantian Ding, Yufeng Su, Yang Liu, Wesley Wei Qian, Huimin Zhao, Jian Peng

Log Transformation Improves Dating of Phylogenies

The level of divergence between species represented by sequence data is a function of unknown times and mutation rates. Therefore, sequence data do not reveal the exact timing of evolutionary events, and inferred phylogenies often have branch lengths estimated in units of the expected number of substitutions. Dating a phylogeny is the process of translating branch lengths from these units into units of time. Such a process requires soft or hard constraints on the timing of some nodes and infers the divergence times of the remaining nodes.

Uyen Mai, Siavash Mirarab
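
The unit conversion described in the dating abstract above can be illustrated with a minimal sketch. It assumes a strict molecular clock (one constant rate for every branch), which is a deliberate simplification and not the log-transformation method of the paper. The function names, the toy branch lengths, and the calibration age are all hypothetical.

```python
def rate_from_calibration(root_to_tip_subs, calibrated_age):
    """Estimate a single clock rate (substitutions per site per time unit)
    from one calibrated node: the distance to its tips divided by its age.
    Assumes a strict clock; hypothetical helper for illustration only."""
    if calibrated_age <= 0:
        raise ValueError("calibrated_age must be positive")
    return root_to_tip_subs / calibrated_age


def strict_clock_times(branch_lengths, rate):
    """Convert branch lengths from expected substitutions per site
    to time units by dividing each length by the assumed constant rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return {name: length / rate for name, length in branch_lengths.items()}


# Toy tree ((A,B),C): branch lengths in expected substitutions per site.
# Every root-to-tip path sums to 0.05 (hypothetical, clock-like data).
branches = {"A": 0.02, "B": 0.02, "AB": 0.03, "C": 0.05}

# Hard constraint assumed for illustration: the root is 10 time units old.
rate = rate_from_calibration(root_to_tip_subs=0.05, calibrated_age=10.0)
times = strict_clock_times(branches, rate)

for name, duration in times.items():
    print(f"branch {name}: {duration:.2f} time units")
```

Real data rarely follow a strict clock, which is why dating methods must also model rate variation across branches rather than dividing by a single rate as this sketch does.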

Reconstructing Genotypes in Private Genomic Databases from Genetic Risk Scores

Genomic researchers are already aware that some forms of aggregate data from their databases should not be released publicly, because there is a risk that an attacker may be able to determine whether a particular individual is a member of the database (a membership inference attack).

Brooks Paige, James Bell, Aurélien Bellet, Adrià Gascón, Daphne Ezer

d-PBWT: Dynamic Positional Burrows-Wheeler Transform

Durbin’s positional Burrows-Wheeler transform (PBWT) [1] is a scalable foundational data structure for modeling population haplotype sequences. It offers efficient algorithms for matching haplotypes that approach theoretically optimal complexity.

Ahsan Sanaullah, Degui Zhi, Shaojie Zhang
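
As background for the d-PBWT entry above, the following is a minimal sketch of the core step of the static PBWT: building the positional prefix arrays, in which haplotypes at each site are ordered by their reversed prefixes. It covers only the standard static construction for biallelic (0/1) data, not the dynamic insertion and deletion operations that the paper adds. The function name and the toy panel are hypothetical.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return the positional prefix arrays of a 0/1 haplotype panel.

    arrays[k] lists haplotype indices sorted by their reversed prefixes
    of length k (ties keep the original index order). Each update is a
    stable partition of the previous order by the allele at site k,
    so the whole construction takes time linear in the panel size.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    order = list(range(m))      # before any site, keep the input order
    arrays = [order]
    for k in range(n):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones    # stable: earlier sites break ties
        arrays.append(order)
    return arrays


# Toy panel of four haplotypes over three biallelic sites (hypothetical).
panel = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]

for k, order in enumerate(pbwt_prefix_arrays(panel)):
    print(f"after {k} sites: {order}")
```

Because haplotypes that share a long match ending at site k end up adjacent in the array for that site, matching queries can scan neighbours instead of comparing all pairs.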

A Mixture Model for Signature Discovery from Sparse Mutation Data

Mutational signatures and their exposures are key to understanding the processes that shape cancer genomes, with applications to diagnosis and treatment. Yet current signature discovery or refitting approaches are limited to relatively rich mutation data that come from whole-genome or whole-exome sequencing. Recently, orders of magnitude sparser data sets from gene panel sequencing have become increasingly available in the clinical setting. Such data typically have fewer than 10 mutations per sample, making them challenging to deal with using current approaches. Here we suggest a novel mixture model for sparse mutation data. In application to synthetic sparse datasets and real gene panel sequences, it is shown to outperform current approaches and to yield mutational signatures and patient stratifications that are in higher agreement with the literature.

Itay Sason, Yuexi Chen, Mark D. M. Leiserson, Roded Sharan

Single-Cell Tumor Phylogeny Inference with Copy-Number Constrained Mutation Losses

Motivation: Single-cell DNA sequencing enables the measurement of somatic mutations in individual tumor cells, and provides data to reconstruct the evolutionary history of the tumor.

Gryte Satas, Simone Zaccaria, Geoffrey Mon, Benjamin J. Raphael

Reconstruction of Gene Regulatory Networks by Integrating Biological Model and a Recommendation System

Gene Regulatory Networks (GRNs) control many aspects of cellular processes including cell differentiation, maintenance of cell type specific states, signal transduction, and response to stress. Since GRNs provide information that is essential for understanding cell function, the inference of these networks is one of the key challenges in systems biology.

Yijie Wang, Justin M. Fear, Isabelle Berger, Hangnoh Lee, Brian Oliver, Teresa M. Przytycka

Probing Multi-way Chromatin Interaction with Hypergraph Representation Learning

Advances in high-throughput mapping of 3D genome organization have enabled genome-wide characterization of chromatin interactions. However, proximity-ligation-based mapping approaches for pairwise chromatin interactions, such as Hi-C, cannot capture multi-way interactions, which are informative for delineating higher-order genome organization and gene regulation mechanisms at single-nucleus resolution.

Ruochi Zhang, Jian Ma

Backmatter

Title: Research in Computational Molecular Biology
Edited by: Russell Schwartz
Publisher: Springer International Publishing
Electronic ISBN: 978-3-030-45257-5
Print ISBN: 978-3-030-45256-8
DOI: https://doi.org/10.1007/978-3-030-45257-5