Skip to main content

2022 | Buch

Research in Computational Molecular Biology

26th Annual International Conference, RECOMB 2022, San Diego, CA, USA, May 22–25, 2022, Proceedings


Über dieses Buch

This book constitutes the proceedings of the 26th Annual Conference on Research in Computational Molecular Biology, RECOMB 2022, held in San Diego, CA, USA in May 2022. The 17 regular and 23 short papers presented were carefully reviewed and selected from 188 submissions. The papers report on original research in all areas of computational molecular biology and bioinformatics.



Extended Abstracts

Unsupervised Integration of Single-Cell Multi-omics Datasets with Disproportionate Cell-Type Representation

Integrated analysis of multi-omics data allows the study of how different molecular views in the genome interact to regulate cellular processes; however, with a few exceptions, applying multiple sequencing assays on the same single cell is not possible. While recent unsupervised algorithms align single-cell multi-omic datasets, these methods have been primarily benchmarked on co-assay experiments rather than the more common single-cell experiments taken from separately sampled cell populations. Therefore, most existing methods perform subpar alignments on such datasets. Here, we improve our previous work Single Cell alignment using Optimal Transport (SCOT) by using unbalanced optimal transport to handle disproportionate cell-type representation and differing sample sizes across single-cell measurements. We show that our proposed method, SCOTv2, consistently yields quality alignments on five real-world single-cell datasets with varying cell-type proportions and is computationally tractable. Additionally, we extend SCOTv2 to integrate multiple ( $$M\ge 2$$ M ≥ 2 ) single-cell measurements and present a self-tuning heuristic process to select hyperparameters in the absence of any orthogonal correspondence information.Available at: .

Pınar Demetçi, Rebecca Santorella, Björn Sandstede, Ritambhara Singh
Semi-supervised Single-Cell Cross-modality Translation Using Polarbear

The emergence of single-cell co-assays enables us to learn to translate between single-cell modalities, potentially offering valuable insights from datasets where only one modality is available. However, the sparsity of single-cell measurements and the limited number of cells measured in typical co-assay datasets impedes the power of cross-modality translation. Here, we propose Polarbear, a semi-supervised translation framework to predict cross-modality profiles that is trained using a combination of co-assay data and traditional “single-assay” data. Polarbear uses single-assay and co-assay data to train an autoencoder for each modality and then uses just the co-assay data to train a translator between the embedded representations learned by the autoencoders. With this approach, Polarbear is able to translate between modalities with improved accuracy relative to state-of-the-art translation techniques. As an added benefit of the training procedure, we show that Polarbear also produces a matching of cells across modalities.

Ran Zhang, Laetitia Meng-Papaxanthos, Jean-Philippe Vert, William Stafford Noble
Transcription Factor-Centric Approach to Identify Non-recurring Putative Regulatory Drivers in Cancer

Recent efforts to sequence the genomes of thousands of matched normal-tumor samples have led to the identification of millions of somatic mutations, the majority of which are non-coding. Most of these mutations are believed to be passengers, but a small number of non-coding mutations could contribute to tumor initiation or progression, e.g. by leading to dysregulation of gene expression. Efforts to identify putative regulatory drivers rely primarily on information about the recurrence of mutations across tumor samples. However, in regulatory regions of the genome, individual mutations are rarely seen in more than one donor. Instead of using recurrence information, here we present a method to prioritize putative regulatory driver mutations based on the magnitude of their effects on transcription factor-DNA binding. For each gene, we integrate the effects of mutations across all its regulatory regions, and we ask whether these effects are larger than expected by chance, given the mutation spectra observed in regulatory DNA in the cohort of interest. We applied our approach to analyze mutations in a liver cancer data set with ample somatic mutation and gene expression data available. By combining the effects of mutations across all regulatory regions of each gene, we identified dozens of genes whose regulation in tumor cells is likely to be significantly perturbed by non-coding mutations. Overall, our results show that focusing on the functional effects of non-coding mutations, rather than their recurrence, has the potential to prioritize putative regulatory drivers and the genes they dysregulate in tumor cells.

Jingkang Zhao, Vincentius Martin, Raluca Gordân
DeepMinimizer: A Differentiable Framework for Optimizing Sequence-Specific Minimizer Schemes

Minimizers are k-mer sampling schemes designed to generate sketches for large sequences that preserve sufficiently long matches between sequences. Despite their widespread application, learning an effective minimizer scheme with optimal sketch size is still an open question. Most work in this direction focuses on designing schemes that work well on expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, require greedy approximations to solve an intractable discrete optimization problem on the permutation space of k-mer orderings. To address this challenge, we propose: (a) a reformulation of the combinatorial solution space using a deep neural network re-parameterization; and (b) a fully differentiable approximation of the discrete objective. We demonstrate that our framework, DeepMinimizer, discovers minimizer schemes that significantly outperform state-of-the-art constructions on genomic sequences.

Minh Hoang, Hongyu Zheng, Carl Kingsford
MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs

Metagenomics has allowed us to obtain various genetic material from different species and gain valuable insights into microbial communities. Binning plays an important role in the early stages of metagenomic analysis pipelines. A typical pipeline in metagenomics binning is to assemble short reads into longer contigs and then bin into groups representing different species in the metagenomic sample. While existing binning tools bin metagenomic contigs, they do not make use of the assembly graphs that produce such assemblies. Here we propose MetaCoAG, a tool that utilizes assembly graphs with the composition and coverage information to bin metagenomic contigs. MetaCoAG uses single-copy marker genes to estimate the number of initial bins, assigns contigs into bins iteratively and adjusts the number of bins dynamically throughout the binning process. Experimental results on simulated and real datasets demonstrate that MetaCoAG significantly outperforms state-of-the-art binning tools, producing similar or more high-quality bins than the second-best tool. To the best of our knowledge, MetaCoAG is the first stand-alone contig-binning tool to make direct use of the assembly graph information.Availability: MetaCoAG is freely available at .

Vijini Mallawaarachchi, Yu Lin
A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World

Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.

Agniva Chowdhury, Aritra Bose, Samson Zhou, David P. Woodruff, Petros Drineas
Gene Set Priorization Guided by Regulatory Networks with p-values through Kernel Mixed Model

The transcriptome association study has helped prioritize many causal genes for detailed study and thus further helped the development of many therapeutic strategies for multiple diseases. How- ever, prioritizing the causal gene only does not seem always to be able to offer sufficient guidance to the downstream analysis. Thus, in this paper, we propose to perform the association studies from another perspective: we aim to prioritize genes with a tradeoff between the pursuit of the causality evidence and the interest of the genes in the pathway. We introduce a new method for transcriptome association study by incorporating the information of gene regulatory networks. In addition to directly building the regularization into variable selection methods, we also expect the method to report p-values of the associated genes so that these p-values have been empirically proved trustworthy by geneticists. Thus, we introduce a high-dimension variable selection method with the following two merits: it has a flexible modeling power that allows the domain experts to consider the structure of covariates so that prior knowledge, such as the gene regulatory network, can be integrated; it also calculates the p-value, with a practical manner widely accepted by geneticists, so that the identified covariates can be directly assessed with statistical guarantees. With simulations, we demonstrate the empirical strength of our method against other high-dimension variable selection methods. We further apply our method to Alzheimer’s disease, and our method identifies interesting sets of genes.

Haohan Wang, Oscar L. Lopez, Wei Wu, Eric P. Xing
Real-Valued Group Testing for Quantitative Molecular Assays

Combinatorial group testing and compressed sensing both focus on recovering a sparse vector of dimensionality n from a much smaller number $$m<n$$ m < n of measurements. In the first approach, the problem is defined over the Boolean field – the goal is to recover a Boolean vector and measurements are Boolean; in the second approach, the unknown vector and the measurements are over the reals. Here, we focus on real-valued group testing setting that more closely fits modern testing protocols relying on quantitative measurements, such as qPCR, where the goal is recovery of a sparse, Boolean vector and the pooling matrix needs to be Boolean and sparse, but the unknown input signal vector and the measurement outcomes are nonnegative reals, and the matrix algebra implied in the test protocol is over the reals. With the recent renewed interest in group testing, focus has been on quantitative measurements resulting from qPCR, but the method proposed for sample pooling were based on matrices designed with Boolean measurements in mind. Here, we investigate constructing pooling matrices dedicated for the real-valued group testing. We provide conditions for pooling matrices to guarantee unambiguous decoding of positives in this setting. We also show a deterministic algorithm for constructing matrices meeting the proposed condition, for small matrix sizes that can be implemented using a laboratory robot. Using simulated data, we show that the proposed approach leads to matrices that can be applied for higher positivity rates than combinatorial group testing matrices considered for viral testing previously. We also validate the approach through wet lab experiments involving SARS-CoV-2 nasopharyngeal swab samples.

Seyran Saeedi, Myrna Serrano, Dennis G. Yang, J. Paul Brooks, Gregory A. Buck, Tomasz Arodz
On the Effect of Intralocus Recombination on Triplet-Based Species Tree Estimation

We consider species tree estimation from multiple loci subject to intralocus recombination. We focus on $$R^{*}$$ R ∗ , a summary coalescent-based method using rooted triplets. We demonstrate analytically that intralocus recombination gives rise to an inconsistency zone, in which correct inference is not assured even in the limit of infinite amount of data. In addition, we validate and characterize this inconsistency zone through a simulation study that suggests that differential rates of recombination between closely related taxa can amplify the effect of incomplete lineage sorting and contribute to inconsistency.

Max Hill, Sebastien Roch
QT-GILD: Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present QT-GILD, an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing (NLP), which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical data sets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but it can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data. QT-GILD is freely available in open source form at .

Sazan Mahbub, Shashata Sawmya, Arpita Saha, Rezwana Reaz, M. Sohel Rahman, Md. Shamsuzzoha Bayzid
Safety and Completeness in Flow Decompositions for RNA Assembly

Flow decomposition has numerous applications, ranging from networking to bioinformatics. Some applications require any valid decomposition that optimizes some property as number of paths, robustness, or path lengths. Many bioinformatic applications require the specific decomposition which relates to the underlying data that generated the flow. Thus, no optimization criteria guarantees to identify the correct decomposition for real inputs. We propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition.Ma et al. [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, which is referred to as non-identifiability. Later, they gave a quadratic-time algorithm [RECOMB 2021] based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe.We present the first local characterization of safe paths for flow decompositions in directed acyclic graphs, giving a practical algorithm for finding the complete set of safe paths. We also evaluated our algorithm against the trivial safe algorithms (unitigs, extended unitigs) and a popular heuristic (greedy-width) for flow decomposition on RNA transcripts datasets. Despite maintaining perfect precision our algorithm reports $$\approx $$ ≈ 50% higher coverage over trivial safe algorithms. Though greedy-width reports better coverage, it has significantly lower precision on complex graphs. On a unified metric (F-Score) of coverage and precision, our algorithm outperforms greedy-width by $$\approx $$ ≈ 20%, when the evaluated dataset has significant number of complex graphs. Also, it has superior time (3–5 $$\times $$ × ) and space efficiency (1.2–2.2 $$\times $$ × ), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.

Shahbaz Khan, Milla Kortelainen, Manuel Cáceres, Lucia Williams, Alexandru I. Tomescu
NetMix2: Unifying Network Propagation and Altered Subnetworks

A standard paradigm in computational biology is to use interaction networks to analyze high-throughput biological data. Two common approaches for leveraging interaction networks are: (1) network ranking, where one ranks vertices in the network according to both vertex scores and network topology; (2) altered subnetwork identification, where one identifies one or more subnetworks in an interaction network using both vertex scores and network topology. The dominant approach in network ranking is network propagation which smooths vertex scores over the network using a random walk or diffusion process, thus utilizing the global structure of the network. For altered subnetwork identification, existing algorithms either restrict solutions to subnetworks in subnetwork families with simple topological constraints, such as connected subnetworks, or utilize ad hoc heuristics that lack a rigorous statistical foundation. In this work, we unify the network propagation and altered subnetwork approaches. We derive a subnetwork family which we call the propagation family that approximates the subnetworks ranked highly by network propagation. We introduce NetMix2, a principled algorithm for identifying altered subnetworks from a wide range of subnetwork families, including the propagation family, thus combining the advantages of the network propagation and altered subnetwork approaches. We show that NetMix2 outperforms network propagation on data simulated using the propagation family. Furthermore, NetMix2 outperforms other methods at recovering known disease genes in pan-cancer somatic mutation data and in genome-wide association data from multiple human diseases. NetMix2 is publicly available at .

Uthsav Chitra, Tae Yoon Park, Benjamin J. Raphael
Multi-modal Genotype and Phenotype Mutual Learning to Enhance Single-Modal Input Based Longitudinal Outcome Prediction

In recent years, due to the advance of modern sensory devices, the collection of multiple biomedical data modalities such as imaging genetics has gotten feasible, and multimodal data analysis has attracted significant attention in bioinformatics. Although existing multimodal learning methods have shown superior ability in combining data from multiple sources, they are not directly applicable for many real-world biological and biomedical studies that suffer from missing data modalities due to the high expenses of collecting all modalities. Thus, in practice, usually, only a main modality containing a major ‘diagnostic signal’ is used for decision making as auxiliary modalities are not available. In addition, during the examination of a subject regarding a chronic disease (with longitudinal progression) in a visit, typically, two diagnosis-related questions are of main interest that are what their status currently is (diagnosis) and how it will change before their next visit (longitudinal outcome) if they maintain their disease trajectory and lifestyle. Accurate answers to these questions can distinguish vulnerable subjects and enable clinicians to start early treatments for them. In this paper, we propose a new adversarial mutual learning framework for longitudinal prediction of disease progression such that we properly leverage several modalities of data available in training set to develop a more accurate model using single-modal for prediction. Specifically, in our framework, a single-modal model (that utilizes the main modality) learns from a pretrained multimodal model (which takes both main and auxiliary modalities as input) in a mutual learning manner to 1) infer outcome-related representations of the auxiliary modalities based on its own representations for the main modality during adversarial training and 2) effectively combine them to predict the longitudinal outcome. We apply our new method to analyze the retinal imaging genetics for the early diagnosis of Age-related Macular Degeneration (AMD) disease in which we formulate prediction of longitudinal AMD progression outcome of subjects as a classification problem of simultaneously grading their current AMD severity as well as predicting their condition in their next visit with a preselected time duration between visits. Our experiments on the Age-Related Eye Disease Study (AREDS) dataset demonstrate the superiority of our model compared to baselines for simultaneously grading and predicting future AMD severity of subjects.

Alireza Ganjdanesh, Jipeng Zhang, Wei Chen, Heng Huang
Fast, Flexible, and Exact Minimum Flow Decompositions via ILP

Minimum flow decomposition (MFD)—the problem of finding a minimum set of paths that perfectly decomposes a flow—is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g. RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming (ILP) formulations of such practical variants, having the major limitation they need to encode all the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances, and cannot be efficiently generalized to practical MFD variants.In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach runs in under 13 s on average. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors.We hope that our results can remove the current tradeoff between the complexity of a multiassembly model and its tractability, and can lie at the core of future practical RNA assembly tools. Our implementations are freely available at .

Fernando H. C. Dias, Lucia Williams, Brendan Mumey, Alexandru I. Tomescu
Co-linear Chaining with Overlaps and Gap Costs

Co-linear chaining has proven to be a powerful heuristic for finding near-optimal alignments of long DNA sequences (e.g., long reads or a genome assembly) to a reference. It is used as an intermediate step in several alignment tools that employ a seed-chain-extend strategy. Despite this popularity, efficient subquadratic-time algorithms for the general case where chains support anchor overlaps and gap costs are not currently known. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in $$\tilde{O}(n)$$ O ~ ( n ) time, where n denotes the count of anchors. We also establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal ‘anchored’ edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that optimal co-linear chaining cost under the proposed cost function can be computed orders of magnitude faster than edit distance, and achieves correlation coefficient above 0.9 with edit distance for closely as well as distantly related sequences.

Chirag Jain, Daniel Gibney, Sharma V. Thankachan
The Complexity of Approximate Pattern Matching on de Bruijn Graphs

Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to Computational Biology. For finding a walk in an arbitrary graph with |E| edges that exactly matches a pattern of length m, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies an algorithm significantly faster than $$\mathcal {O}(|E|m)$$ O ( | E | m ) time is unlikely [Equi et al., ICALP 2019]. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time [Bowe et al., WABI 2012]. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is again solvable in $$\mathcal {O}(|E|m)$$ O ( | E | m ) time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets [Jain et al., RECOMB 2019]. These results hold even when edits are restricted to only substitutions. Despite the popularity of de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on de Bruijn graphs remained open. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. In addition, we demonstrate that an algorithm significantly faster than $$\mathcal {O}(|E|m)$$ O ( | E | m ) is unlikely for de Bruijn graphs in the case where only substitutions are allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, like on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic $$\tilde{O}(n\sqrt{m})$$ O ~ ( n m ) time, where n is the text’s length [Abrahamson, SIAM J. Computing 1987].

Daniel Gibney, Sharma V. Thankachan, Srinivas Aluru
ProTranslator: Zero-Shot Protein Function Prediction Using Textual Description

Accurately finding proteins and genes that have a certain function is the prerequisite for a broad range of biomedical applications. Despite the encouraging progress of existing computational approaches in protein function prediction, it remains challenging to annotate proteins to a novel function that is not collected in the Gene Ontology and does not have any annotated proteins. This limitation, a “side effect” from the widely-used multi-label classification problem setting of protein function prediction, hampers the progress of studying new pathways and biological processes, and further slows down research in various biomedical areas. Here, we tackle this problem by annotating proteins to a function only based on its textual description so that we don’t need to know any associated proteins for this function. The key idea of our method ProTranslator is to redefine protein function prediction as a machine translation problem, which translates the description word sequence of a function to the amino acid sequence of a protein. We can then transfer annotations from functions that have similar textual description to annotate a novel function. We observed substantial improvement in annotating novel functions and sparsely annotated functions on CAFA3, SwissProt and GOA datasets. We further demonstrated how our method accurately predicted gene members for a given pathway in Reactome, KEGG and MSigDB only based on the pathway description. Finally, we showed how ProTranslator enabled us to generate the textual description instead of the function label for a set of proteins, providing a new scheme for protein function prediction. We envision ProTranslator will give rise to a protein function “search engine” that returns a list of proteins based on the free text queried by the user.

Hanwen Xu, Sheng Wang

Short Papers

Single-Cell Multi-omic Velocity Infers Dynamic and Decoupled Gene Regulation

Computational approaches can leverage single-cell snapshots to infer sequential gene expression changes during developmental processes. For example, cell trajectory inference algorithms use pairwise cell similarities to map cells onto a “pseudotime” axis corresponding to predicted developmental progress. However, trajectory inference based on similarity cannot predict the directions or relative rates of cellular transitions.

Chen Li, Maria Virgilio, Kathleen L. Collins, Joshua D. Welch
Ultrafast and Interpretable Single-Cell 3D Genome Analysis with Fast-Higashi

The advent of high-throughput whole-genome mapping methods for the three-dimensional (3D) genome organization such as Hi-C has revealed distinct features of chromatin folding in various scales within the cell nucleus. These multiscale 3D genome features collectively contribute to vital genome functions such as transcription.

Ruochi Zhang, Tianming Zhou, Jian Ma
DiffDomain Enables Identification of Structurally Reorganized Topologically Associating Domains

The recent development of mapping technologies, such as Hi-C, that probes the 3D genome organization reveals that a chromosome is divided into topologically associating domains (TADs). TADs are genomic regions where chromatin loci are more frequently interacting with chromatin loci from the same TADs rather than from other TADs. TADs are functional units for transcriptional regulation, such as constraining interactions between enhancers and promoters.

Dunming Hua, Ming Gu, Yanyi Du, Li Qi, Xiangjun Du, Zhidong Bai, Xiaopeng Zhu, Dechao Tian
Joint Inference of Repeated Evolutionary Trajectories and Patterns of Clonal Exclusivity or Co-occurrence from Tumor Mutation Trees

Cancer progression is an evolutionary process shaped by both deterministic and stochastic forces. Recent advances in multi-region sequencing, single-cell sequencing, and phylogenetic tree inference empower more precise reconstruction of the mutational history of each tumor. At the same time, the increased resolution also highlights the extensive diversity across tumors and patients. Resolving the interactions among mutations and recovering the recurrent evolutionary processes may offer greater opportunities for designing successful therapeutic strategies.

Xiang Ge Luo, Jack Kuipers, Niko Beerenwinkel
Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds

We present a novel A $$^\star $$ ⋆ seed heuristic that enables fast and optimal sequence-to-graph alignment, guaranteed to minimize the edit distance of the alignment assuming non-negative edit costs.We phrase optimal alignment as a shortest path problem and solve it by instantiating the A $$^\star $$ ⋆ algorithm with our seed heuristic. The seed heuristic first extracts non-overlapping substrings (seeds) from the read, finds exact seed matches in the reference, marks preceding reference positions by crumbs, and uses the crumbs to direct the A $$^\star $$ ⋆ search. The key idea is to punish paths for the absence of foreseeable seed matches. We prove admissibility of the seed heuristic, thus guaranteeing alignment optimality.Our implementation extends the free and open source aligner and demonstrates that the seed heuristic outperforms all state-of-the-art optimal aligners including GraphAligner, Vargas, PaSGAL, and the prefix heuristic previously employed by AStarix. Specifically, we achieve a consistent speedup of >60 $$\times $$ × on both short Illumina reads and long HiFi reads (up to 25 kbp), on both the E. coli linear reference genome (1 Mbp) and the MHC variant graph (5 Mbp). Our speedup is enabled by the seed heuristic consistently skipping >99.99% of the table cells that optimal aligners based on dynamic programming compute.AStarix Aligner and Evaluations: .Full Paper: .

Pesho Ivanov, Benjamin Bichsel, Martin Vechev
CLMB: Deep Contrastive Learning for Robust Metagenomic Binning

The reconstruction of microbial genomes from large metagenomic datasets is a critical procedure for finding uncultivated microbial populations and defining their microbial functional roles. To achieve that, we need to perform metagenomic binning, clustering the assembled contigs into draft genomes. Despite the existing computational tools, most of them neglect one important property of the metagenomic data, that is, the noise. To further improve the metagenomic binning step and reconstruct better metagenomes, we propose a deep Contrastive Learning framework for Metagenome Binning (CLMB), which can efficiently eliminate the disturbance of noise and produce more stable and robust results. Essentially, instead of denoising the data explicitly, we add simulated noise to the training data and force the deep learning model to produce similar and stable representations for both the noise-free data and the distorted data. Consequently, the trained model will be robust to noise and handle it implicitly during usage. CLMB outperforms the previous state-of-the-art binning methods significantly, recovering the most near-complete genomes on almost all the benchmarking datasets (up to 17% more reconstructed genomes compared to the second-best method). It also improves the performance of bin refinement, reconstructing 8–22 more high-quality genomes and 15–32 more middle-quality genomes more than the second-best result. Impressively, in addition to being compatible with the binning refiner, single CLMB even recovers on average 15 more HQ genomes than the refiner of VAMB and Maxbin on the benchmarking datasets. On a real mother-infant microbiome dataset with 110 samples, CLMB is scalable and practical to recover 365 high-quality and middle-quality genomes (including 21 new ones), providing insights into the microbiome transmission. CLMB is open-source and available at .

Pengfei Zhang, Zhengyuan Jiang, Yixuan Wang, Yu Li
Unsupervised Cell Functional Annotation for Single-Cell RNA-Seq

One of the first steps in the analysis of single cell RNA-Sequencing data (scRNA-Seq) is the assignment of cell types.

Dongshunyi Li, Jun Ding, Ziv Bar-Joseph
A Novel Matrix Factorization Model for Interpreting Single-Cell Gene Expression from Biologically Heterogeneous Data

Single-cell RNA sequencing (scRNA-seq) technologies enable gene expression measurement at a single-cell resolution, and have opened a new frontier to understand animal development, physiology, and disease-associated molecular mechanisms [1, 2].

Kun Qian, Shiwei Fu, Hongwei Li, Wei Vivian Li
Tractable and Expressive Generative Models of Genetic Variation Data

Generative models of genetic sequence data play a central role in population genomics.

Meihua Dang, Anji Liu, Xinzhu Wei, Sriram Sankararaman, Guy Van den Broeck
Concert: Genome-Wide Prediction of Sequence Elements That Modulate DNA Replication Timing

Proper control of the replication timing (RT) program is of vital importance to maintain genome integrity.

Yang Yang, Yuchuan Wang, Yang Zhang, Jian Ma
CORSID Enables de novo Identification of Transcription Regulatory Sequences and Genes in Coronaviruses

Coronaviruses are comprised of a single-stranded RNA genome that is ready to be translated by the host ribosome.

Chuanyi Zhang, Palash Sashittal, Mohammed El-Kebir
Learning Probabilistic Protein-DNA Recognition Codes from DNA-Binding Specificities Using Structural Mappings

Knowledge of how proteins interact with DNA is essential for understanding gene regulation.

Joshua L. Wetzel, Kaiqian Zhang, Mona Singh
Uncertainty Quantification Using Subsampling for Assembly-Free Estimates of Genomic Distance and Phylogenetic Relationships

Computing the distance between two genomes, often without using alignments or even access to assembled sequences, is fundamental to many downstream analyses.

Eleonora Rachtman, Shahab Sarmashghi, Vineet Bafna, Siavash Mirarab
SOPHIE: Viral Outbreak Investigation and Transmission History Reconstruction in a Joint Phylogenetic and Network Theory Framework

Reconstruction of transmission networks from viral genomes sampled from infected individuals is a major computational problem of genomic epidemiology. For this problem, we propose a maximum likelihood framework SOPHIE (SOcial and PHilogenetic Investigation of Epidemics) based on the integration of phylogenetic and random graph models. SOPHIE is scalable, accounts for intra-host diversity and accurately infers transmissions without case-specific epidemiological data.

Pavel Skums, Fatemeh Mohebbi, Vyacheslav Tsyvina, Pelin Icer, Sumathi Ramachandran, Yury Khudyakov
Identifying Systematic Variation at the Single-Cell Level by Leveraging Low-Resolution Population-Level Data

A major limitation in single-cell genomics is a lack of ability to conduct cost-effective population-level studies. As a result, much of the current research in single-cell genomics focuses on biological processes that are broadly conserved across individuals, such as cellular organization and tissue development. This limitation prevents us from studying the etiology of experimental or clinical conditions that may be inconsistent across individuals owing to molecular variation and a wide range of effects in the population. In order to address this gap, we developed “kernel of integrated single cells” (Keris), a novel model-based framework to inform the analysis of single-cell gene expression data with population-level effects of a condition of interest. By inferring cell-type-specific moments and their variation across conditions using large tissue-level bulk data representing a population, Keris allows us to generate testable hypotheses at the single-cell level that would otherwise require collecting single-cell data from a large number of donors. Within the Keris framework, we show how the combination of low-resolution, large bulk data with small but high-resolution single-cell data enables the identification of changes in cell-subtype compositions and the characterization of subpopulations of cells that are affected by a condition of interest. Using Keris we estimate linear and non-linear age-associated changes in cell-type expression in large bulk peripheral blood mononuclear cells (PBMC) data. Combining with three independent single-cell PBMC datasets, we demonstrate that Keris can identify changes in cell-subtype composition with age and capture cell-type-specific subpopulations of senescent cells. This demonstrates the promise of enhancing single-cell data with population-level information to study compositional changes and to profile condition-affected subpopulations of cells, and provides a potential resource of targets for future clinical interventions.

Elior Rahmani, Michael I. Jordan, Nir Yosef
Belayer: Modeling Discrete and Continuous Spatial Variation in Gene Expression from Spatially Resolved Transcriptomics

Motivation: Spatially resolved transcriptomics (SRT) technologies simultaneously measure gene expression and spatial location of cells in a 2D tissue slice, enabling the study of spatial patterns of gene expression.

Cong Ma, Uthsav Chitra, Shirley Zhang, Benjamin J. Raphael
Lossless Indexing with Counting de Bruijn Graphs

Motivation: High-throughput sequencing data is rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing.

Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, André Kahles
Uncovering Hidden Assembly Artifacts: When Unitigs are not Safe and Bidirected Graphs are not Helpful (ABSTRACT)

Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources.

Amatur Rahman, Paul Medvedev
Mapping Single-Cell Transcriptomes to Copy Number Evolutionary Trees

Cancer is an evolutionary process in which cells accumulate mutations and form different subclones with potentially varying phenotypes. The diversity of cell states that make up a tumor is caused by genomic and epigenomic changes, as well as interactions with its microenvironment. The resulting intra-tumor heterogeneity is a major cause of treatment failure and relapse. In particular, the manner in which different genomic changes, as opposed to other factors, contribute to cell-specific states is highly relevant in treatment decision making, but remains an open question.

Pedro F. Ferreira, Jack Kuipers, Niko Beerenwinkel
ImmunoTyper-SR: A Novel Computational Approach for Genotyping Immunoglobulin Heavy Chain Variable Genes Using Short Read Data

The human immunoglobulin heavy chain (IGH) locus on chromosome 14 includes more than 40 functional copies of the variable gene (IGHV).

Michael Ford, Ananth Hari, Oscar Rodriguez, Junyan Xu, Justin Lack, Cihan Oguz, Yu Zhang, Sarah Weber, Mary Magliocco, Jason Barnett, Sandhya Xirasagar, Smilee Samuel, Luisa Imberti, Paolo Bonfanti, Andrea Biondi, Clifton L. Dalgard, Stephen Chanock, Lindsey Rosen, Steven Holland, Helen Su, Luigi Notarangelo, Uzi Vishkin, Corey Watson, S. Cenk Sahinalp
AutoComplete: Deep Learning-Based Phenotype Imputation for Large-Scale Biomedical Data

Biomedical datasets that aim to collect diverse phenotypic and genomic data across large numbers of individuals are plagued by the large fraction of missing data The ability to accurately impute or “fill-in” missing entries in these datasets is critical to a number of downstream applications.

Ulzee An, Na Cai, Andy Dahl, Sriram Sankararaman
Resistor: An Algorithm for Predicting Resistance Mutations Using Pareto Optimization over Multistate Protein Design and Mutational Signatures

Resistance to pharmacological treatments, such as to antibiotics or cancer therapeutics, ablates the efficacy of drugs and is a major public health challenge [2, 7]. Accurate, prospective prediction of resistance mutations would allow for design of drugs less susceptible to conferred resistance. Here we report Resistor—a novel structure- and sequence-based algorithm for predicting resistance mutations.

Nathan Guerin, Teresa Kaserer, Bruce R. Donald
Ultra High Diversity Factorizable Libraries for Efficient Therapeutic Discovery

The successful discovery of novel biological therapeutics by selection requires highly diverse libraries of candidate sequences that contain a high proportion of desirable candidates.

Zheng Dai, Sachit D. Saksena, Geraldine Horny, Christine Banholzer, Stefan Ewert, David K. Gifford
Research in Computational Molecular Biology
herausgegeben von
Itsik Pe'er
Electronic ISBN
Print ISBN

Premium Partner