Evolution

Frontmatter

On the Comparison of Bacteriophage Populations

Abstract

The production of cheese and other dairy products relies on the constant monitoring of viruses, called bacteriophages, that attack the organisms responsible for the fermentation process. Bacteriophage species are characterized by a stable core genome, and a ‘genetic reservoir’ of gene variants that are exchanged through recombination. Phylogenetic analysis of phage populations are notably difficult due not only to extreme levels of horizontal exchange at the borders of functional modules, but also inside of them.

In this paper we present the first known attempt at directly modeling gene flux between phage populations. This represents an important departure from gene-based alignment and phylogenetic reconstruction, shifting focus to a genetic reservoir-based evolutionary inference. We present a combinatorial framework for the comparison of bacteriophage populations, and use it to compute recombination scenarios that generate one population from another. We apply our heuristic, based on this framework, to four populations sampled from Dutch dairy factories by Murphy [14]. We find that, far from being random, these scenarios are highly constrained. We use our method to test for factory-specific diversity, and find that there was likely a large amount of recombination in the ancestral population.

Find instructions for reproducing the results at: https://bitbucket.org/thekswenson/phage_population_comparison

The code is publicly available at: https://bitbucket.org/thekswenson/phagerecombination

Anne Bergeron, Marie-Jean Meurs, Romy Valiquette-Labonté, Krister M. Swenson

Syntenic Dimensions of Genomic Evolution

Abstract

We compare several types of evolutionary divergence of synteny blocks: sequence level divergence of the genes in the block, loss of genes affecting the length and structure of the blocks and spatial position of the block in relation to the chromosomal centromere, and suggest other dimensions, such as the predominant functional characteristic of the genes in the block. We focus on the evolutionary history of the allotetraploid Coffea arabica genome and its two progenitor genomes through three major genomic events spanning 120 million years.

Zhe Yu, David Sankoff

Phylogenetics

Frontmatter

Fast and Accurate Branch Support Calculation for Distance-Based Phylogenetic Placements

Abstract

Placing a new sequence onto an existing phylogenetic tree is increasingly used in downstream applications ranging from microbiome analyses to epidemic tracking. Most such applications deal with noisy data, incomplete references, and model misspecifications, all of which make the correct placement uncertain. While recent placement methods have increasingly enabled placement on ultra-large backbone trees with tens to hundreds of thousands of species, they have mostly ignored the issue of uncertainty. Here, we build on the recently developed distance-based phylogenetic placement methodology and show how the distribution of placements can be estimated per input sequence. We compare parametric and non-parametric sampling methods, showing that non-parametric bootstrapping is far more accurate in estimating uncertainty. Finally, we design and implement a linear algebraic implementation of bootstrapping that makes it faster, and we incorporate the computation of support values as a new feature in the APPLES software.

Navid Bin Hasan, Avijit Biswas, Metin Balaban, Siavash Mirarab, Md. Shamsuzzoha Bayzid

The Sackin Index of Simplex Networks

Abstract

A phylogenetic network is a simplex network if the child of every reticulation node is a network leaf and each tree node has at most one reticulation node as its child. Simplex networks are a superclass of phylogenetic trees and a subclass of tree-child networks. Generalizing the Sackin index to phylogenetic networks, we prove that the expected Sackin index of a random simplex network is asymptotically \(\varTheta (n^{7/4})\) in the uniform model.

Louxin Zhang

Phylogenetic Placement Problem: A Hyperbolic Embedding Approach

Abstract

Phylogenetic trees define a metric space over their vertices, an observation that underlines distance-based phylogenetic inference. Several authors, including Layer and Rhodes (2017), have noted that we can embed leaves of a phylogenetic tree into high-dimensional Euclidean spaces in such a way that it minimizes the distortion of the tree distances. Jiang et al. (2021) use a deep learning approach to build a mapping from the space of sequences to the Euclidean space such that the mapped sequences accurately preserve the leaf distances on a given tree. Their tool, DEPP, uses this map to place a new query sequence onto the tree by first embedding it, an idea that was particularly promising for updating a species tree given data from a single gene despite the potential discordance of the gene tree and the species tree. In focusing on Euclidean spaces, these recent papers have ignored the strong theory that suggests hyperbolic spaces are more appropriate for embedding vertices of a tree. In this paper, we show that by moving to hyperbolic spaces and addressing challenges related to non-linearity and precision, we can reduce the distortion of distances for any given number of dimensions. The distortion of distances obtained using hyperbolic embeddings is lower than Euclidean embeddings with the same number of dimensions, both in training (backbone) and testing (query). The low-distortion distances of embeddings result in better topological accuracy in updating species trees using a single gene compared to its Euclidean counterpart. It also improves accuracy in placing queries for some datasets but not all.

Yueyu Jiang, Puoya Tabaghi, Siavash Mirarab

Phylogenetic Network Dissimilarity Measures that Take Branch Lengths into Account

Abstract

Dissimilarity measures for phylogenetic trees have long been used for analyzing inferred trees and understanding the performance of phylogenetic methods. Given their importance, a wide array of such measures have been developed, some of which are based on the tree topologies alone, and others that also take branch lengths into account. Similarly, a number of dissimilarity measures of phylogenetic networks have been developed in the last two decades. However, to the best of our knowledge, all these measures are based solely on the topologies of phylogenetic networks and ignore branch lengths. In this paper, we propose two phylogenetic network dissimilarity measures that take both topology and branch lengths into account. We demonstrate the behavior of these two measures on pairs of related networks. Furthermore, we show how these measures can be used to cluster a set of phylogenetic networks obtained by an inference method, illustrating this application on the posterior sample of phylogenetic networks. Both measures are implemented in the publicly available software package PhyloNet.

Berk A. Yakici, Huw A. Ogilvie, Luay Nakhleh

Homology and Reconciliation

Frontmatter

The Complexity of Finding Common Partitions of Genomes with Predefined Block Sizes

Abstract

Partitioning genomes into syntenic blocks has many uses in comparative genomics, such as inferring phylogenies or ancestral gene order. These blocks are usually required to contain enough genes to be qualified as syntenic. This leads to the problem of finding a common partition of the genomes in which the size of the blocks are above a certain threshold (usually at least two). When this is not feasible, one can ask to remove a minimum number of “noisy” genes so that such a partition exists. This is known as the Strip Recovery problem and is similar to the well-known Minimum Common String Partition problem, but also quite different since the latter has no restriction on the block sizes.

The algorithmic aspects of Strip Recovery are not well-understood, especially in the presence of duplicated genes. In this work, we present several new complexity results. First, we solve an open problem mentioned by Bulteau and Weller in 2019 who asked whether, in polynomial time, one can decide if a common partition with block sizes at least two can be achieved without deleting any genes. We show that the problem is actually NP-hard for any fixed list of allowed block sizes, unless blocks sizes are all multiples of the minimum allowed size. The problem is also hard on fixed alphabets if this list is part of the input. However, the minimum number of required gene deletions can be found in polynomial time if both the allowed blocks sizes and alphabet are fixed.

Manuel Lafond, Adiesha Liyanage, Binhai Zhu, Peng Zou

Reconciliation with Segmental Duplication, Transfer, Loss and Gain

Abstract

We generalize the reconciliation approach, used for inferring the evolution of a single gene family given a species tree, to groups of co-localized genes, also called syntenies. More precisely, given a set \({\mathcal X}\) of syntenies in a set \(\varSigma \) of genomes, a tree T for \({\mathcal X}\) and a tree S for \(\varSigma \), the problem is to find a most parsimonious history for \({\mathcal X}\) with respect to a given evolutionary model. We extend a previous model involving segmental duplications and losses, to also include segmental horizontal gene transfers (HGTs) and gene gains. We present a polynomial-time dynamic programming algorithm to solve the problem. We apply it to CRISPR-associated (Cas) gene syntenies. These genes are part of CRISPR-Cas systems, one of its members (CRISPR-Cas9) well-known as currently the most reliable and accurate molecular scissor technology for genome editing. The inferred evolutionary scenario is a plausible explanation of the diversification of this system into its different types. An implementation of the algorithm presented in this paper is available at: https://github.com/UdeM-LBIT/superrec2/releases/tag/rcg2022.

Yoann Anselmetti, Mattéo Delabre, Nadia El-Mabrouk

Quantifying Hierarchical Conflicts in Homology Statements

Abstract

A fundamental step in any comparative whole genome analysis is the annotation of homology relationships between segments of the genomes. Traditionally, this annotation has been based on coding segments, where orthologous genes are inferred and then syntenic blocks are computed by agglomerating sets of homologous genes into homologous regions. More recently, whole genomes, including intergenic regions, are being aligned de novo as whole genome alignments (WGA). In this article we develop a test to measure to what extent sets of homology relationships given by two different software are hierarchically related to one another, where matched segments from one software may contain matched segments from the other and vice versa. Such a test should be used as a sanity check for an agglomerative syntenic block software, and provides a mapping between the blocks that can be used for further downstream analyses. We show that, in practice, it is rare that two collections of homology relationships are perfectly hierarchically related. Therefore we present an optimization problem to measure how far they are from being so. We show that this problem, which is a generalization of the assignment problem, is NP-Hard and give a heuristic solution and implementation. We apply our distance measure to data from the Alignathon competition, as well as to Mycobacterium tuberculosis, showing that many factors affect how hierarchically related two collections are, including sensitivities to guide trees and the use or omission of an outgroup. These findings inform practitioners on the pitfalls of homology relationship inference, and can inform development of more robust inference tools.

Krister M. Swenson, Afif Elghraoui, Faramarz Valafar, Siavash Mirarab, Mathias Weller

On Partial Gene Transfer and Its Impact on Gene Tree Reconstruction

Abstract

Horizontal transfer of genetic material between different organisms is one of the most important evolutionary processes in microbial evolution. Such horizontal transfer events can result in the transfer of genomic fragments containing multiple complete genes, complete single genes, or partial genes. However, partial gene transfer (PGT) remains poorly understood and generally underappreciated. Indeed, existing phylogenetic approaches for studying microbial evolution and horizontal gene transfer largely ignore PGT, leading to potential biases and errors in downstream inferences.

In this work, we (i) perform a systematic study of the impact of PGT on the ability to correctly reconstruct the evolutionary histories of gene families (i.e., gene trees) and (ii) propose a simple, yet effective approach, called trippd, to detect if a given gene family has been affected by PGT. Our analysis, using simulated and real biological datasets, reveals many interesting insights related to when and how PGT affects gene tree reconstruction, demonstrates the utility of trippd, and sheds light on the importance of detecting and accounting for PGT when studying microbial evolution.

Sumaira Zaman, Mukul S. Bansal

Genome Rearrangements

Frontmatter

Sorting by k-Cuts on Signed Permutations

Abstract

Sorting by Genome Rearrangements is a classic problem in Computational Biology. Several models have been considered so far, each of them defines how a genome is modeled (for example, permutations when assuming no duplicated genes, strings if duplicated genes are allowed, and/or use of signs on each element when gene orientation is known), and which rearrangements are allowed. Recently, a new problem, called Sorting by Multi-Cut Rearrangements, was proposed. It uses the k-Cut rearrangement which cuts a permutation (or a string) at \(k \ge 2\) places and rearranges the generated blocks to obtain a new permutation (or string) of same size. This new rearrangement may model chromoanagenesis, a phenomenon consisting of massive simultaneous rearrangements. Similarly as the Double-Cut-and-Join, this new rearrangement also generalizes several genome rearrangements such as reversals, transpositions, revrevs, transreversals, and block-interchanges. In this paper, we extend a previous work based on unsigned permutations and strings to signed permutations. We show the complexity of this problem for different values of k, that the approximation algorithm proposed for unsigned permutations with any value of k can be adapted to signed permutations, and a 1.5-approximation algorithm for the specific case \(k=4\).

Andre Rodrigues Oliveira, Alexsandro Oliveira Alexandrino, Géraldine Jean, Guillaume Fertin, Ulisses Dias, Zanoni Dias

A New Approach for the Reversal Distance with Indels and Moves in Intergenic Regions

Abstract

Genome rearrangement events are widely used to estimate a minimum-size sequence of mutations capable of transforming a genome into another. The length of this sequence is called distance, and determining it is the main goal in genome rearrangement distance problems. Problems in the genome rearrangement field differ regarding the set of rearrangement events allowed and the genome representation. In this work, we consider the scenario where the genomes share the same set of genes, gene orientation is known, and intergenic regions (structures between a pair of genes and at the extremities of the genome) are taken into account. We use two models, the first model allows only conservative events (reversals and moves), and the second model includes non-conservative events (insertions and deletions) in the intergenic regions. We show that both models result in NP-hard problems and we present algorithms with an approximation factor of 2.

Klairton Lima Brito, Andre Rodrigues Oliveira, Alexsandro Oliveira Alexandrino, Ulisses Dias, Zanoni Dias

Chromothripsis Rearrangements Are Informed by 3D-Genome Organization

Abstract

Chromothripsis is a mutational phenomenon representing a unique type of tremendously complex genomic structural alteration. It was initially described and was broadly observed in cancer with lower frequencies in other genetic disorders. Chromothripsis manifests massive genomic structural alterations during a single catastrophic event in the cell. It is considered to be characterized by the simultaneous shattering of chromosomes followed by random reassembly of the DNA fragments, ultimately resulting in newly formed, mosaic derivative chromosomes and with a potential for a drastic oncogenic transformation. Here, we consider a question of whether the genomic locations involved in chromothripsis rearrangements’ are randomly distributed in 3D genomic packing space or have some spatial organization’s predispositions. To that end, we investigated the structural variations (SVs) observed in previously sequenced cancer genomes via juxtaposition of involved breakpoints onto the Hi-C contact genome map of normal tissue. We found that the average Hi-C contact score for SVs breakpoints appearing at the same chromosome (cis-SVs) in an individual patient is significantly higher than the average Hi-C matrix signal, which indicates that SVs tend to involve spatially proximal regions of the chromosome. Furthermore, we overlaid the chromothripsis annotation of groups of SVs’ breakpoints and demonstrated that the Hi-C signals for both chromothripsis breakpoint regions as well as regular SVs breakpoints are statistically significantly higher than random control, suggesting that chromothripsis cis-SVs have the same tendency to rearrange the proximal sites in 3D-genome space. Last but not least, our analysis revealed a statistically higher Hi-C score for all pairwise combinations of breakpoints involved in chromothripsis event when compared to both background Hi-C signal as well as to combination of non-chromothripsis breakpoint pairs. This observation indicates that breakpoints could be assumed to describe a given chromothripsis 3D-cluster as a proximal bundle in genome space. These results provide valuable new insights into the spatial relationships of the SVs loci for both chromothripsis and regular genomic alterations, laying the foundation for the development of a more precise method for chromothripsis identification and annotation.

Natalia Petukhova, Alexey Zabelkin, Vitaly Dravgelis, Sergey Aganezov, Nikita Alexeev

Metagenomics

Frontmatter

Using Computational Synthetic Biology Tools to Modulate Gene Expression Within a Microbiome

Abstract

The microbiome is an interconnected network of microorganisms, which exist and influence a wide array of natural and synthetic environments. Genetic information is constantly spread across the members of the microbial community in a process called horizontal gene transfer, causing exposure of genetic alterations and modifications to all members of the community.

In order to accurately and effectively engineer microbiomes, genetic modifications must be introduced to certain species, as selectivity is a key factor in creating and fixing functional abilities within microbial environments. Moreover, introduction of genes into unwanted hosts may cause unprecedented ecological impacts, posing a major biosafety issue. Technologies in the field are usually experimentally developed for a specific host or environment, and the lack of automization and generalization limit them to a specific microbiome. Additionally, they only deal with the transformation process itself at best and do not modulate the different elements of the genetic material, neglecting considerations related to the interactions between the new genetic material and the population.

This work presents a set of computational models that automatically design a microbiome-specific plasmid that is selectively expressed in certain parts of the bacterial population. The underlying algorithm fine-tunes genetic information to be optimally expressed in the wanted hosts of the plasmid, while simultaneously impairing expression in unwanted hosts. We take into account and selectively optimize the main elements linked to gene expression and heredity. In addition, we have provided both in-silico and in-vitro analysis supporting our claim. This study was part of the TAU IGEM 2021 project (https://2021.igem.org/Team:TAU_Israel).

Liyam Chitayat Levi, Ido Rippin, Moran Ben Tulila, Rotem Galron, Tamir Tuller

Metagenomics Binning of Long Reads Using Read-Overlap Graphs

Abstract

Metagenomics sequencing enables the direct study of microbial communities revealing important information such as taxonomy and relative abundance of species. Metagenomics binning facilitates the separation of these genetic materials into different taxonomic groups. Moving from second-generation sequencing to third-generation sequencing techniques enables the binning of reads before assembly thanks to the increased read lengths. The limited number of long-read binning tools that exist, still suffer from unreliable coverage estimation for individual long reads and face challenges in recovering low-abundance species. In this paper, we present a novel binning approach to bin long reads using the read-overlap graph. The read-overlap graph (1) enables a fast and reliable estimation of the coverage of individual long reads; (2) allows to incorporate the overlapping information between reads into the binning process; (3) facilitates a more uniform sampling of long reads across species of varying abundances. Experimental results show that our new binning approach produces better binning results of long reads and results in better assemblies especially for recovering low abundant species. The source code and a functional Google Colab Notebook are available at https://www.github.com/anuradhawick/oblr.

Anuradha Wickramarachchi, Yu Lin

A Mixed Integer Linear Programming Algorithm for Plasmid Binning

Abstract

The problem of analysing bacterial isolates in order to detect plasmids has been widely studied. With the development of Whole Genome Sequencing (WGS) technologies, several approaches have been proposed to bin contigs into putative plasmids. Reference-based approaches aim to bin contigs by mapping or comparing their sequences against databases of previously identified plasmids or plasmid genes. On the other hand, de novo approaches use contig features such as read coverage and length for plasmid binning. Hybrid approaches that combine both strategies have also been proposed recently.

We present PlasBin a mixed integer linear programming based hybrid approach for plasmid binning. We evaluate the performance of several binning methods on a real data set of bacterial samples.

Aniket Mane, Mahsa Faizrahnemoon, Cedric Chauve

Genomic Sequencing

Frontmatter

Benchmarking Penalized Regression Methods in Machine Learning for Single Cell RNA Sequencing Data

Abstract

Single Cell RNA Sequencing (scRNA-seq) technology has enabled the biological research community to explore gene expression at a single-cell resolution. By studying differences in gene expression, it is possible to differentiate cell clusters and types within tissues. One of the major challenges in a scRNA-seq study is feature selection in high dimensional data. Several statistical and machine learning algorithms are available to solve this problem, but their performances across data sets lack systematic comparison. In this research, we benchmark different penalized regression methods, which are suitable for scRNA-seq data. Results from four different scRNA-seq data sets show that Sparse Group Lasso (SGL) implemented by the SGL package in R performs better than other methods in terms of area under the receiver operating curve (AUC). The computation time for different algorithms varies between data sets with SGL having the least average computation time. Based on our findings, we propose a new method that applies SGL on a smaller pre-selected subset of genes to select the differentially expressed genes in scRNA-seq data. The reduction in the number of genes before SGL reduce the computation hardware requirement from 32 GB RAM to 8 GB RAM. The proposed method also demonstrates a consistent improvement in AUC over SGL.

Bhavithry Sen Puliparambil, Jabed Tomal, Yan Yan

Deciphering the Tissue-Specific Regulatory Role of Intronless Genes Across Cancers

Abstract

Intronless genes (IGs) or single-exon genes lacking introns are found across Eukaryotes. IGs are not regulated by the splicing machinery and may be subject to lower post-transcriptional gene expression variability. Therefore, IGs might be potential candidates for biomarkers with better predictability and easier regulation as targets for therapy. Cancer is a complex disease that relies on progressive uncontrolled cell division linked with multiple dysfunctional biological processes. Tumor heterogeneity remains the most challenging feature in cancer diagnosis and treatment. Given the clinical relevance of IGs, we aim to identify their unique expression profiles and interactome, that may act as functional signatures across eight different cancers. We identified 940 protein-coding IGs in the human genome, of which about 35% were differentially expressed across the analyzed cancer datasets. Specifically, 78% of differentially expressed IGs were undergoing transcriptional reprogramming with elevated expression in tumor cells. Remarkably, in all the studied tumors, a highly conserved induction of a group of deacetylase-histones located in a region of chromosome 6 enriched in nucleosome and chromatin condensation processes. This study highlights that differentially expressed human intronless genes across cancer types are prevalent in epigenetic regulatory roles participating in specific protein-protein interaction (PPI) networks for ESCA, GBM, and LUAD tumors. We determine that IGs play a key role in the tumor phenotype at transcriptional and post-transcriptional levels, with important mechanisms such as interactomics rewiring.

Katia Aviña-Padilla, José Antonio Ramírez-Rafael, Octavio Zambada-Moreno, Gabriel Emilio Herrera-Oropeza, Guillermo Romero, Ishaan Gupta, Maribel Hernández-Rosales

Springer Professional

Comparative Genomics

19th International Conference, RECOMB-CG 2022, La Jolla, CA, USA, May 20–21, 2022, Proceedings

About this book

Table of Contents

Frontmatter

Evolution

Frontmatter

On the Comparison of Bacteriophage Populations

Syntenic Dimensions of Genomic Evolution

Phylogenetics

Frontmatter

Fast and Accurate Branch Support Calculation for Distance-Based Phylogenetic Placements

The Sackin Index of Simplex Networks

Phylogenetic Placement Problem: A Hyperbolic Embedding Approach

Phylogenetic Network Dissimilarity Measures that Take Branch Lengths into Account

Homology and Reconciliation

Frontmatter

The Complexity of Finding Common Partitions of Genomes with Predefined Block Sizes

Reconciliation with Segmental Duplication, Transfer, Loss and Gain

Quantifying Hierarchical Conflicts in Homology Statements

On Partial Gene Transfer and Its Impact on Gene Tree Reconstruction

Genome Rearrangements

Frontmatter

Sorting by k-Cuts on Signed Permutations

A New Approach for the Reversal Distance with Indels and Moves in Intergenic Regions

Chromothripsis Rearrangements Are Informed by 3D-Genome Organization

Metagenomics

Frontmatter

Using Computational Synthetic Biology Tools to Modulate Gene Expression Within a Microbiome

Metagenomics Binning of Long Reads Using Read-Overlap Graphs

A Mixed Integer Linear Programming Algorithm for Plasmid Binning

Genomic Sequencing

Frontmatter

Benchmarking Penalized Regression Methods in Machine Learning for Single Cell RNA Sequencing Data

Deciphering the Tissue-Specific Regulatory Role of Intronless Genes Across Cancers

Backmatter

Premium Partner