
2010 | Book

Comparative Genomics

International Workshop, RECOMB-CG 2010, Ottawa, Canada, October 9-11, 2010. Proceedings


About this book

The complexity of genome evolution has given birth to exciting challenges for computational biologists. A wide range of algorithmic, statistical, and mathematical techniques to elucidate the histories of molecules is developed each year, and many are presented at the RECOMB satellite workshop on Comparative Genomics. It is a place where scientists working on all aspects of comparative genomics can share ideas on the development of tools and their application to relevant questions.

This volume contains the papers presented at RECOMB-CG 2010, held on October 9–11 in Ottawa. The field is still flourishing, as seen from the papers presented this year: many developments enrich the combinatorics of genome rearrangements, while gene order phylogenies are becoming more and more accurate, thanks to a mix of combinatorial and statistical principles, associated with rapid and thoughtful heuristics. Several papers refine the models of genome evolution, and more and more genomic events can be modeled, from single nucleotide substitutions in whole genome alignments to large structural mutations or horizontal gene transfers.

Table of Contents

Frontmatter
Genome Aliquoting Revisited
Abstract
We prove that the genome aliquoting problem, the problem of finding a recent polyploid ancestor of a genome, can be solved in polynomial time under the breakpoint distance. We propose an aliquoting algorithm that is a 2-approximation for the genome aliquoting problem under the double cut and join (DCJ) distance, improving upon the previous best solution to this problem, Feijão and Meidanis’ 4-approximation algorithm.
Robert Warren, David Sankoff
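As a concrete illustration of the breakpoint distance mentioned in this abstract, the following sketch (my own simplification: linear unichromosomal signed genomes over the same gene set, no duplicates, which is simpler than the polyploid setting of the paper) counts the adjacencies of one genome, telomeres included, that are not conserved in the other.

```python
def breakpoint_distance(g1, g2):
    """Breakpoint distance between two linear signed genomes over the same
    gene set, no duplicates: the number of adjacencies of g1, including the
    two telomeric ones, that are not conserved in g2."""
    def pairs(g):
        capped = [0] + list(g) + [0]          # cap both telomeres with a 0 marker
        return list(zip(capped, capped[1:]))
    # an adjacency (a, b) is conserved in g2 if g2 contains (a, b) or (-b, -a)
    conserved = set(pairs(g2)) | {(-b, -a) for a, b in pairs(g2)}
    return sum(1 for p in pairs(g1) if p not in conserved)

# Example: inverting the segment (2 3) leaves two breakpoints.
print(breakpoint_distance([1, 2, 3], [1, -3, -2]))   # 2
```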
The Problem of Chromosome Reincorporation in DCJ Sorting and Halving
Abstract
We study two problems in the double cut and join (DCJ) model: sorting (transforming one multilinear genome into another) and halving (transforming a duplicated genome into a perfectly duplicated one). The DCJ model includes rearrangement operations such as reversals, translocations, fusions and fissions. We can also mimic transpositions or block interchanges by two operations: we extract an appropriate segment of a chromosome, creating a temporary circular chromosome, and in the next step we reinsert it in its proper place. Existing linear-time algorithms for both problems ignore the constraint of reincorporating the temporary circular chromosomes immediately after their creation. For the restricted sorting problem only a quadratic algorithm was known, whereas the restricted halving problem was stated as open by Tannier, Zheng, and Sankoff. In this paper we address this constraint and show how to solve the restricted sorting problem in O(n log n) time and the restricted halving problem in O(n^{3/2}) time.
Jakub Kováč, Marília D. V. Braga, Jens Stoye
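A single DCJ step is easy to state on the adjacency representation of a genome, which helps in reading this abstract. The sketch below is not the authors' algorithm; the extremity names and the small example are illustrative only. It cuts two adjacencies and re-pairs their four extremities, which is how a transposition is mimicked in two steps via a temporary circular chromosome.

```python
def apply_dcj(adjacencies, old1, old2, new1, new2):
    """One DCJ: cut the adjacencies old1 and old2 and replace them by new1 and
    new2, which must re-pair the same four gene extremities. Adjacencies are
    frozensets of extremities such as "2t" (tail) and "2h" (head)."""
    assert old1 in adjacencies and old2 in adjacencies
    assert new1 | new2 == old1 | old2
    return (adjacencies - {old1, old2}) | {new1, new2}

# Linear genome 1 2 3 4 (telomeres omitted for brevity): excising the segment
# 2 3 into a temporary circular chromosome is the first of the two DCJs that
# mimic a transposition.
g = {frozenset({"1h", "2t"}), frozenset({"2h", "3t"}), frozenset({"3h", "4t"})}
g = apply_dcj(g,
              frozenset({"1h", "2t"}), frozenset({"3h", "4t"}),
              frozenset({"1h", "4t"}), frozenset({"2t", "3h"}))
# g now contains {1h,4t} (the linear chromosome 1 4) plus {2h,3t} and {2t,3h},
# i.e. the circular chromosome [2 3] awaiting reincorporation.
```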
Advances on Genome Duplication Distances
Abstract
Given a phylogenetic tree involving Whole Genome Duplication events, we contribute to the problem of computing the rearrangement distance on a branch of the tree linking a duplication node d to a speciation node or a leaf s. In the case of a genome G at s containing exactly two copies of each gene, the genome halving problem is to find a perfectly duplicated genome D at d minimizing the rearrangement distance to G. We generalize the existing exact linear-time algorithm for genome halving to the case of a genome G with missing gene copies. In the case of a known ancestral duplicated genome D, we develop a greedy approach for computing the distance between G and D that is shown to be time-efficient and very accurate for both the rearrangement and DCJ distances.
Yves Gagnon, Olivier Tremblay Savard, Denis Bertrand, Nadia El-Mabrouk
Listing All Parsimonious Reversal Sequences: New Algorithms and Perspectives
Abstract
In comparative genomics studies, finding a minimum-length sequence of reversals, so-called sorting by reversals, has been the topic of a huge literature. Since there are many minimum-length sequences, another important topic has been the problem of listing all parsimonious sequences between two genomes, called the All Sorting Sequences by Reversals (ASSR) problem. In this paper, we revisit the ASSR problem for unichromosomal genomes when no duplications are allowed and when the relative order of the genes is known. We put the current body of work in perspective by illustrating the fundamental framework that is common to all of them, a perspective that allows us for the first time to theoretically compare their running times. The paper also proposes an improved framework that empirically speeds up all known algorithms.
Ghada Badr, Krister M. Swenson, David Sankoff
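For toy-sized permutations, the ASSR problem can be solved by exhaustive search, which makes the object being enumerated concrete even though it is nothing like the algorithms compared in the paper. The sketch below does a breadth-first search over signed permutations, records every parent lying on a shortest path, and unwinds all minimum-length reversal sequences.

```python
from collections import deque

def all_sorting_reversal_sequences(perm):
    """Enumerate ALL minimum-length sequences of signed reversals sorting perm
    into the identity. Pure brute force over the whole permutation group, so
    only usable for toy sizes (n <= 5 or so)."""
    perm, identity = tuple(perm), tuple(range(1, len(perm) + 1))
    def neighbours(p):                       # all permutations one reversal away
        for i in range(len(p)):
            for j in range(i, len(p)):
                yield (i, j), p[:i] + tuple(-x for x in reversed(p[i:j + 1])) + p[j + 1:]
    dist, parents = {perm: 0}, {perm: []}
    queue = deque([perm])
    while queue:
        p = queue.popleft()
        if identity in dist and dist[p] >= dist[identity]:
            continue                         # no shorter path can pass through p
        for rev, q in neighbours(p):
            if q not in dist:
                dist[q], parents[q] = dist[p] + 1, [(p, rev)]
                queue.append(q)
            elif dist[q] == dist[p] + 1:
                parents[q].append((p, rev))  # another parent on a shortest path
    def unwind(p):                           # all shortest reversal sequences to p
        if not parents[p]:
            yield []
            return
        for prev, rev in parents[p]:
            for seq in unwind(prev):
                yield seq + [rev]
    return list(unwind(identity)) if identity in dist else []

# Example: sorting (-2, 1) needs two signed reversals; both optimal sequences
# are returned, as lists of (i, j) index pairs.
print(all_sorting_reversal_sequences((-2, 1)))
```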
Ultra-Perfect Sorting Scenarios
Abstract
Perfection has been used as a criterion to select rearrangement scenarios since 2004. However, there is a fundamental bias towards extant species in the original definition: ancestral species are not bound to perfection. Here we develop a new theory of perfection that takes an egalitarian view of species, and apply it to the complex evolution of the mammalian X chromosome.
Aïda Ouangraoua, Anne Bergeron, Krister M. Swenson
On Sorting Genomes with DCJ and Indels
Abstract
In previous work, Braga, Willing and Stoye compared two genomes with unequal content, but without duplications, and presented a new linear-time algorithm to compute the genomic distance, considering double cut and join (DCJ) operations, insertions and deletions. Here we derive from this approach an algorithm to sort one genome into another one, also using DCJ operations, insertions and deletions. The optimal sorting scenarios can have different compositions, and we compare two types of sorting scenarios: one that maximizes and one that minimizes the number of DCJ operations with respect to the number of insertions and deletions.
Marília D. V. Braga
The Zero Exemplar Distance Problem
Abstract
Given two genomes with duplicate genes, Zero Exemplar Distance is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Fertin, Sikora, and Vialette recently proved that Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome, thereby settling an important open question on genome rearrangement in the exemplar model. In this paper, we give a very simple alternative proof of this result. We also study the problem Zero Exemplar Distance for multichromosomal genomes without gene order: from one direction, we show that this problem is NP-hard even if each gene appears at most two times in each genome; from the other direction, we show that this problem admits a polynomial-time algorithm if only one of the two genomes has duplicate genes, and is fixed-parameter tractable if the parameter is the maximum number of chromosomes in each genome.
Minghui Jiang
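The decision problem is compact enough to state directly in code. The brute-force sketch below (exponential in the number of duplicated genes, consistent with the hardness results above; genomes are treated as plain left-to-right sequences of signed genes) enumerates every exemplar of each genome and tests whether the two sets intersect.

```python
from itertools import product

def exemplars(genome):
    """All exemplar reductions of a genome (a sequence of signed genes): keep
    exactly one occurrence of each gene family, preserving the original order."""
    occurrences = {}
    for i, g in enumerate(genome):
        occurrences.setdefault(abs(g), []).append(i)
    for choice in product(*occurrences.values()):
        yield tuple(genome[i] for i in sorted(choice))

def zero_exemplar_distance(g1, g2):
    """True iff some exemplar of g1 equals some exemplar of g2."""
    return bool(set(exemplars(g1)) & set(exemplars(g2)))

# Example: deleting the second 1 in the first genome and the first 1 in the
# second genome reduces both to (1, 2, 3).
print(zero_exemplar_distance((1, 2, 1, 3), (1, 1, 2, 3)))   # True
```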
Scaffold Filling under the Breakpoint Distance
Abstract
Motivated by the trend of genome sequencing without completing the sequence of the whole genomes, Muñoz et al. recently studied the problem of filling an incomplete multichromosomal genome (or scaffold) I with respect to a complete target genome G such that the resulting genomic distance between I′ and G is minimized, where I′ is the corresponding filled scaffold. We call this problem the one-sided scaffold filling problem. In this paper, we follow Muñoz et al. to investigate the scaffold filling problem under the breakpoint distance for the simplest unichromosomal genomes. When the input genome contains no gene repetition (i.e., is a fragment of a permutation), we show that the two-sided scaffold filling problem is polynomially solvable. However, when the input genome contains some genes which appear twice, even the one-sided scaffold filling problem becomes NP-complete. Finally, using the ideas for solving the two-sided scaffold filling problem under the breakpoint distance we show that the two-sided scaffold filling problem under the genomic/rearrangement distance is also polynomially solvable.
Haitao Jiang, Chunfang Zheng, David Sankoff, Binhai Zhu
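To make the one-sided problem tangible, here is a naive greedy sketch; it is not the algorithm of Muñoz et al. nor any algorithm from this paper, and it is not guaranteed to be optimal. It inserts each gene missing from the scaffold, one at a time, at whichever position currently produces the fewest breakpoints against the complete target genome (unsigned genes, no duplicates assumed).

```python
def greedy_fill(scaffold, target):
    """Naive one-sided scaffold filling: insert the genes missing from the
    scaffold, one by one, each at the position minimizing the number of
    breakpoints against the complete target genome. A sketch only."""
    target_adj = set(zip(target, target[1:]))
    target_adj |= {(b, a) for a, b in target_adj}          # unsigned: both orders
    def breakpoints(seq):
        return sum(1 for p in zip(seq, seq[1:]) if p not in target_adj)
    filled = list(scaffold)
    for gene in (g for g in target if g not in scaffold):
        candidates = [filled[:i] + [gene] + filled[i:] for i in range(len(filled) + 1)]
        filled = min(candidates, key=breakpoints)
    return filled

# Example (hypothetical data): the missing genes 3 and 5 are slotted back in.
print(greedy_fill([1, 2, 4, 6], [1, 2, 3, 4, 5, 6]))   # [1, 2, 3, 4, 5, 6]
```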
An Efficient Algorithm for Gene/Species Trees Parsimonious Reconciliation with Losses, Duplications and Transfers
Abstract
Tree reconciliation methods aim at estimating the evolutionary events that cause discrepancy between gene trees and species trees. We provide a discrete computational model that considers duplications, transfers and losses of genes. The model yields a fast and exact algorithm to infer time consistent and most parsimonious reconciliations. Then we study the conditions under which parsimony is able to accurately infer such events. Overall, it performs well even under realistic rates, transfers being in general less accurately recovered than duplications. An implementation is freely available at http://www.atgc-montpellier.fr/MPR .
Jean-Philippe Doyon, Celine Scornavacca, K. Yu. Gorbunov, Gergely J. Szöllősi, Vincent Ranwez, Vincent Berry
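The duplication-transfer-loss model of the paper needs a dated species tree and a more elaborate dynamic program, but the transfer-free core, LCA reconciliation counting duplications, is short enough to sketch. The tree encoding (binary nested tuples) and the leaf-to-species map below are my own illustrative conventions, not those of the MPR software.

```python
def lca_reconciliation(gene_tree, species_tree, species_of):
    """Transfer-free LCA reconciliation sketch: map every gene-tree node to the
    lowest species-tree node containing its genes' species, and flag a node as
    a duplication when its mapping coincides with one of its children's.
    Trees are binary nested tuples with strings at the leaves; species_of maps
    a gene leaf to its species leaf."""
    species_clades = []                      # every species-tree node, as its leaf set
    def collect(t):
        leaves = frozenset([t]) if isinstance(t, str) else collect(t[0]) | collect(t[1])
        species_clades.append(leaves)
        return leaves
    collect(species_tree)
    def species_lca(leaf_set):               # smallest clade containing the leaf set
        return min((c for c in species_clades if leaf_set <= c), key=len)
    duplications = []
    def walk(g):                              # returns the species below gene node g
        if isinstance(g, str):
            return frozenset([species_of[g]])
        left, right = walk(g[0]), walk(g[1])
        mapping = species_lca(left | right)
        if mapping == species_lca(left) or mapping == species_lca(right):
            duplications.append(g)            # the gene split predates the species split
        return left | right
    walk(gene_tree)
    return duplications

# Example: two copies of an A/B gene imply one duplication before the A-B split.
dups = lca_reconciliation((("a1", "b1"), ("a2", "b2")), ("A", "B"),
                          {"a1": "A", "b1": "B", "a2": "A", "b2": "B"})
print(len(dups))   # 1
```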
Detecting Highways of Horizontal Gene Transfer
Abstract
In a horizontal gene transfer (HGT) event a gene is transferred between two species that do not share an ancestor-descendant relationship. Typically, no more than a few genes are horizontally transferred between any two species. However, several studies identified pairs of species between which many different genes were horizontally transferred. Such a pair is said to be linked by a highway of gene sharing. We present a method for inferring such highways. Our method is based on the fact that the evolutionary histories of horizontally transferred genes disagree with the corresponding species phylogeny. Specifically, given a set of gene trees and a trusted rooted species tree, each gene tree is first decomposed into its constituent quartet trees and the quartets that are inconsistent with the species tree are identified. Our method finds a pair of species such that a highway between them explains the largest (normalized) fraction of inconsistent quartets. For a problem on n species, our method requires O(n^4) time, which is optimal with respect to the size of the quartet input. An application of our method to a dataset of 1128 genes from 11 cyanobacterial species, as well as to simulated datasets, illustrates the efficacy of our method.
Mukul S. Bansal, J. Peter Gogarten, Ron Shamir
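The decomposition step described above, breaking a gene tree into quartets and flagging those that contradict the species tree, can be sketched directly; the highway scoring itself is not shown, and the nested-tuple tree encoding is a convenience chosen here, not the authors' input format.

```python
from itertools import combinations

def clusters(tree):
    """All leaf clusters (clades) of a rooted tree given as nested tuples,
    e.g. ((("A", "B"), "C"), "D")."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result, whole = set(), set()
    for child in tree:
        child_clusters = clusters(child)
        result |= child_clusters
        whole |= max(child_clusters, key=len)    # the child's full leaf set
    result.add(frozenset(whole))
    return result

def quartet_topology(cls, four):
    """Unrooted quartet topology induced on four leaves, as a pair of pairs,
    or None if the tree does not resolve it."""
    four = set(four)
    for c in cls:
        inner = c & four
        if len(inner) == 2:
            return frozenset([frozenset(inner), frozenset(four - inner)])
    return None

def inconsistent_quartets(gene_tree, species_tree):
    """Quartets (on the shared taxa) whose gene-tree topology disagrees with
    the species tree; all O(n^4) quartets are examined."""
    gc, sc = clusters(gene_tree), clusters(species_tree)
    shared = max(gc, key=len) & max(sc, key=len)
    bad = []
    for four in combinations(sorted(shared), 4):
        qg, qs = quartet_topology(gc, four), quartet_topology(sc, four)
        if qg and qs and qg != qs:
            bad.append(qg)
    return bad

# Example: the gene tree groups A with C, contradicting the ((A,B),(C,D)) species tree.
print(len(inconsistent_quartets(((("A", "C"), "B"), "D"), (("A", "B"), ("C", "D")))))  # 1
```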
On Exploring Genome Rearrangement Phylogenetic Patterns
Abstract
The study of genome rearrangement is much harder than the corresponding problems on DNA and protein sequences, because of the occurrence of numerous combinatorial structures. By explicitly exploring these combinatorial structures, the recently developed adequate subgraph theory shows that a family of these structures, adequate subgraphs, are informative in finding the optimal solutions to the rearrangement median problem. Its extension gives rise to the tree scoring method GASTS, which provides quick and accurate estimation of the number of rearrangement events for any given topology. With a similar motivation, this paper discusses and provides solid but preliminary results on combinatorial structures that are informative in phylogenetic inference. These structures, called rearrangement phylogenetic patterns, provide more insight than algorithmic approaches, and may provide statistical significance for inferred phylogenies and lead to efficient and robust phylogenetic inference methods on large sets of taxa.
We explore rearrangement phylogenetic patterns with respect to both the breakpoint distance and the DCJ distance. The latter has a simple formulation and well approximates other edit distances. On four genomes, we prove that a contrasting shared adjacency, where a gene forms one adjacency in two genomes and a different adjacency in the other two genomes, is a rearrangement phylogenetic pattern. Phylogenetic inferences based on the numbers of this pattern are very accurate and robust against short internal edges, as tested on 55,000 datasets simulated by random inversions. Further analysis shows that the numbers of this pattern explain well the variations in the number of rearrangement events over different topologies.
Andrew Wei Xu
Fast and Accurate Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Novel Robustness Estimator
Abstract
The rapid accumulation of whole-genome data has renewed interest in the study of genomic rearrangements. Comparative genomics, evolutionary biology, and cancer research all require models and algorithms to elucidate the mechanisms, history, and consequences of these rearrangements. However, even simple models lead to NP-hard problems, particularly in the area of phylogenetic analysis. Current approaches are limited to small collections of genomes and low-resolution data (typically a few hundred syntenic blocks). Moreover, whereas phylogenetic analyses from sequence data are deemed incomplete unless bootstrapping scores (a measure of confidence) are given for each tree edge, no equivalent to bootstrapping exists for rearrangement-based phylogenetic analysis.
We describe a fast and accurate algorithm for rearrangement analysis that scales up, in both time and accuracy, to modern high-resolution genomic data. We also describe a novel approach to estimate the robustness of results—an equivalent to the bootstrapping analysis used in sequence-based phylogenetic reconstruction. We present the results of extensive testing on both simulated and real data showing that our algorithm returns very accurate results, while scaling linearly with the size of the genomes and cubically with their number. We also present extensive experimental results showing that our approach to robustness testing provides excellent estimates of confidence, which, moreover, can be tuned to trade off thresholds between false positives and false negatives. Together, these two novel approaches enable us to attack heretofore intractable problems, such as phylogenetic inference for high-resolution vertebrate genomes, as we demonstrate on a set of six vertebrate genomes with 8,380 syntenic blocks.
Availability: a copy of the software is available on demand.
Yu Lin, Vaibhav Rajan, Bernard M. E. Moret
A Simple Measure of the Dynamics of Segmented Genomes: An Application to Influenza
Abstract
The severity of influenza epidemics, which can potentially become pandemics, has been very difficult to predict. However, past efforts focused on gene-by-gene approaches, even though it is acknowledged that whole-genome dynamics contribute to the severity of an epidemic. Here, putting this rationale into action, I describe a simple measure of the amount of reassortment that affects influenza at a genomic scale during a particular year. The analysis of 530 complete genomes of the H1N1 subtype, sampled over eleven years, shows that the proposed measure explains 58% of the variance in the prevalence of H1 influenza in the US population. The proposed measure, denoted nRF, could therefore improve influenza surveillance programs at a minimal cost.
Stéphane Aris-Brosou
Novel Definition and Algorithm for Chaining Fragments with Proportional Overlaps
Abstract
Chaining fragments is a crucial step in genome alignment. Existing chaining algorithms compute a maximum weighted chain with no overlaps allowed between adjacent fragments. In practice, using local alignments as fragments, instead of MEMs, generates frequent overlaps between fragments, due to combinatorial reasons and biological factors such as variable tandem repeat structures that differ in number of copies between genomic sequences. In this paper, to overcome this limitation, we formulate a novel definition of a chain, allowing overlaps proportional to the fragments' lengths, and exhibit an efficient algorithm for computing such a maximum weighted chain. We tested our algorithm on a dataset of 694 genome pairs and observed significant improvements in coverage, while keeping the running times within reasonable limits.
Raluca Uricaru, Alban Mancheron, Eric Rivals
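The following quadratic dynamic program is a simplified reading of the chaining problem with proportional overlaps; the paper defines the overlap constraint precisely and gives a much more efficient algorithm, so treat the tolerance used here (a fraction alpha of the shorter of the two fragments, on each genome) as an assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    x1: int; x2: int      # interval on genome 1, x1 < x2
    y1: int; y2: int      # interval on genome 2, y1 < y2
    weight: float

def max_chain_score(fragments, alpha=0.1):
    """Naive O(n^2) dynamic program for a maximum-weight chain in which two
    consecutive fragments may overlap, on each genome, by at most a fraction
    alpha of the shorter of the two fragments."""
    frags = sorted(fragments, key=lambda f: (f.x1, f.y1))
    def can_follow(f, g):                     # may g come right after f?
        if g.x1 < f.x1 or g.y1 < f.y1 or g.x2 < f.x2 or g.y2 < f.y2:
            return False                      # g must extend f on both genomes
        limit_x = alpha * min(f.x2 - f.x1, g.x2 - g.x1)
        limit_y = alpha * min(f.y2 - f.y1, g.y2 - g.y1)
        return max(0, f.x2 - g.x1) <= limit_x and max(0, f.y2 - g.y1) <= limit_y
    best = [f.weight for f in frags]          # best chain score ending at each fragment
    for j in range(len(frags)):
        for i in range(j):
            if can_follow(frags[i], frags[j]):
                best[j] = max(best[j], best[i] + frags[j].weight)
    return max(best, default=0.0)
```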
Assessing the Robustness of Complete Bacterial Genome Segmentations
Abstract
Comparison of closely related bacterial genomes has revealed the presence of highly conserved sequences forming a “backbone” that is interrupted by numerous, less conserved, DNA fragments. Segmentation of bacterial genomes into backbone and variable regions is particularly useful to investigate bacterial genome evolution. Several software tools have been designed to compare complete bacterial chromosomes, and a few online databases store pre-computed genome comparisons. However, very few statistical methods are available to evaluate the reliability of these software tools and to compare the results obtained with them. To fill this gap, we have developed two local scores to measure the robustness of bacterial genome segmentations. Our method uses a simulation procedure based on random perturbations of the compared genomes. The scores presented in this paper are simple to implement, and our results show that they make it easy to discriminate between robust and non-robust bacterial genome segmentations when using aligners such as MAUVE and MGA.
Hugo Devillers, Hélène Chiapello, Sophie Schbath, Meriem El Karoui
An Algorithm to Solve the Motif Alignment Problem for Approximate Nested Tandem Repeats
Abstract
An approximate nested tandem repeat (NTR) in a string T is a complex repetitive structure consisting of many approximate copies of two substrings x and X (“motifs”) interspersed with one another. NTRs have been found in real DNA sequences and are expected to have applications for evolutionary studies, both as a tool to understand concerted evolution, and as a potential marker in population studies.
In this paper we describe software tools developed for database searches for NTRs. After a first program, NTRFinder, identifies putative NTR motifs, a confirmation step requires aligning the putative NTR against exact NTRs built from the putative template motifs x and X. We describe an algorithm to solve this alignment problem in O(|T|(|x| + |X|)) space and time. Our alignment algorithm is based on Fischetti et al.’s wrap-around dynamic programming.
Atheer A. Matroud, Michael D. Hendy, Christopher P. Tuffley
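Wrap-around dynamic programming for a single motif is the building block behind the technique cited above and can be sketched compactly; the nested case with two interleaved motifs x and X handled in the paper is more involved and is not attempted here. The cost model (unit mismatches and gaps) is my own choice.

```python
def align_to_tandem(T, x, sub=lambda a, b: 0 if a == b else 1, gap=1):
    """Wrap-around dynamic programming (in the spirit of Fischetti et al.):
    minimum edit cost of aligning T against x^k for some k >= 0, i.e. against
    an arbitrary number of tandem copies of the single motif x, in O(|T||x|)
    time. Only the single-motif building block is sketched here."""
    n, m = len(T), len(x)
    # prev[j] = cost of aligning the processed text prefix with (full copies of x) + x[:j]
    prev = [j * gap for j in range(m + 1)]              # row for the empty text prefix
    for i in range(1, n + 1):
        cur = [min(i * gap, prev[m] + gap)] + [0] * m   # column 0, before the wrap
        for _ in range(2):                              # two passes resolve the wrap
            for j in range(1, m + 1):
                cur[j] = min(prev[j - 1] + sub(T[i - 1], x[j - 1]),  # (mis)match
                             prev[j] + gap,                          # skip text char
                             cur[j - 1] + gap)                       # skip motif char
            cur[0] = min(cur[0], cur[m])                # wrap: a full copy ends here
        prev = cur
    return prev[0]

# Example: two exact copies of "ab" align to the motif "ab" at zero cost.
print(align_to_tandem("abab", "ab"))   # 0
```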
Limited Lifespan of Fragile Regions in Mammalian Evolution
Abstract
An important question in genome evolution is whether there exist fragile regions (rearrangement hotspots) where chromosomal rearrangements are happening over and over again. Although nearly all recent studies supported the existence of fragile regions in mammalian genomes, the most comprehensive phylogenomic study of mammals (Ma et al. (2006) Genome Research 16, 1557-1565) raised some doubts about their existence. We demonstrate that fragile regions are subject to a “birth and death” process, implying that fragility has a limited evolutionary lifespan. This finding implies that fragile regions migrate to different locations in different mammals, explaining why only a few chromosomal breakpoints are shared between different lineages. The birth and death of fragile regions reinforces the hypothesis that rearrangements are promoted by matching segmental duplications and suggests putative locations of the currently active fragile regions in the human genome.
Max A. Alekseyev, Pavel A. Pevzner
Mapping Association between Long-Range Cis-Regulatory Regions and Their Target Genes Using Comparative Genomics
Abstract
In chordates, long-range cis-regulatory regions are involved in the control of transcription initiation (either as repressors or enhancers). They can be located as far as 1 Mb from the transcription start site of the target gene and can regulate more than one gene. Therefore, proper characterization of functional interactions between long-range cis-regulatory regions and their target genes remains problematic. We present a novel method to predict such interactions based on the analysis of rearrangements between the human and 16 other vertebrate genomes. Our method is based on the assumption that genome rearrangements that would disrupt the functional interaction between a cis-regulatory region and its target gene are likely to be deleterious. Therefore, conservation of synteny through evolution would be an indication of a functional interaction. We use our algorithm to classify a set of 1,406,084 putative associations from the human genome. This genome-wide map of interactions has many potential applications, including the selection of candidate regions prior to in vivo experimental characterization, a better characterization of regulatory regions involved in position effect diseases, and an improved understanding of the mechanisms and importance of long-range regulation.
Emmanuel Mongin, Ken Dewar, Mathieu Blanchette
A New Genomic Evolutionary Model for Rearrangements, Duplications, and Losses That Applies across Eukaryotes and Prokaryotes
Abstract
Background: Genomic rearrangements have been studied since the beginnings of modern genetics and models for such rearrangements have been the subject of many papers over the last 10 years. However, none of the extant models can predict the evolution of genomic organization into circular unichromosomal genomes (as in most prokaryotes) and linear multichromosomal genomes (as in most eukaryotes). Very few of these models support gene duplications and losses—yet these events may be more common in evolutionary history than rearrangements and themselves cause apparent rearrangements.
Results: We propose a new evolutionary model that integrates gene duplications and losses with genome rearrangements and that leads to genomes with either one (or a very few) circular chromosome or a collection of linear chromosomes. Moreover, our model predictions fit observations about the evolution of gene family sizes as well as existing predictions about the growth in the number of chromosomes in eukaryotic genomes. Finally, our model is based on the existing inversion/translocation models and inherits their linear-time algorithm for pairwise distance computation.
Yu Lin, Bernard M. E. Moret
A Simulation Tool for the Study of Symmetric Inversions in Bacterial Genomes
Abstract
We present the tool SIB, which simulates genomic inversions in bacterial chromosomes. The tool simulates symmetric inversions but allows the appearance of nonsymmetric inversions by simulating small syntenic blocks frequently observed in bacterial genome comparisons. We evaluate SIB by comparing its results to real genome alignments. We develop measures that allow quantitative comparisons between real pairwise alignments (in terms of dotplots) and simulated ones. These measures allow an evaluation of SIB in terms of dendrograms. We evaluate SIB by comparing its results to whole-chromosome alignments and maximum likelihood trees for three bacterial groups (the Pseudomonadaceae family and the Xanthomonas and Shewanella genera). We demonstrate an application of SIB by using it to evaluate the ancestral genome reconstruction tool MGR.
Ulisses Dias, Zanoni Dias, João C. Setubal
Consistency of Sequence-Based Gene Clusters
Abstract
In comparative genomics, various combinatorial models can be used to specify gene clusters — groups of genes that are co-located in a set of genomes. Several approaches have been proposed to reconstruct putative ancestral gene clusters based on the gene order of contemporary species. One prevalent and natural reconstruction criterion is consistency: For a set of reconstructed gene clusters, there should exist a gene order that comprises all given clusters.
In this paper, we discuss the consistency problem for different gene cluster models on sequences with restricted gene multiplicities. Our results range from linear-time algorithms for the simple model of adjacencies to NP-completeness for more complex models such as common intervals.
Roland Wittler, Jens Stoye
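For the simplest model mentioned in the abstract, adjacencies over genes that each occur at most once, with only linear chromosomes allowed, consistency reduces to a degree condition plus an acyclicity check. The sketch below covers that easy case only; the common-intervals models discussed in the paper are where the hardness lies.

```python
from collections import defaultdict

def adjacencies_consistent(adjacencies):
    """Consistency test for the simplest setting: every gene occurs at most
    once and only linear chromosomes are allowed. A set of adjacencies
    (unordered gene pairs) is then realizable by some gene order iff no gene
    lies in more than two adjacencies and the adjacency graph is acyclic."""
    degree = defaultdict(int)
    for a, b in adjacencies:
        degree[a] += 1
        degree[b] += 1
    if any(d > 2 for d in degree.values()):
        return False
    parent = {g: g for g in degree}           # union-find for the cycle check
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in adjacencies:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                      # this adjacency would close a cycle
        parent[ra] = rb
    return True

# (a,b),(b,c) is realizable as the order a b c; adding (c,a) would force a circle.
print(adjacencies_consistent({("a", "b"), ("b", "c")}))              # True
print(adjacencies_consistent({("a", "b"), ("b", "c"), ("c", "a")}))  # False
```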
Efficient Computation of Approximate Gene Clusters Based on Reference Occurrences
Abstract
Whole genome comparison based on the analysis of gene cluster conservation has become a popular approach in comparative genomics. While gene order and gene content as a whole randomize over time, it is observed that certain groups of genes which are often functionally related remain co-located across species. However, the conservation is usually not perfect which turns the identification of these structures, often referred to as approximate gene clusters, into a challenging task. In this paper, we present a polynomial time algorithm that computes approximate gene clusters based on reference occurrences. We show that our approach yields highly comparable results to a more general approach and allows for approximate gene cluster detection in parameter ranges currently not feasible for non-reference based approaches.
Katharina Jahn
The Complexity of the Gapped Consecutive-Ones Property Problem for Matrices of Bounded Maximum Degree
Abstract
The Gapped Consecutive-Ones Property (C1P) Problem, or the (k,δ)-C1P Problem is: given a binary matrix M and integers k and δ, decide if the columns of M can be ordered such that each row contains at most k blocks of 1’s, and no two neighboring blocks of 1’s are separated by a gap of more than δ 0’s. This problem was introduced in [3]. The classical polynomial-time solvable C1P Problem is equivalent to the (1,0)-C1P problem. It has been shown that for every unbounded or bounded k ≥ 2 and unbounded or bounded δ ≥ 1, except when (k,δ) = (2,1), the (k,δ)-C1P Problem is NP-complete [10,6].
In this paper we study the Gapped C1P Problem with a third parameter d, namely the bound on the maximum number of 1’s in any row of M, or the bound on the maximum degree of M. This is motivated by problems in comparative genomics and paleogenomics, where the genome data is often sparse [4]. The (d,k,δ)-C1P Problem has been shown to be polynomial-time solvable when all three parameters are fixed [3]. Since fixing d also fixes k (k ≤ d), the only case left to consider is the case when δ is unbounded, or the (d,k,∞)-C1P Problem. Here we show that for every d > k ≥ 2, the (d,k,∞)-C1P Problem is NP-complete.
Ján Maňuch, Murray Patterson
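Verifying the gapped condition for one fixed column order is the easy part; the hardness results above concern finding such an order. The sketch below simply checks, row by row, that a given binary matrix in a given column order has at most k blocks of 1's and gaps of at most δ 0's between neighbouring blocks.

```python
def satisfies_gapped_c1p(matrix, k, delta):
    """Check whether a binary matrix, with its columns in the given order,
    satisfies the (k, delta)-C1P condition: every row has at most k blocks of
    1's, and neighbouring blocks are separated by at most delta 0's."""
    for row in matrix:
        blocks = []                      # (start, end) index pairs of 1-blocks
        j = 0
        while j < len(row):
            if row[j] == 1:
                start = j
                while j < len(row) and row[j] == 1:
                    j += 1
                blocks.append((start, j - 1))
            else:
                j += 1
        if len(blocks) > k:
            return False
        for (s1, e1), (s2, e2) in zip(blocks, blocks[1:]):
            if s2 - e1 - 1 > delta:
                return False
    return True

# Example: the single row below has two blocks separated by two 0's.
print(satisfies_gapped_c1p([[1, 1, 0, 0, 1]], k=2, delta=2))   # True
print(satisfies_gapped_c1p([[1, 1, 0, 0, 1]], k=2, delta=1))   # False
```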
An Approximation Algorithm for Computing a Parsimonious First Speciation in the Gene Duplication Model
Abstract
We consider the following problem: given a forest of gene family trees on a set of genomes, find a first speciation which splits these genomes into two subsets and minimizes the number of gene duplications that happened before this speciation. We call this problem the Minimum Duplication Bipartition Problem. Using a generalization of the Minimum Edge-Cut Problem, known as Submodular Function Minimization, we propose a polynomial time and space 2-approximation algorithm for the Minimum Duplication Bipartition Problem. We illustrate the potential of this algorithm on both synthetic and real data.
Aïda Ouangraoua, Krister M. Swenson, Cedric Chauve
Backmatter
Metadata
Title
Comparative Genomics
Editor
Eric Tannier
Copyright Year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-16181-0
Print ISBN
978-3-642-16180-3
DOI
https://doi.org/10.1007/978-3-642-16181-0