
2014 | Book

Research in Computational Molecular Biology

18th Annual International Conference, RECOMB 2014, Pittsburgh, PA, USA, April 2-5, 2014, Proceedings


About this book

This book constitutes the refereed proceedings of the 18th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2014, held in Pittsburgh, PA, USA, in April 2014. The 35 extended abstracts were carefully reviewed and selected from 154 submissions. They report on original research in all areas of computational molecular biology and bioinformatics.

Table of Contents

Frontmatter
Tractatus: An Exact and Subquadratic Algorithm for Inferring Identical-by-Descent Multi-shared Haplotype Tracts

In this work we present graph-theoretic algorithms for the identification of all identical-by-descent (IBD) multi-shared haplotype tracts in an m × n haplotype matrix. We introduce Tractatus, an exact algorithm for computing all IBD haplotype tracts in time linear in the size of the input, O(mn). Tractatus resolves a long-standing open problem, optimally breaking the (worst-case) quadratic time barrier of O(m²n) of previous methods, often cited as a bottleneck in haplotype analysis of genome-wide association study-sized data. This advance in algorithmic efficiency has an impact on a number of areas of population genomics rooted in the seminal Li-Stephens framework for modeling multi-locus linkage disequilibrium (LD) patterns, with applications to the estimation of recombination rates, imputation, haplotype-based LD mapping, and haplotype phasing. We extend the Tractatus algorithm to include computation of haplotype tracts with allele mismatches and shared homozygous haplotypes in a set of genotypes. Lastly, we present a comparison of algorithmic runtime, power to infer IBD tracts, and false positive rates for simulated data, and computations of homozygous haplotypes in genome-wide association study data of autism. The Tractatus algorithm is available for download at http://www.brown.edu/Research/Istrail_Lab/.

Derek Aguiar, Eric Morrow, Sorin Istrail
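
To make the linear-time flavor concrete, here is a minimal Python sketch of divisive tract grouping. It is an illustration only, not the authors' Tractatus algorithm: groups of haplotypes that remain identical over a growing window are split whenever their alleles diverge, so each haplotype is assigned to exactly one group per column, O(mn) work in total.

```python
# Illustrative sketch only (not the Tractatus algorithm itself): track groups of
# haplotypes that stay identical over a growing window; split when alleles diverge.
def shared_tracts(H, min_len=2):
    """H: list of m equal-length haplotype strings; yields (start, end, members)."""
    m, n = len(H), len(H[0])
    groups = [(0, list(range(m)))]           # (window start, member indices)
    tracts = []
    for j in range(n):
        next_groups = []
        for start, members in groups:
            by_allele = {}
            for i in members:
                by_allele.setdefault(H[i][j], []).append(i)
            if len(by_allele) == 1:          # group still identical; window grows
                next_groups.append((start, members))
                continue
            if len(members) >= 2 and j - start >= min_len:
                tracts.append((start, j, tuple(members)))   # shared run ends here
            for sub in by_allele.values():
                # simplification: subgroups restart at j; an exact algorithm would
                # extend each subgroup's tract back toward the parent's start
                next_groups.append((j, sub))
        groups = next_groups
    for start, members in groups:            # flush runs reaching the last column
        if len(members) >= 2 and n - start >= min_len:
            tracts.append((start, n, tuple(members)))
    return tracts
```
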
HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data

Using standard genotype calling tools, it is possible to accurately identify the number of “wild type” and “mutant” alleles (A, C, G, or T) for each single-nucleotide polymorphism (SNP) site. In the case of two heterozygous SNP sites, however, genotype calling tools cannot determine whether “mutant” alleles from different SNP loci are on the same or different chromosomes. While in many cases the former would be healthy, the latter can cause loss of function; it is therefore important to identify the phase, the copies of a chromosome on which the mutant alleles occur, in addition to the genotype. This necessitates efficient algorithms to obtain an accurate and comprehensive haplotype reconstruction (the phase of heterozygous SNPs in the genome) directly from next-generation sequencing (NGS) read data. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes of higher ploidy; however, computational investigations into polyploid genomes carry great importance, impacting plant, yeast, and fish genomics, as well as studies into the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes.

Emily Berger, Deniz Yorukoglu, Jian Peng, Bonnie Berger
Changepoint Analysis for Efficient Variant Calling

We present CAGe, a statistical algorithm which exploits high sequence identity between sampled genomes and a reference assembly to streamline the variant calling process. Using a combination of changepoint detection, classification, and online variant detection, CAGe is able to call simple variants quickly and accurately on the 90-95% of a sampled genome which differs little from the reference, while correctly learning the remaining 5-10% that must be processed using more computationally expensive methods. CAGe runs on a deeply sequenced human whole genome sample in approximately 20 minutes, potentially reducing the burden of variant calling by an order of magnitude after one memory-efficient pass over the data.

Adam Bloniarz, Ameet Talwalkar, Jonathan Terhorst, Michael I. Jordan, David Patterson, Bin Yu, Yun S. Song
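
For readers unfamiliar with changepoint detection, the following is a generic binary-segmentation sketch on a per-base signal such as coverage; CAGe's actual statistics, classifier, and online setting are more involved and are not reproduced here.

```python
# Generic mean-shift changepoint detection via binary segmentation (illustration
# only): pick the split that most reduces within-segment squared error, recurse
# while the gain exceeds a penalty.
import numpy as np

def sse(x):
    return ((x - x.mean()) ** 2).sum() if len(x) else 0.0

def changepoints(x, penalty):
    best_gain, best_k = 0.0, None
    total = sse(x)
    for k in range(1, len(x)):
        gain = total - sse(x[:k]) - sse(x[k:])
        if gain > best_gain:
            best_gain, best_k = gain, k
    if best_k is None or best_gain <= penalty:
        return []
    left = changepoints(x[:best_k], penalty)
    right = [best_k + c for c in changepoints(x[best_k:], penalty)]
    return left + [best_k] + right

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 100),
                    np.random.default_rng(2).normal(5, 1, 100)])
print(changepoints(x, penalty=25.0))   # roughly [100]
```
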
On the Representation of de Bruijn Graphs

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitations of these types of approaches. We further design and implement a general data structure (dbgfm) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of dbgfm, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use dbgfm.

Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T. Simpson, Paul Medvedev
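
The notion of frequency-based minimizers can be illustrated in a few lines. This is a sketch of the idea only; the paper's exact scheme and tie-breaking may differ.

```python
# Sketch of frequency-based minimizers: rank l-mers by how often they occur in
# the data and let each k-mer pick its rarest l-mer, so that minimizer buckets
# stay balanced (rare l-mers cannot dominate many k-mers).
from collections import Counter

def lmer_counts(reads, l):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - l + 1):
            counts[r[i:i + l]] += 1
    return counts

def freq_minimizer(kmer, l, counts):
    # rarest l-mer wins; ties broken lexicographically
    return min((kmer[i:i + l] for i in range(len(kmer) - l + 1)),
               key=lambda s: (counts[s], s))

reads = ["ACGTACGTGA", "CGTACGTGAC"]
counts = lmer_counts(reads, 3)
print(freq_minimizer("ACGTACGT", 3, counts))
```
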
Exact Learning of RNA Energy Parameters from Structure

We consider the problem of exact learning of parameters of a linear RNA energy model from secondary structure data. A necessary and sufficient condition for learnability of the parameters is derived, based on computing the convex hull of the union of translated Newton polytopes of the input sequences [15]. The set of learned energy parameters is characterized as the convex cone generated by the normal vectors to those facets of the resulting polytope that are incident to the origin. In practice, the sufficient condition may not be satisfied by the entire training data set; hence, computing a maximal subset of the training data for which the sufficient condition is satisfied is often desired. We show that this problem is NP-hard in general for an arbitrary dimensional feature space. Using a randomized greedy algorithm, we select a subset of the RNA STRAND v2.0 database that satisfies the sufficient condition for a base pair counting model with separate A-U, C-G, and G-U parameters. The set of learned energy parameters includes the experimentally measured energies of A-U, C-G, and G-U pairs; hence, our parameter set is in agreement with the Turner parameters.

Hamidreza Chitsaz, Mohammad Aminisharifabad
An Alignment-Free Regression Approach for Estimating Allele-Specific Expression Using RNA-Seq Data

RNA-seq technology enables large-scale studies of allele-specific expression (ASE), or the expression difference between maternal and paternal alleles. Here, we study ASE in animals for which parental RNA-seq data are available. While most methods for determining ASE rely on read alignment, read alignment either leads to reference bias or requires knowledge of genomic variants in each parental strain. When RNA-seq data are available for both parental strains of a hybrid animal, it is possible to infer ASE with minimal reference bias and without knowledge of parental genomic variants. Our approach first uses parental RNA-seq reads to discover maternal and paternal versions of transcript sequences. Using these alternative transcript sequences as features, we estimate abundance levels of transcripts in the hybrid animal using a modified lasso linear regression model.

We tested our methods on synthetic data from the mouse transcriptome and compared our results with those of Trinity, a state-of-the-art de novo RNA-seq assembler. Our methods achieved high sensitivity and specificity in both identifying expressed transcripts and transcripts exhibiting ASE. We also ran our methods on real RNA-seq mouse data from two F1 samples with wild-derived parental strains and were able to validate known genes exhibiting ASE, as well as confirm the expected maternal contribution ratios in all genes and genes on the X chromosome.

Chen-Ping Fu, Vladimir Jojic, Leonard McMillan
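
A hedged sketch of the regression step follows. The design matrix, signature features, and penalty value below are all hypothetical toy choices, but the positivity-constrained lasso mirrors the modified-lasso idea in the abstract.

```python
# Illustration (not the authors' implementation): columns of A are per-allele
# transcript "signatures" (e.g., k-mer compositions), b is the observed k-mer
# count vector from the hybrid, and a positivity-constrained lasso estimates
# allele-specific abundances.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
A = rng.poisson(5.0, size=(200, 6)).astype(float)   # k-mer x transcript design
true = np.array([3.0, 1.0, 0.0, 0.0, 2.0, 2.0])     # maternal/paternal abundances
b = A @ true + rng.normal(0, 1.0, 200)

model = Lasso(alpha=0.1, positive=True)             # abundances cannot be negative
model.fit(A, b)
print(np.round(model.coef_, 2))                      # approximately recovers `true`
```
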
The Generating Function Approach for Peptide Identification in Spectral Networks

Tandem mass (MS/MS) spectrometry has become the method of choice for protein identification and has launched a quest for the identification of every translated protein and peptide. However, computational developments have lagged behind the pace of modern data acquisition protocols and have become a major bottleneck in proteomics analysis of complex samples. As it stands today, attempts to identify MS/MS spectra against large databases (e.g., the human microbiome or 6-frame translation of the human genome) face a search space that is 10-100 times larger than the human proteome, where it becomes increasingly challenging to separate true from false peptide matches. As a result, the sensitivity of current state-of-the-art database search methods drops by nearly 38%, to identification rates so low that almost 90% of all MS/MS spectra are left unidentified. We address this problem by extending the generating function approach to rigorously compute the joint spectral probability of multiple spectra being matched to peptides with overlapping sequences, thus enabling the confident assignment of higher significance to overlapping peptide-spectrum matches (PSMs). We find that these joint spectral probabilities can be several orders of magnitude more significant than individual PSMs, even in the ideal case when perfect separation between signal and noise peaks could be achieved per individual MS/MS spectrum. After benchmarking this approach on a typical lysate MS/MS dataset, we show that the proposed intersecting spectral probabilities for spectra from overlapping peptides improve peptide identification by 30-62%.

Adrian Guthals, Christina Boucher, Nuno Bandeira
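
The single-spectrum generating function that the paper generalizes can be sketched as a dynamic program over prefix masses. The uniform residue model and truncated amino-acid alphabet below are simplifications for illustration only.

```python
# Sketch of a generating-function spectral probability DP (single spectrum).
# dp[m][s] = probability that a random residue sequence reaching nominal mass m
# accumulates score s; the spectral probability is the score tail at the
# precursor mass. Alphabet truncated for brevity.
from collections import defaultdict

AA = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}   # nominal residue masses

def spectral_probability(peak_score, precursor_mass, threshold):
    # peak_score: dict mass -> score contributed when a prefix ends at that mass
    dp = defaultdict(lambda: defaultdict(float))
    dp[0][0] = 1.0
    p = 1.0 / len(AA)                                 # uniform residue model
    for m in range(1, precursor_mass + 1):
        bonus = peak_score.get(m, 0)
        for a in AA.values():
            if a <= m:
                for s, prob in dp[m - a].items():
                    dp[m][s + bonus] += p * prob
    return sum(prob for s, prob in dp[precursor_mass].items() if s >= threshold)
```
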
Decoding Coalescent Hidden Markov Models in Linear Time

In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computational resources. We also apply the method to data from the 1000 Genomes Project, inferring a high-resolution history of size changes in the European population.

Kelley Harris, Sara Sheehan, John A. Kamm, Yun S. Song
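
The quadratic-to-linear speedup can be illustrated with the simplest special case, a rank-one (Li-Stephens style) transition structure; the paper exploits a richer coalescent structure not shown here.

```python
# Illustration: if from state i the walk stays with probability 1-r or jumps to
# state j with probability r*pi[j], the forward update needs only the running
# total of f, i.e. O(K) instead of O(K^2) per locus. No rescaling is done here,
# so this sketch is for short sequences only.
import numpy as np

def forward_rank_one(emissions, pi, r):
    # emissions: T x K array of emission likelihoods; pi: jump distribution
    T, K = emissions.shape
    f = pi * emissions[0]
    for t in range(1, T):
        f = ((1.0 - r) * f + r * f.sum() * pi) * emissions[t]
    return f.sum()    # likelihood of the observation sequence
```
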
AptaCluster – A Method to Cluster HT-SELEX Aptamer Pools and Lessons from Its Application

Systematic Evolution of Ligands by EXponential Enrichment (SELEX) is a well-established experimental procedure to identify aptamers, synthetic single-stranded (ribo)nucleic acid molecules that bind to a given molecular target. Recently, new sequencing technologies have revolutionized the SELEX protocol by allowing for deep sequencing of the selection pools after each cycle. The emergence of High Throughput SELEX (HT-SELEX) has opened the field to new computational opportunities and challenges that are yet to be addressed. To aid the analysis of HT-SELEX results and to advance the understanding of the selection process itself, we developed AptaCluster. This algorithm allows for an efficient clustering of whole HT-SELEX aptamer pools, a task that could not be accomplished with traditional clustering algorithms due to the enormous size of such datasets. We performed HT-SELEX with Interleukin 10 receptor alpha chain (IL-10RA) as the target molecule and used AptaCluster to analyze the resulting sequences. AptaCluster allowed for the first survey of the relationships between sequences in different selection rounds and revealed previously unappreciated properties of the SELEX protocol. As the first tool of this kind, AptaCluster enables novel ways to analyze and to optimize the HT-SELEX procedure. Our AptaCluster algorithm is available as a very fast multiprocessor implementation upon request.

Jan Hoinka, Alexey Berezhnoy, Zuben E. Sauna, Eli Gilboa, Teresa M. Przytycka
Learning Sequence Determinants of Protein:Protein Interaction Specificity with Sparse Graphical Models

In studying the strength and specificity of interaction between members of two protein families, key questions center on which pairs of possible partners actually interact, how well they interact, and why they interact while others do not. The advent of large-scale experimental studies of interactions between members of a target family and a diverse set of possible interaction partners offers the opportunity to address these questions. We develop here a method, DgSpi (Data-driven Graphical models of Specificity in Protein:protein Interactions), for learning and using graphical models that explicitly represent the amino acid basis for interaction specificity (why) and extend earlier classification-oriented approaches (which) to predict the ΔG of binding (how well). We demonstrate the effectiveness of our approach in analyzing and predicting interactions between a set of 82 PDZ recognition modules, against a panel of 217 possible peptide partners, based on data from MacBeath and colleagues. Our predicted ΔG values are highly predictive of the experimentally measured ones, reaching correlation coefficients of 0.69 in 10-fold cross-validation and 0.63 in leave-one-PDZ-out cross-validation. Furthermore, the model serves as a compact representation of amino acid constraints underlying the interactions, enabling protein-level ΔG predictions to be naturally understood in terms of residue-level constraints. Finally, the model, DgSpi, readily enables the design of new interacting partners, and we demonstrate that designed ligands are novel and diverse.

Hetunandan Kamisetty, Bornika Ghosh, Christopher James Langmead, Chris Bailey-Kellogg
On Sufficient Statistics of Least-Squares Superposition of Vector Sets

Superposition by orthogonal transformation of vector sets by minimizing the least-squares error is a fundamental task in many areas of science, notably in structural molecular biology. Its widespread use for structural analyses is facilitated by exact solutions of this problem, computable in linear time. However, in several of these analyses it is common to invoke this superposition routine a very large number of times, often operating (through addition or deletion) on previously superposed vector sets. This paper derives a set of sufficient statistics for the least-squares orthogonal transformation problem. These sufficient statistics are additive. This property allows for the superposition parameters (rotation, translation, and root mean square deviation) to be computable as constant-time updates from the statistics of partial solutions. We demonstrate that this results in a massive speed-up in the computational effort, when compared to the method that recomputes superpositions ab initio. Among others, protein structural alignment algorithms stand to benefit from our results.

Arun S. Konagurthu, Parthan Kasarapu, Lloyd Allison, James H. Collier, Arthur M. Lesk
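
A minimal NumPy sketch of the additive-statistics idea follows, assuming 3-D coordinates and the standard SVD-based (Kabsch) solution. Merging two partial statistics is a constant-time operation; the paper's exact derivation may organize the statistics differently.

```python
# Sufficient statistics for least-squares superposition of paired vector sets:
# (n, sum x, sum y, sum x y^T, sum |x|^2, sum |y|^2) determine the optimal
# rotation, translation, and RMSD, and they simply add across subsets.
import numpy as np

def stats(X, Y):
    # X, Y: n x 3 arrays of paired coordinates
    return dict(n=len(X), sx=X.sum(0), sy=Y.sum(0),
                C=X.T @ Y, xx=(X * X).sum(), yy=(Y * Y).sum())

def merge(a, b):
    # additivity: statistics of a union are the sums of the parts (O(1) work)
    return {k: a[k] + b[k] for k in a}

def superpose(s):
    n = s["n"]
    mx, my = s["sx"] / n, s["sy"] / n
    C = s["C"] - n * np.outer(mx, my)             # centred cross-covariance
    U, sig, Vt = np.linalg.svd(C)
    d = np.sign(np.linalg.det(U @ Vt))            # guard against reflections
    R = (U @ np.diag([1.0, 1.0, d]) @ Vt).T       # optimal rotation (Kabsch)
    e = (s["xx"] - n * mx @ mx) + (s["yy"] - n * my @ my) \
        - 2.0 * (sig[0] + sig[1] + d * sig[2])
    return R, my - R @ mx, np.sqrt(max(e, 0.0) / n)   # rotation, translation, RMSD
```
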
IDBA-MTP: A Hybrid MetaTranscriptomic Assembler Based on Protein Information

Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study a microbial community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, de novo assembly of the mRNAs would seem necessary. However, NGS reads are short, and mRNAs share many similar regions and differ tremendously in abundance levels, making de novo assembly challenging. The existing assembler IDBA-MT, designed specifically for the assembly of metatranscriptomic data, performs well only on highly expressed mRNAs.

This paper introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that makes use of the fact that there is a database of millions of known protein sequences associated with mRNAs. Effectively using this protein information is non-trivial, given the size of the database and the fact that different mRNAs might lead to proteins with similar functions (because different amino acids might have similar characteristics). IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques, and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs.

Availability: www.cs.hku.hk/~alse/hkubrg/.

Henry C. M. Leung, S. M. Yiu, Francis Y. L. Chin
MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

Sequence-based protein homology detection has been extensively studied, but it remains very challenging for remote homologs with divergent sequences. So far the most sensitive methods for homology detection are based upon comparison of protein sequence profiles, which are usually derived from a multiple sequence alignment (MSA) of sequence homologs in a protein family and represented as a position-specific scoring matrix (PSSM) or a hidden Markov model (HMM). HMMs are more sensitive than PSSMs because the former contain position-specific gap information and also take into account correlations among sequentially adjacent residues. The main issue with HMMs is that they make use of only position-specific amino acid mutation patterns and very short-range residue correlations, but not long-range residue interactions. However, remote homologs may have very divergent sequences and be similar only at the level of (long-range) residue interaction patterns, which are not encoded in the currently popular PSSM or HMM models.

Jianzhu Ma, Sheng Wang, Zhiyong Wang, Jinbo Xu
An Integrated Model of Multiple-Condition ChIP-Seq Data Reveals Predeterminants of Cdx2 Binding

Profiling the genomic binding activity of regulatory proteins in multiple cell types is important for understanding cellular function, as a single regulator can bind to distinct sets of genomic targets depending on the cellular context in which it is expressed. Characterizing the determinants of such differential binding specificity is key to understanding how a single regulator can play multiple roles during development and other dynamic cellular processes. For example, pre-existing chromatin context such as chromatin accessibility or the binding of other regulators may determine the binding of some developmental transcription factors (TFs) [1,2], while other pioneer TFs may find their binding targets independently of the established chromatin state [3]. In order to reliably characterize such condition-specific regulatory activity, we require methods that can integrate analysis across multiple related experiments in a principled way.

Shaun Mahony, Matthew D. Edwards, Esteban O. Mazzoni, Richard I. Sherwood, Akshay Kakumanu, Carolyn A. Morrison, Hynek Wichterle, David K. Gifford
PASTA: Ultra-Large Multiple Sequence Alignment

In this paper, we introduce a new and highly scalable algorithm, PASTA, for large-scale multiple sequence alignment estimation. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy of the leading alignment methods on large datasets, and is able to analyze much larger datasets than the current methods. We also show that trees estimated on PASTA alignments are highly accurate – slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is very fast, highly parallelizable, and requires relatively little memory.

Siavash Mirarab, Nam Nguyen, Tandy Warnow
Fast Flux Module Detection Using Matroid Theory

Flux balance analysis (FBA) is one of the most frequently applied methods on genome-scale metabolic networks. Although FBA uniquely determines the optimal yield, the pathway that achieves it is usually not unique. The analysis of the optimal-yield flux space has been an open challenge. Flux variability analysis captures only some properties of the flux space, while elementary mode analysis is intractable due to the enormous number of elementary modes. However, Kelk et al. (2012) found that the space of optimal-yield fluxes decomposes into flux modules. These decompositions allow a much easier but still comprehensive analysis of the optimal-yield flux space.

Using the mathematical definition of module introduced by Müller and Bockmayr (2013), we discovered that flux modularity is a local rather than a global property, which opened connections to matroid theory. Specifically, we show that our modules correspond one-to-one to so-called separators of an appropriate matroid. Employing efficient algorithms developed in matroid theory, we are now able to compute the decomposition into modules in a few seconds for genome-scale networks. Since every module can be represented by one reaction that captures its function, we also present a method that uses this decomposition to visualize the interplay of modules. We expect the new method to replace flux variability analysis in the pipelines for metabolic networks.

Arne C. Müller, Frank J. Bruggeman, Brett G. Olivier, Leen Stougie
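
For context, plain FBA is a linear program; a toy sketch with SciPy is shown below. The paper's contribution, the module decomposition of the optimal-yield flux space, is not reproduced here, and the tiny network is hypothetical.

```python
# Plain flux balance analysis as a linear program: maximize the biomass flux
# subject to steady state S v = 0 and flux bounds.
import numpy as np
from scipy.optimize import linprog

def fba(S, lower, upper, biomass):
    c = np.zeros(S.shape[1]); c[biomass] = -1.0      # linprog minimizes
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=list(zip(lower, upper)), method="highs")
    return -res.fun, res.x

# toy network: uptake -> metabolite A -> biomass
S = np.array([[1.0, -1.0]])                          # one metabolite, two reactions
yield_opt, v = fba(S, [0, 0], [10, 10], biomass=1)
print(yield_opt)                                     # 10.0
```
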
Building a Pangenome Reference for a Population

A reference genome is a high quality individual genome that is used as a coordinate system for the genomes of a population, or genomes of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks, we formalise the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the underlying genomes’ ordering and orientation, creating a pangenome reference ordering. We show that this problem is NP-hard, but also demonstrate, empirically and within simulations, the performance of heuristic algorithms based upon a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software to create a pangenome reference for whole genome alignments, and demonstrate how it can be used to create novel genome browser visualizations using human variation data as a test.

Ngan Nguyen, Glenn Hickey, Daniel R. Zerbino, Brian Raney, Dent Earl, Joel Armstrong, David Haussler, Benedict Paten
CSAX: Characterizing Systematic Anomalies in eXpression Data

Methods for translating gene expression signatures into clinically relevant information have typically relied upon having many samples from patients with similar molecular phenotypes. Here, we address the question of what can be done when it is relatively easy to obtain healthy patient samples, but when abnormalities corresponding to disease states may be rare and one-of-a-kind. The associated computational challenge, anomaly detection, is a well-studied machine learning problem. However, due to the dimensionality and variability of expression data, existing methods based on feature space analysis or individual anomalously-expressed genes are insufficient. We present a novel approach, CSAX, that identifies pathways in an individual sample in which the normal expression relationships are disrupted. To evaluate our approach, we have compiled and released a compendium of public microarray data sets, reformulated to create a testbed for anomaly detection. We demonstrate the accuracy of CSAX on the data sets in our compendium, compare it to other leading anomaly-detection methods, and show that CSAX aids both in identifying anomalies and in explaining their underlying biology. We note the potential for the use of such methods in identifying subclasses of disease. We also describe an approach to characterizing the difficulty of specific expression anomaly detection tasks and discuss how one can estimate the feasibility of a specific task. Our approach provides an important step towards identification of individual disease patterns in the era of personalized medicine.

Keith Noto, Carla Brodley, Saeed Majidi, Diana W. Bianchi, Donna K. Slonim
WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads

The human genome is diploid; that is, each of its chromosomes comes in two copies. This requires phasing the single-nucleotide polymorphisms (SNPs), that is, assigning them to the two copies, beyond just detecting them. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which avoid making use of direct read information, constitute the state of the art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing.

Future sequencing technologies, however, promise to generate reads of lengths and error rates that allow bridging all SNP positions in the genome, with sufficient numbers of SNPs per read. Existing haplotype assembly approaches, however, profit precisely, in terms of computational complexity, from the limited length of current-generation reads, because their runtime is usually exponential in the number of SNPs per sequencing read. This implies that such approaches will not be able to exploit the benefits of long enough, future-generation reads.

Here, we suggest WhatsHap, a novel dynamic programming approach to haplotype assembly. It is the first approach that yields provably optimal solutions to the weighted minimum error correction (wMEC) problem in runtime linear in the number of SNPs per sequencing read, making it suitable for future-generation reads. WhatsHap is a fixed-parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20x, processing chromosomes on standard workstations in only 1-2 hours. Our simulation study shows that the quality of haplotypes assembled by WhatsHap significantly improves with increasing read length, both in terms of genome coverage and in terms of switch errors. The switch error rates we achieve in our simulations are superior to those obtained by state-of-the-art statistical phasers.

Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo van Iersel, Leen Stougie, Gunnar W. Klau, Alexander Schönhuth
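
To make the wMEC objective concrete, here is a brute-force sketch for tiny instances: it enumerates all bipartitions of the reads and is exponential in the number of reads, whereas WhatsHap's dynamic program is exponential only in coverage.

```python
# Brute-force weighted minimum error correction (illustration only): assign each
# read to one of two haplotypes; per SNP column and haplotype side, charge the
# cheaper weighted correction (flipping the minority votes).
from itertools import product

def wmec_bruteforce(reads, n_snps):
    # reads: list of dicts {snp_index: (allele in {0,1}, error weight)}
    best_cost, best_assign = float("inf"), None
    for assign in product((0, 1), repeat=len(reads)):
        cost = 0
        for j in range(n_snps):
            for side in (0, 1):
                w = [0, 0]                       # weight voting for allele 0 / 1
                for r, s in zip(reads, assign):
                    if s == side and j in r:
                        allele, weight = r[j]
                        w[allele] += weight
                cost += min(w)                   # correct the minority votes
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_cost, best_assign
```
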
Simultaneous Inference of Cancer Pathways and Tumor Progression from Cross-Sectional Mutation Data

Recent cancer sequencing studies provide a wealth of somatic mutation data from a large number of patients. One of the most intriguing and challenging questions arising from these data is whether the temporal order of somatic mutations in a cancer follows any common progression. Since we usually obtain only one sample from a patient, such inferences are commonly made from cross-sectional data from different patients. This analysis is complicated by the extensive variation in the somatic mutations across different patients, variation that is reduced by examining combinations of mutations in various pathways. Thus far, methods to reconstruct tumor progression at the pathway level have restricted attention to known, a priori defined pathways.

In this work we show how to simultaneously infer pathways and the temporal order of their mutations from cross-sectional data, leveraging the exclusivity property of driver mutations within a pathway. We define the Pathway Linear Progression Model and derive a combinatorial formulation for the problem of finding the optimal model from mutation data. We show that while this problem is NP-hard, with enough samples its optimal solution uniquely identifies the correct model with high probability, even when errors are present in the mutation data. We then formulate the problem as an integer linear program (ILP), which allows the analysis of datasets from recent studies with large numbers of samples. We use our algorithm to analyze somatic mutation data from three cancer studies, including two large-sample studies from The Cancer Genome Atlas (TCGA) on colorectal cancer and glioblastoma. The models reconstructed with our method capture most of the current knowledge of the progression of somatic mutations in these cancer types, while also providing new insights into tumor progression at the pathway level.

Benjamin J. Raphael, Fabio Vandin
dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes

While the number of sequenced diploid genomes of interest has been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from NGS data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. We present the first de Bruijn graph assembler for HP genomes, dipSPAdes, and demonstrate that it significantly improves on the state of the art in HP genome assembly.

Yana Safonova, Anton Bankevich, Pavel A. Pevzner
An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes

Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this paper, we propose an ILP (integer linear programming) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse and rat genomes, where once again our method outperforms MSOAR.

Mingfu Shao, Yu Lin, Bernard Moret
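
For context, the duplicate-free case mentioned in the abstract admits a simple solution. The sketch below assumes circular genomes with identical gene content and uses d = n - c (n genes, c cycles in the adjacency graph); the NP-hard duplicated case addressed by the paper's ILP is not handled here.

```python
# DCJ distance for duplicate-free circular genomes: build the graph on gene
# extremities whose edges are the adjacencies of both genomes (each extremity
# gets degree 2, so every component is a cycle), then d = n - cycles.
def adjacencies(genome):
    # genome: signed integers in circular order; returns extremity adjacencies
    adj = []
    for i, a in enumerate(genome):
        b = genome[(i + 1) % len(genome)]
        left = (abs(a), "h" if a > 0 else "t")    # extremity where gene a ends
        right = (abs(b), "t" if b > 0 else "h")   # extremity where gene b starts
        adj.append((left, right))
    return adj

def dcj_distance(A, B):
    graph = {}
    for x, y in adjacencies(A) + adjacencies(B):
        graph.setdefault(x, []).append(y)
        graph.setdefault(y, []).append(x)
    seen, cycles = set(), 0
    for start in graph:                            # count connected components;
        if start in seen:                          # each one is a cycle here
            continue
        cycles += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(graph[u])
    return len(A) - cycles

print(dcj_distance([1, 2, 3], [1, -2, 3]))         # one inversion: distance 1
```
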
HIT’nDRIVE: Multi-driver Gene Prioritization Based on Hitting Time

A key challenge in cancer genomics is the identification and prioritization of genomic aberrations that potentially act as drivers of cancer. In this paper we introduce HIT’nDRIVE, a combinatorial method to identify aberrant genes that can collectively influence possibly distant “outlier” genes, based on what we call the “random-walk facility location” (RWFL) problem on an interaction network. RWFL differs from the standard facility location problem by its use of “multi-hitting time”, the expected minimum number of hops in a random walk originating from any aberrant gene to reach an outlier. HIT’nDRIVE thus aims to find the smallest set of aberrant genes from which one can reach outliers within a desired multi-hitting time. To do so, it estimates the multi-hitting time based on the independent hitting times from the drivers to any given outlier and reduces the RWFL to a weighted multi-set cover problem, which it solves as an integer linear program (ILP). We apply HIT’nDRIVE to identify aberrant genes that potentially act as drivers in a cancer data set and make phenotype predictions using only the potential drivers, more accurately than alternative approaches.

Raunak Shrestha, Ermin Hodzic, Jake Yeung, Kendric Wang, Thomas Sauerwald, Phuong Dao, Shawn Anderson, Himisha Beltran, Mark A. Rubin, Colin C. Collins, Gholamreza Haffari, S. Cenk Sahinalp
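
The single-target hitting time underlying the RWFL objective reduces to a linear system; a sketch follows. HIT'nDRIVE's multi-hitting times and ILP reduction are beyond this illustration.

```python
# Expected hitting time of a random walk to a target node: h[target] = 0 and
# h[i] = 1 + sum_j P[i, j] * h[j] for i != target, a linear system.
import numpy as np

def hitting_times(P, target):
    n = P.shape[0]
    others = [i for i in range(n) if i != target]
    A = np.eye(n - 1) - P[np.ix_(others, others)]
    h = np.linalg.solve(A, np.ones(n - 1))
    full = np.zeros(n)
    full[others] = h
    return full

# walk on a 3-cycle: step to either neighbor with equal probability
P = np.array([[0, .5, .5], [.5, 0, .5], [.5, .5, 0]])
print(hitting_times(P, target=0))   # [0., 2., 2.]
```
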
Modeling Mutual Exclusivity of Cancer Mutations

Recent years in cancer research have been characterized by both an accumulation of data and a growing awareness of its overwhelming complexity. Consortia like The Cancer Genome Atlas [1] generated large collections of tumor samples, recording the presence or absence of genomic alterations, such as somatic point mutations, amplifications, or deletions of genes. One of the basic tasks in the analysis of tumor genomic data is to elucidate sets of genes involved in a common oncogenic pathway. A de novo approach to this task is to search for mutually exclusive patterns in cancer genomic data [2, 3, 4, 5], where these alterations tend not to occur together in the same patient. Such patterns are commonly evaluated and ranked by their coverage and impurity. Coverage is defined as the number of patient samples in which at least one alteration occurred, while impurity refers to non-exclusive, additional alterations that violate strict mutual exclusivity. Mutually exclusive patterns have frequently been observed in cancer data and have been associated with functional pathways [6].

Ewa Szczurek, Niko Beerenwinkel
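
Coverage and impurity, as defined above, can be computed directly from a binary alteration matrix. A toy sketch with hypothetical gene names:

```python
# Coverage: #patients with >=1 alteration in the gene set.
# Impurity: extra alterations beyond the first, which violate strict exclusivity.
def coverage_impurity(M, gene_set):
    coverage = impurity = 0
    for patient in M:                        # patient: mapping gene -> 0/1 flag
        hits = sum(patient[g] for g in gene_set)
        if hits >= 1:
            coverage += 1
        if hits >= 2:
            impurity += hits - 1
    return coverage, impurity

M = [{"TP53": 1, "MDM2": 0}, {"TP53": 1, "MDM2": 1}, {"TP53": 0, "MDM2": 0}]
print(coverage_impurity(M, ["TP53", "MDM2"]))   # (2, 1)
```
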
Viral Quasispecies Assembly via Maximal Clique Enumeration

Genetic variability of virus populations within individual hosts is a key determinant of pathogenesis, virulence, and treatment outcome. It is of clinical importance to identify and quantify the intra-host ensemble of viral haplotypes, called the viral quasispecies. Ultra-deep next-generation sequencing (NGS) of mixed samples is currently the only efficient way to probe the genetic diversity of virus populations in greater detail. Major challenges with this bulk sequencing approach are (i) to distinguish genetic diversity from sequencing errors, (ii) to assemble an unknown number of different, unknown haplotype sequences over a genomic region larger than the average read length, (iii) to estimate their frequency distribution, and (iv) to detect structural variants, such as large insertions and deletions (indels), that are due to erroneous replication or alternative splicing. Even though NGS is currently being introduced into clinical diagnostics, the de facto standard procedure for assessing quasispecies structure is still single-nucleotide variant (SNV) calling. Viral phenotypes cannot be predicted solely from individual SNVs, as epistatic interactions are abundant in RNA viruses. Therefore, reconstruction of long-range viral haplotypes has the potential to be adopted, as the data are already available.

Armin Töpfer, Tobias Marschall, Rowena A. Bull, Fabio Luciani, Alexander Schönhuth, Niko Beerenwinkel
Correlated Protein Function Prediction via Maximization of Data-Knowledge Consistency

Conventional computational approaches to protein function prediction usually consider one function at a time, treating the functions as separate target classes. However, biological processes are highly correlated, so the functions assigned to proteins are not independent. It would therefore be beneficial to make use of function category correlations in predicting protein function. We propose a novel Maximization of Data-Knowledge Consistency (MDKC) approach to exploit function category correlations for protein function prediction. Our approach rests on the assumption that two proteins are likely to have a large overlap in their annotated functions if they are highly similar according to certain experimental data. We first establish a new pairwise protein similarity using protein annotations from the knowledge perspective. Then, by maximizing the consistency between the established knowledge similarity upon annotations and the data similarity upon biological experiments, putative functions are assigned to unannotated proteins. Most importantly, function category correlations are elegantly incorporated through the knowledge similarity. Comprehensive experimental evaluations on Saccharomyces cerevisiae data demonstrate promising results that validate the performance of our method.

Hua Wang, Heng Huang, Chris Ding
Bayesian Multiple Protein Structure Alignment

Multiple protein structure alignment is an important tool in computational biology, with numerous algorithms published in the past two decades. However, the recent literature highlights a growing recognition of the inconsistencies among alignments produced by different algorithms and the instability of alignments obtained by individual algorithms under small fluctuations of the input structures. Here we present a probabilistic, model-based approach to the problem of multiple structure alignment, using an explicit statistical model. The resulting algorithm produces a Bayesian posterior distribution over alignments which accounts for alignment uncertainty arising from evolutionary variability, experimental noise, and thermal fluctuation, as well as for sensitivity to alignment algorithm parameters. We demonstrate the robustness of this approach on alignments identified previously in the literature as “difficult” for existing algorithms. We also show the potential for significant stabilization of tree reconstruction in structural phylogenetics.

Rui Wang, Scott C. Schmidler
Gene-Gene Interactions Detection Using a Two-Stage Model

Genome-wide association studies (GWAS) have discovered numerous loci involved in genetic traits. Virtually all studies have reported associations between individual single nucleotide polymorphisms (SNPs) and traits. However, it is likely that complex traits are influenced by interactions of multiple SNPs. One approach to detecting SNP interactions is the brute-force approach, which performs a pairwise association test between a trait and each pair of SNPs. The brute-force approach is often computationally infeasible because of the large number of SNPs collected in current GWAS. We propose a two-stage model, the Threshold-based Efficient Pairwise Association Approach (TEPAA), to reduce the number of tests needed while maintaining almost identical power to the brute-force approach. In the first stage, our method performs the single-marker test on all SNPs and selects a subset of SNPs that achieve a certain significance threshold. In the second stage, we perform a pairwise association test between the trait and pairs of the SNPs selected in the first stage. The key insight of our approach is that we derive the joint distribution between the association statistics of a single SNP and the association statistics of pairs of SNPs. This joint distribution allows us to provide guarantees that the statistical power of our approach will closely approximate that of the brute-force approach. We applied our approach to the Northern Finland Birth Cohort data and achieved a 63-fold speedup while maintaining 99% of the power of the brute-force approach.

Zhanyong Wang, Jae Hoon Sul, Sagi Snir, Jose A. Lozano, Eleazar Eskin
A Geometric Clustering Algorithm and Its Applications to Structural Data

An important feature of structural data, especially those from structure determination and protein-ligand docking programs, is that their distribution can be either uniform or non-uniform. Traditional clustering algorithms developed specifically for non-uniformly distributed data may not be adequate for their classification. Here we present a geometric partitional algorithm that can be applied to both uniformly and non-uniformly distributed data. The algorithm is a top-down approach that recursively selects outliers as the seeds of new clusters until all the structures within a cluster satisfy certain requirements. Applications of the algorithm to a diverse set of data from NMR structure determination, protein-ligand docking, and simulation show that it is superior to previous clustering algorithms at identifying correct but minor clusters. The algorithm should be useful for the identification of correct docking poses and for speeding up an iterative process widely used in NMR structure determination.

Shutan Xu, Shuxue Zou, Lincong Wang
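
A hedged sketch of the divisive scheme described above, with a simple maximum-radius stopping rule standing in for the paper's requirements:

```python
# Top-down geometric partitioning (illustration only): repeatedly promote the
# worst outlier to a new cluster seed and reassign points to their nearest seed,
# until every cluster's radius is below a threshold. Works on uniform and
# non-uniform point sets alike.
import numpy as np

def divisive_cluster(X, max_radius):
    seeds = [0]                                    # start from an arbitrary seed
    while True:
        d = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
        label = d.argmin(axis=1)                   # nearest-seed assignment
        mind = d.min(axis=1)
        worst = mind.argmax()                      # farthest point from its seed
        if mind[worst] <= max_radius:
            return label
        seeds.append(worst)                        # outlier becomes a new seed
```
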
A Spatial-Aware Haplotype Copying Model with Applications to Genotype Imputation

Ever since its introduction, the haplotype copying model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is a priori equally likely to contribute to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copying model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that, at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copying model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy and lower computational runtime than the standard approach. Finally, we show that our proposed model can be used to localize individuals on the genetic-geographic map on the basis of their genotype data.

Wen-Yun Yang, Farhad Hormozdiari, Eleazar Eskin, Bogdan Pasaniuc
Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification

It is becoming increasingly impractical to store raw sequencing data indefinitely for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets.

Availability: An implementation of our method, RQS, is available for download at http://rqs.csail.mit.edu/.

Y. William Yu, Deniz Yorukoglu, Bonnie Berger
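
One loose reading of the k-mer landscape idea, not the RQS algorithm itself: positions covered by "solid" (frequent) k-mers are trusted, so their quality scores can be collapsed to a default value, leaving high-entropy qualities only at suspect positions. All thresholds below are hypothetical.

```python
# Sketch of quality sparsification guided by solid k-mers (illustration only):
# a position covered by at least one frequent k-mer keeps a single default
# quality; other positions retain their original scores.
from collections import Counter

def sparsify_quals(read, quals, counts, k, solid=3, q_default=30):
    trusted = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] >= solid:
            for j in range(i, i + k):
                trusted[j] = True
    return [q_default if t else q for t, q in zip(trusted, quals)]

reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = Counter(r[i:i + 4] for r in reads for i in range(len(r) - 3))
print(sparsify_quals("ACGTAC", [30, 12, 25, 8, 31, 22], counts, k=4))
# -> [30, 30, 30, 30, 30, 30]: all positions covered by solid 4-mers
```
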
Reconstructing Breakage Fusion Bridge Architectures Using Noisy Copy Numbers

The Breakage Fusion Bridge (BFB) process is a key marker of genomic instability, producing highly rearranged genomes in a relatively small number of cell cycles. While the process itself was observed as early as the late 1930s, little is known about the extent of BFB in tumor genome evolution. This is partly due to methodological hurdles: detecting BFB requires either the rare observation of a spontaneous BFB occurrence or rigorous assays for identifying BFB-modified genomes after the process has ceased. Moreover, BFB can dramatically increase copy numbers of chromosomal segments, which in turn complicates both reference-assisted and ab initio genome assembly.

Based on available data such as Next Generation Sequencing (NGS) and Array Comparative Genomic Hybridization (aCGH) data, we show here how BFB evidence may be identified, and how to predict all possible evolutions of the process with respect to observed data. Specifically, we describe practical algorithms that, given a chromosomal arm segmentation and noisy segment copy number estimates, produce all segment count vectors supported by the data that can be produced by BFB, together with all corresponding BFB architectures. This extends the scope of analyses described in our previous work, which produced a single count vector and architecture per instance.

We apply these analyses to a comprehensive human cancer dataset, demonstrate the effectiveness and efficiency of the computation, and suggest methods for further assertion of candidate BFB samples. An online appendix, the source code of our tool, and analysis results are available at http://cseweb.ucsd.edu/~vbafna/bfb.

Shay Zakov, Vineet Bafna
Reconciliation with Non-binary Gene Trees Revisited

By reconciling the phylogenetic tree of a gene family with the corresponding species tree, it is possible to infer lineage-specific duplications and losses with high confidence and hence annotate orthologs and paralogs. However, the currently available reconciliation methods for non-binary gene trees are too computationally expensive to be applied at a genomic scale. Here, an O(|G| + |S|) algorithm is presented for reconciling an arbitrary gene tree G with its corresponding species tree S, where |·| denotes the number of nodes in the corresponding tree. The improvement is achieved through two innovations: a fast computation of compressed child-image subtrees and efficient reconstruction of irreducible duplication histories.

Yu Zheng, Louxin Zhang
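
For background, the classical LCA reconciliation for binary gene trees can be sketched compactly; the paper's O(|G| + |S|) treatment of non-binary gene trees is substantially more involved and is not reproduced here.

```python
# Classical LCA reconciliation for binary gene trees (illustration only): map
# each gene node to the LCA of its children's species images; the node is a
# duplication iff its image coincides with a child's image.
def reconcile(gene_children, gene_leaf_species, species_parent):
    depth = {}
    def d(s):                                   # species depths for naive LCA
        if s not in depth:
            depth[s] = 0 if species_parent[s] is None else d(species_parent[s]) + 1
        return depth[s]
    def lca(a, b):
        while d(a) > d(b): a = species_parent[a]
        while d(b) > d(a): b = species_parent[b]
        while a != b: a, b = species_parent[a], species_parent[b]
        return a
    image, events = dict(gene_leaf_species), {}
    def walk(g):
        if g not in gene_children:              # leaf: species image is known
            return image[g]
        l, r = (walk(c) for c in gene_children[g])
        image[g] = lca(l, r)
        events[g] = "duplication" if image[g] in (l, r) else "speciation"
        return image[g]
    roots = set(gene_children) - {c for cs in gene_children.values() for c in cs}
    for g in roots:
        walk(g)
    return image, events
```
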
Learning Protein-DNA Interaction Landscapes by Integrating Experimental Data through Computational Models

Transcriptional regulation is directly enacted by the interactions between DNA and many proteins, including transcription factors, nucleosomes, and polymerases. A critical step in deciphering transcriptional regulation is to infer, and eventually predict, the precise locations of these interactions, along with their strength and frequency. While recent datasets yield great insight into these interactions, individual data sources often provide only noisy information regarding one specific aspect of the complete interaction landscape. For example, chromatin immunoprecipitation (ChIP) reveals the precise binding positions of a protein, but only for one protein at a time. In contrast, nucleases like MNase and DNase reveal binding positions for many different proteins at once, but cannot easily determine the identities of those proteins. Here, we develop a novel statistical framework that integrates different sources of experimental information within a thermodynamic model of competitive binding to jointly learn a holistic view of the in vivo protein-DNA interaction landscape. We show that our framework learns an interaction landscape with increased accuracy, explaining multiple sets of data in accordance with thermodynamic principles of competitive DNA binding. The resulting model of genomic occupancy provides a precise, mechanistic vantage point from which to explore the role of protein-DNA interactions in transcriptional regulation.

Jianling Zhong, Todd Wasson, Alexander J. Hartemink
Imputation of Quantitative Genetic Interactions in Epistatic MAPs by Interaction Propagation Matrix Completion

A popular large-scale gene interaction discovery platform is the Epistatic Miniarray Profile (E-MAP). E-MAPs benefit from a quantitative output, which makes it possible to detect subtle interactions. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, thus completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to largely incomplete data sets. We introduce a new interaction data imputation method called interaction propagation matrix completion (IP-MC). The core of IP-MC is a low-rank (latent) probabilistic matrix completion approach that considers additional knowledge presented through a gene network. IP-MC assumes that interactions are transitive, such that latent gene interaction profiles depend on the profiles of their direct neighbors in a given gene network. As the IP-MC inference algorithm progresses, the latent interaction profiles propagate through the branches of the network. In a study with three different E-MAP assays and protein-protein interaction and Gene Ontology similarity networks, IP-MC significantly surpassed existing alternative techniques. The inclusion of information from gene networks also allows IP-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that current imputation approaches cannot address.

Marinka Žitnik, Blaž Zupan
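
The flavor of network-propagated completion can be sketched as low-rank factorization with graph smoothing via gradient descent. IP-MC's probabilistic model and inference differ, and all parameters below are illustrative.

```python
# Low-rank matrix completion with network regularization (illustration only):
# factor the interaction matrix as U V^T, fitting observed entries while pulling
# each gene's latent profile toward its network neighbors' profiles.
import numpy as np

def complete(M, mask, edges, rank=2, lam=0.1, lr=0.01, iters=2000, seed=0):
    # M: n x n interaction matrix; mask: 1 where measured, 0 where missing;
    # edges: list of (i, j) gene-network neighbor pairs
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    U = rng.normal(0, 0.1, (n, rank))
    V = rng.normal(0, 0.1, (n, rank))
    for _ in range(iters):
        R = mask * (U @ V.T - M)              # residual on observed entries only
        gU, gV = R @ V, R.T @ U
        for i, j in edges:                    # propagate profiles along network
            gU[i] += lam * (U[i] - U[j])
            gU[j] += lam * (U[j] - U[i])
        U -= lr * gU
        V -= lr * gV
    return U @ V.T                            # imputed (completed) matrix
```
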
Backmatter
Metadata
Title
Research in Computational Molecular Biology
Edited by
Roded Sharan
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-05269-4
Print ISBN
978-3-319-05268-7
DOI
https://doi.org/10.1007/978-3-319-05269-4
