Efficient Alignment Free Sequence Comparison with Bounded Mismatches

Alignment free sequence comparison methods are attracting persistent interest, driven by data-intensive applications in genome-wide molecular taxonomy and phylogentic reconstruction. Among the methods based on substring composition, the

Average Common Substring

(

$$\mathsf {ACS}$$

) measure proposed by Burstein

et al.

(RECOMB 2005) admits a straightforward linear time sequence comparison algorithm, while yielding impressive results in multiple applications. An important direction of research is to extend the approach to permit a bounded edit/hamming distance between substrings, so as to reflect more accurately the evolutionary process. To date, however, algorithms designed to incorporate

$$k \ge 1$$

mismatches have

$$O(kn^2)$$

worst-case complexity, worse than the

$$O(n^2)$$

alignment algorithms they are meant to replace. On the other hand, accounting for mismatches does show to lead to much improved classification, while heuristics can improve practical performance. In this paper, we close the gap by presenting the first provably efficient algorithm for the

$$k$$

-mismatch average common string

(

$$\mathsf {ACS}_k$$

) problem that takes

$$O(n)$$

space and

$$O(n\log ^{k+1} n)$$

time in the worst case for any constant

$$k$$

. Our method extends the generalized suffix tree model to incorporate a carefully selected bounded set of perturbed suffixes, and can be applicable to other complex approximate sequence matching problems.

Srinivas Aluru, Alberto Apostolico, Sharma V. Thankachan

DockStar: A Novel ILP Based Integrative Method for Structural Modelling of Multimolecular Protein Complexes (Extended Abstract)

Atomic resolution modelling of large multimolecular protein complexes is a key task in Structural Cell Biology. A single cell consists of hundreds of different functional complexes.

Naama Amir, Dan Cohen, Haim J. Wolfson

CRISPR Detection from Short Reads Using Partial Overlap Graphs

Clustered regularly interspaced short palindromic repeats (CRISPR) are structured regions in bacterial and archaeal genomes, which are part of an adaptive immune system against phages. Most of the automated tools that detect CRISPR loci rely on assembled genomes. However, many assemblers do not successfully handle repetitive regions. The first tool to work directly on raw sequence data is Crass, which requires that reads are long enough to contain two copies of the same repeat. We developed a method to identify CRISPR repeats from a raw sequence data of short reads. The algorithm is based on an observation differentiating CRISPR repeats from other types of repeats, and it involves a series of partial constructions of the overlap graph. A preliminary implementation of the algorithm shows good results and detects CRISPR repeats in cases where other tools fail to do so.

Ilan Ben-Bassat, Benny Chor

HapTree-X: An Integrative Bayesian Framework for Haplotype Reconstruction from Transcriptome and Genome Sequencing Data

By running standard genotype calling tools, it is possible to accurately identify the number of “wild type” and “mutant” alleles for each single-nucleotide polymorphism (SNP) site. However, in the case of two heterozygous SNP sites, genotype calling tools cannot determine whether “mutant” alleles from different SNP loci are on the same chromosome or on different homologous chromosomes (i.e. compound heterozygote).

Emily Berger, Deniz Yorukoglu, Bonnie Berger

Read Clouds Uncover Variation in Complex Regions of the Human Genome

The rapid advance of next-generation sequencing (NGS) technologies has decreased the cost of genomic sequencing dramatically, enabling accurate variant discovery across whole genomes of many individuals. Current large-scale and cost-effective resequencing platforms produce reads of limited length, and as a result, reliable identification of variants within highly homologous regions of a target genome remains challenging.

Alex Bishara, Yuling Liu, Dorna Kashef-Haghighi, Ziming Weng, Daniel E. Newburger, Robert West, Arend Sidow, Serafim Batzoglou

Learning Microbial Interaction Networks from Metagenomic Count Data

Many microbes associate with higher eukaryotes and impact their vitality. In order to engineer microbiomes for host benefit, we must understand the rules of community assembly and maintenence, which in large part, demands an understanding of the direct interactions between community members. Toward this end, we’ve developed a Poisson-multivariate normal hierarchical model to learn direct interactions from the count-based output of standard metagenomics sequencing experiments. Our model controls for confounding predictors at the Poisson layer, and captures direct taxon-taxon interactions at the multivariate normal layer using an

$$\ell _1$$

penalized precision matrix. We show in a synthetic experiment that our method handily outperforms state-of-the-art methods such as SparCC and the graphical lasso (glasso). In a real,

in planta

perturbation experiment of a nine member bacterial community, we show our model, but not SparCC or glasso, correctly resolves a direct interaction structure among three community members that associate with

Arabidopsis thaliana

roots. We conclude that our method provides a structured, accurate, and distributionally reasonable way of modeling correlated count based random variables and capturing direct interactions among them.

Code Availability:

Our model is available on CRAN as an R package, MInt.

Surojit Biswas, Meredith McDonald, Derek S. Lundberg, Jeffery L. Dangl, Vladimir Jojic

Immunoglobulin Classification Using the Colored Antibody Graph

The somatic recombination of V, D, and J gene-segments in B-cells, introduces a great deal of diversity, and divergence from reference segments. Many recent studies of antibodies focus on the population of antibody transcripts that show which V, D, and J gene-segments have been favored for a particular antigen, a repertoire. To properly describe the antibody repertoire, each antibody must be labeled by its constituting V, D, and J gene-segment, a task made difficult by somatic recombination and hypermutation events. While previous approaches to repertoire analysis were based on sequential alignments, we describe a new de Bruijn graph based algorithm to perform VDJ labeling, and benchmark its performance.

Stefano R. Bonissone, Pavel A. Pevzner

CIDANE: Comprehensive Isoform Discovery and Abundance Estimation

High-throughput sequencing of cellular RNA (RNA-seq) allows to assess the set of all RNA molecules, the transcriptome, produced by a cell at a high resolution, under various conditions. The assembly of short sequencing reads to full-length transcripts, however, poses profound challenges to bioinformatics tools.

Stefan Canzar, Sandro Andreotti, David Weese, Knut Reinert, Gunnar W. Klau

Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

Complex biological systems have been successfully modeled by biochemical and genetic interaction networks, typically gathered from high-throughput (HTP) data. These networks can be used to infer functional relationships between genes or proteins.

Hyunghoon Cho, Bonnie Berger, Jian Peng

Fragmentation Trees Reloaded

Metabolites, small molecules that are involved in cellular reactions, provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually relies on tandem mass spectrometry to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. Fragmentation trees have become a powerful tool for the interpretation of tandem mass spectrometry data of small molecules. These trees are found by combinatorial optimization, and aim at explaining the experimental data via fragmentation cascades. To obtain biochemically meaningful results requires an elaborate optimization function.

We present a new scoring for computing fragmentation trees, transforming the combinatorial optimization into a maximum a posteriori estimator. We demonstrate the superiority of the new scoring for two tasks: Both for the de novo identification of molecular formulas of unknown compounds, and for searching a database for structurally similar compounds, our methods performs significantly better than the previous scoring, as well as other methods for this task. Our method can expedite the workflow for untargeted metabolomics, allowing researchers to investigate unknowns using automated computational methods.

Kai Dührkop, Sebastian Böcker

KGSrna: Efficient 3D Kinematics-Based Sampling for Nucleic Acids

Noncoding ribonucleic acids (RNA) play a critical role in a wide variety of cellular processes, ranging from regulating gene expression to post-translational modification and protein synthesis. Their activity is modulated by highly dynamic exchanges between three-dimensional conformational substates, which are difficult to characterize experimentally and computationally. Here, we present an innovative, entirely kinematic computational procedure to efficiently explore the native ensemble of RNA molecules. Our procedure projects degrees of freedom onto a subspace of conformation space defined by distance constraints in the tertiary structure. The dimensionality reduction enables efficient exploration of conformational space. We show that the conformational distributions obtained with our method broadly sample the conformational landscape observed in NMR experiments. Compared to normal mode analysis-based exploration, our procedure diffuses faster through the experimental ensemble while also accessing conformational substates to greater precision. Our results suggest that conformational sampling with a highly reduced but fully atomistic representation of noncoding RNA expresses key features of their dynamic nature.

Rasmus Fonseca, Henry van den Bedem, Julie Bernauer

Locating a Tree in a Phylogenetic Network in Quadratic Time

A fundamental problem in the study of phylogenetic networks is to determine whether or not a given phylogenetic network contains a given phylogenetic tree. We develop a quadratic-time algorithm for this problem for binary nearly-stable phylogenetic networks. We also show that the number of reticulations in a reticulation visible or nearly stable phylogenetic network is bounded from above by a function linear in the number of taxa.

Philippe Gambette, Andreas D. M. Gunawan, Anthony Labarre, Stéphane Vialette, Louxin Zhang

Constructing Structure Ensembles of Intrinsically Disordered Proteins from Chemical Shift Data

Modeling the structural ensemble of intrinsically disordered proteins (IDPs), which lack fixed structures, is essential in understanding their cellular functions and revealing their regulation mechanisms in signaling pathways of related diseases (e.g., cancers and neurodegenerative disorders). Though the ensemble concept has been widely believed to be the most accurate way to depict 3D structures of IDPs, few of the traditional ensemble-based approaches effectively address the

degeneracy problem

which occurs when multiple solutions are consistent with experimental data and is the main challenge in the IDP ensemble construction task. In this paper, based on a predefined conformational library, we formalize the structure ensemble construction problem into a least squares framework, which provides the optimal solution when the data constraints outnumber unknown variables. To deal with the degeneracy problem, we further propose a regularized regression approach based on the elastic net technique with the assumption that the weights to be estimated for individual structures in the ensemble are sparse. We have validated our methods through a reference ensemble approach as well as by testing the real biological data of three proteins, including alpha-synuclein, the translocation domain of Colocin N and the K18 domain of Tau protein.

Huichao Gong, Sai Zhang, Jiangdian Wang, Haipeng Gong, Jianyang Zeng

Comets (Constrained Optimization of Multistate Energies by Tree Search): A Provable and Efficient Algorithm to Optimize Binding Affinity and Specificity with Respect to Sequence

Practical protein design problems require designing sequences with a combination of affinity, stability, and specificity requirements.

Multistate

protein design algorithms model multiple structural or binding “states" of a protein to address these requirements.

comets

provides a new level of versatile, efficient, and provable multistate design. It provably returns the minimum with respect to sequence of any desired linear combination of the energies of multiple protein states, subject to constraints on other linear combinations. Thus, it can target nearly any combination of affinity (to one or multiple ligands), specificity, and stability (for multiple states if needed). Empirical calculations on 52 protein design problems showed

comets

is far more efficient than the previous state of the art for provable multistate design (exhaustive search over sequences).

comets

can handle a very wide range of protein flexibility and can enumerate a gap-free list of the best constraint-satisfying sequences in order of objective function value.

Mark A. Hallen, Bruce R. Donald

Efficient and Accurate Multiple-Phenotypes Regression Method for High Dimensional Data Considering Population Structure

A typical GWAS tests correlation between a single phenotype and each genotype one at a time. However, it is often very useful to analyze many phenotypes simultaneously. For example, this may increase the power to detect variants by capturing unmeasured aspects of complex biological networks that a single phenotype might miss. There are several multivariate approaches that try to detect variants related to many phenotypes, but none of them consider population structure and each may result in a significant number of false positive identifications. Here, we introduce a new methodology, referred to as GAMMA, that could both simultaneously analyze many phenotypes as well as correct for population structure. In a simulated study, GAMMA accurately identifies true genetic effects without false positive identifications, while other methods either fail to detect true effects or result in many false positive identifications. We further apply our method to genetic studies of yeast and gut microbiome from mouse and show that GAMMA identifies several variants that are likely to have a true biological mechanism.

Jong Wha J. Joo, Eun Yong Kang, Elin Org, Nick Furlotte, Brian Parks, Aldons J. Lusis, Eleazar Eskin

BWM*: A Novel, Provable, Ensemble-Based Dynamic Programming Algorithm for Sparse Approximations of Computational Protein Design

Current dynamic programming protein design algorithms that exploit the optimal substructure induced by

sparse energy functions

compute only the Global Minimum Energy Conformation (GMEC). This disproportionately favors the sequence of a single, static conformation and overlooks better sequences with multiple low-energy conformations. We propose a novel, provable, dynamic programming algorithm called

Branch-Width Minimization

$$^*$$

(BWM

$$^*$$

) to enumerate a gap-free ensemble of conformations in order of increasing energy. Given a branch-decomposition of branch-width

$$w$$

for an

$$n$$

-residue protein design with at most

$$q$$

discrete side-chain conformations per residue, BWM

$$^*$$

returns the

sparse GMEC

in O(

$$nw^2q^{\frac{3}{2}w}$$

) time, and enumerates each additional conformation in O(

$$n\log q$$

) time. BWM

$$^*$$

outperforms the classical search algorithm A

$$^*$$

in 49 of 67 protein design problems, computing the full ensemble or a close approximation up to two orders of magnitude faster. Performance of BWM

$$^*$$

can be predicted cheaply beforehand, allowing selection of the most efficient algorithm for each design problem.

Jonathan D. Jou, Swati Jain, Ivelin Georgiev, Bruce R. Donald

An Efficient Nonlinear Regression Approach for Genome-Wide Detection of Marginal and Interacting Genetic Variations

Genome-wide association studies have revealed individual genetic variants associated with phenotypic traits such as disease risk and gene expressions. However, detecting pairwise interaction effects of genetic variants on traits still remains a challenge due to a large number of combinations of variants (

$$\sim 10^{11}$$

SNP pairs in the human genome), and relatively small sample sizes (typically

$$< 10^{4}$$

). Despite recent breakthroughs in detecting interaction effects, there are still several open problems, including: (1) how to quickly process a large number of SNP pairs, (2) how to distinguish between true signals and SNPs/SNP pairs merely correlated with true signals, (3) how to detect non-linear associations between SNP pairs and traits given small sample sizes, and (4) how to control false positives? In this paper, we present a unified framework, called SPHINX, which addresses the aforementioned challenges. We first propose a piecewise linear model for interaction detection because it is simple enough to estimate model parameters given small sample sizes but complex enough to capture non-linear interaction effects. Then, based on the piecewise linear model, we introduce randomized group lasso under stability selection, and a screening algorithm to address the statistical and computational challenges mentioned above. In our experiments, we first demonstrate that SPHINX achieves better power than existing methods for interaction detection under false positive control. We further applied SPHINX to late-onset Alzheimer’s disease dataset, and report 16 SNPs and 17 SNP pairs associated with gene traits. We also present a highly scalable implementation of our screening algorithm which can screen

$$\sim $$

118 billion candidates of associations on a 60-node cluster in

$$<{}5.5$$

hours. SPHINX is available at http://www.cs.cmu.edu/

$$\sim $$

seunghak/SPHINX/.

Seunghak Lee, Aurélie Lozano, Prabhanjan Kambadur, Eric P. Xing

Exploration of Designability of Proteins Using Graph Features of Contact Maps: Beyond Lattice Models

Highly designable structures can be distinguished based on certain geometric graphical features of the interactions confirming the fact that the topology of a protein structure and its residue-residue interaction network are important determinants of its designability. The most designable structures and poorly designable structures obtained for sets of proteins having the same number of residues are compared, and it is shown that the most designable structures predicted by the graph features of the contact diagrams are more densely packed whereas the poorly designable structures are more open loop type structures or structures that are loosely packed. Interestingly enough, it can also be seen that these highly designable structures obtained are also common structural motifs found in nature.

Sumudu P. Leelananda, Robert L. Jernigan, Andrzej Kloczkowski

CoMEt: A Statistical Approach to Identify Combinations of Mutually Exclusive Alterations in Cancer

A major goal of large-scale cancer sequencing studies is to identify the genetic and epigenetic alterations that drive cancer development and to distinguish these events from random passenger mutations that have no consequence for cancer. Identifying driver mutations is a significant challenge due to the mutational heterogeneity of tumors: different combinations of somatic mutations drive different tumors, even those of the same cancer type.

Mark D. M. Leiserson, Hsin-Ta Wu, Fabio Vandin, Benjamin J. Raphael

Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters

Sparse linear models approximate target variable(s) by a sparse linear combination of input variables. The sparseness is realized through a regularization term. Since they are simple, fast, and able to select features, they are widely used in classification and regression. Essentially linear models are shallow feed-forward neural networks which have three limitations: (1) incompatibility to model non-linearity of features, (2) inability to learn high-level features, and (3) unnatural extensions to select features in multi-class case. Deep neural networks are models structured by multiple hidden layers with non-linear activation functions. Compared with linear models, they have two distinctive strengths: the capability to (1) model complex systems with non-linear structures, (2) learn high-level representation of features. Deep learning has been applied in many large and complex systems where deep models significantly outperform shallow ones. However, feature selection at the input level, which is very helpful to understand the nature of a complex system, is still not well-studied. In genome research, the

cis

-regulatory elements in non-coding DNA sequences play a key role in the expression of genes. Since the activity of regulatory elements involves highly interactive factors, a deep tool is strongly needed to discover informative features. In order to address the above limitations of shallow and deep models for selecting features of a complex system, we propose a deep feature selection model that (1) takes advantages of deep structures to model non-linearity and (2) conveniently selects a subset of features right at the input level for multi-class data. We applied this model to the identification of active enhancers and promoters by integrating multiple sources of genomic information. Results show that our model outperforms elastic net in terms of size of discriminative feature subset and classification accuracy.

Yifeng Li, Chih-Yu Chen, Wyeth W. Wasserman

Protein Contact Prediction by Integrating Joint Evolutionary Coupling Analysis and Supervised Learning

Residue-residue contacts play an important role in maintaining the native fold of a protein and guiding protein folding. However, contact prediction from sequence is very challenging, as indicated by CASP10 [1], which shows that long-range contact prediction accuracy on hard targets is only ~20%.

Jianzhu Ma, Sheng Wang, Zhiyong Wang, Jinbo Xu

ScaffMatch: Scaffolding Algorithm Based on Maximum Weight Matching

Next-generation sequencing (NGS) is a powerful technology as it can produce millions of short read pairs covering whole genome; however, a complete genome assembly remains challenging. Usually, assembled genome pieces (i.e., contigs) are merged into chains (i.e., scaffolds) using read pairs mapped to pairs of contigs. A recent comprehensive evaluation of available software shows that the scaffolding problem is still open [

1

]. In this paper we present a novel scaffolding tool ScaffMatch based on the maximum weight matching of pairs of reverse complement strands representing contigs and further filling the scaffold with skipped short contigs.

Igor Mandric, Alex Zelikovsky

A Symmetric Length-Aware Enrichment Test

Young et al. [

14

] showed that due to gene length bias the popular Fisher Exact Test should not be used to study the association between a group of differentially expressed (DE) genes and a specific Gene Ontology (GO) category. Instead they suggest a test where one conditions on the genes in the GO category and draws the pseudo DE expressed genes according to a length-dependent distribution. The same model was presented in a different context by Kazemian et al. who went on to offer a dynamic programming (DP) algorithm to exactly estimate the significance of the proposed test [

8

]. Here we point out that while valid, the test proposed by these authors is no longer symmetric as Fisher’s Exact Test is: one gets different answers if one conditions on the observed GO category than on the DE set. As an alternative we offer a symmetric generalization of Fisher’s Exact Test and provide efficient algorithms to evaluate its significance.

David Manescu, Uri Keich

Functional Alignment of Metabolic Networks

Network alignment has become a standard tool in comparative biology, allowing the inference of protein function, interaction and orthology. However, current alignment techniques are based on topological properties of networks and do not take into account their functional implications. Here we propose, for the first time, an algorithm to align two metabolic networks by taking advantage of their coupled metabolic models. These models allow us to assess the functional implications of genes or reactions, captured by the metabolic fluxes that are altered following their deletion from the network. Such implications may spread far beyond the region of the network where the gene or reaction lies. We apply our algorithm to align metabolic networks from various organisms, ranging from bacteria to humans, showing that our alignment can reveal functional orthology relations that are missed by conventional topological alignments.

Arnon Mazza, Allon Wagner, Eytan Ruppin, Roded Sharan

Joint Inference of Genome Structure and Content in Heterogeneous Tumor Samples

For a genomically unstable cancer, a single tumour biopsy will often contain a mixture of competing tumour clones. These tumour clones frequently differ with respect to their genomic content (copy number of each chromosome segment) and structure (order/adjacency of segments on tumour chromosomes). Whole genome sequencing mixes the signals of tumour clones and contaminating normal cells. The ability to unmix these signals and infer divergent genome structure and content is relevant to current avenues of cancer research. We propose a method to unmix tumour and contaminating normal signals and jointly predict genome structure and content of each tumour clone.

Andrew McPherson, Andrew Roth, Cedric Chauve, S. Cenk Sahinalp

Ultra-Large Alignments Using Ensembles of Hidden Markov Models

Many biological questions rely upon multiple sequence alignments (MSAs) and phylogenetic trees of large datasets. However, accurate MSA estimation is difficult for large datasets, especially when the dataset evolved under high rates of evolution or contains fragmentary sequences.

Nam-phuong Nguyen, Siavash Mirarab, Keerthana Kumar, Tandy Warnow

Topological Signatures for Population Admixture

As populations with multilinear transmission (i.e., mixing of genetic material from two parents, say) evolve over generations, the genetic transmission lines constitute complicated networks. In contrast, unilinear transmission leads to simpler network structures (trees). The genetic exchange in multilinear transmission is further influenced by migration, incubation, mixing and so on. The task we address in the paper is to tease apart subtle admixtures from the usual interrelationships of related populations. We present a combinatorial approach based on persistence in topology to detect admixture in populations. We show, based on controlled simulations, that topological characteristics have the potential for detecting subtle admixture in related populations. We then apply the technique successfully to a set of avocado germplasm data indicating that the approach has the potential for novel characterizations of relatedness in populations. We believe that this approach also has the potential for not only detecting but also discriminating ancient from recent admixture.

Laxmi Parida, Filippo Utro, Deniz Yorukoglu, Anna Paola Carrieri, David Kuhn, Saugata Basu

Haplotype Allele Frequency (HAF) Score: Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Adaptive Allele

Methods for detecting the genomic signatures of natural selection are heavily studied, and have been successful in identifying many selective sweeps. For the vast majority of these sweeps the adaptive allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. Because carriers of ongoing selective sweeps are likely to contain a future most recent common ancestor, identifying them may prove useful in predicting the evolutionary trajectory– for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the

Haplotype Allele Frequency (HAF) score

, assigned to individual haplotypes in a sample. The HAF score naturally captures many of the properties shared by haplotypes carrying an adaptive allele. We provide a theoretical model for the behavior of the HAF score under different evolutionary scenarios, and validate the interpretation of the statistic with simulated data. We develop an algorithm (

$$\text {PreCIOSS}$$

: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the adaptive allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft selective sweeps, as well as on data from well-known sweeps in human populations.

Roy Ronen, Glenn Tesler, Ali Akbari, Shay Zakov, Noah A. Rosenberg, Vineet Bafna

Gap Filling as Exact Path Length Problem

One of the last steps in a genome assembly project is filling the gaps between consecutive contigs in the scaffolds. This problem can be naturally stated as finding an

$$s$$

-

$$t$$

path in a directed graph whose sum of arc costs belongs to a given range (the estimate on the gap length). Here

$$s$$

and

$$t$$

are any two contigs flanking a gap. This problem is known to be NP-hard in general. Here we derive a simpler dynamic programming solution than already known, pseudo-polynomial in the maximum value of the input range. We implemented various practical optimizations to it, and compared our exact gap filling solution experimentally to popular gap filling tools. Summing over all the bacterial assemblies considered in our experiments, we can in total fill 28% more gaps than the best previous tool and the gaps filled by our method span 80% more sequence. Furthermore, the error level of the newly introduced sequence is comparable to that of the previous tools.

Leena Salmela, Kristoffer Sahlin, Veli Mäkinen, Alexandru I. Tomescu

Deconvolution of Ensemble Chromatin Interaction Data Reveals the Latent Mixing Structures in Cell Subpopulations

Chromosome conformation capture (3C) experiments provide a window into the spatial packing of a genome in three dimensions within the cell. This structure has been shown to be highly correlated with gene regulation, cancer mutations, and other genomic functions. However, 3C provides mixed measurements on a population of typically millions of cells, each with a different genome structure due to the fluidity of the genome and differing cell states. Here, we present several algorithms to deconvolve these measured 3C matrices into estimations of the contact matrices for each subpopulation of cells and relative densities of each subpopulation. We formulate the problem as that of choosing matrices and densities that minimize the Frobenius distance between the observed 3C matrix and the weighted sum of the estimated subpopulation matrices. Results on

HeLa

5C and mouse and bacteria Hi-C data demonstrate the methods’ effectiveness. We also show that domain boundaries from deconvolved matrices are often more enriched or depleted for regulatory chromatin markers when compared to boundaries from convolved matrices.

Emre Sefer, Geet Duggal, Carl Kingsford

A Fast and Exact Algorithm for the Exemplar Breakpoint Distance

A fundamental problem in comparative genomics is to compute the distance between two genomes. For two genomes without duplicate genes, we can easily compute a variety of distance measures in linear time, but the problem is NP-hard under most models when genomes contain duplicate genes. Sankoff proposed the use of exemplars to tackle the problem of duplicates genes and gene families: each gene family is represented by a single gene (the exemplar for that family), chosen so as to optimize some metric. Unfortunately, choosing exemplars is itself an NP-hard problem. In this paper, we propose a very fast and exact algorithm to compute the exemplar breakpoint distance, based on new insights in the underlying structure of genome rearrangements and exemplars. We evaluate the performance of our algorithm on simulation data and compare its performance to the best effort to date (a divide-and-conquer approach), showing that our algorithm runs much faster and scales much better. We also devise a new algorithm for the generalized breakpoint distance problem, which can then be applied to assign orthologs. We compare our algorithm with the state-of-the-art method MSOAR by assigning orthologs among five well annotated mammalian genomes, showing that our algorithm runs much faster and is slightly more accurate than MSOAR.

Mingfu Shao, Bernard M. E. Moret

Deciding When to Stop: Efficient Experimentation to Learn to Predict Drug-Target Interactions (Extended Abstract)

An active learning method for identifying drug-target interactions is presented which considers the interaction between multiple drugs and multiple targets at the same time. The goal of the proposed method is not simply to predict such interactions from experiments that have already been conducted, but to iteratively choose as few new experiments as possible to improve the accuracy of the predictive model. Kernelized Bayesian matrix factorization (KBMF) is used to model the interactions. We demonstrate on four previously characterized drug effect data sets that active learning driven experimentation using KBMF can result in highly accurate models while performing as few as 14% of the possible experiments, and more accurately than random sampling of an equivalent number. We also provide a method for estimating the accuracy of the current model based on the learning curve; and show how it can be used in practice to decide when to stop an active learning process.

Maja Temerinac-Ott, Armaghan W. Naik, Robert F. Murphy

On the Sample Complexity of Cancer Pathways Identification

In this work we propose a framework to analyze the sample complexity of problems that arise in the study of genomic datasets. Our framework is based on tools from combinatorial analysis and statistical learning theory that have been used for the analysis of machine learning and probably approximately correct (PAC) learning. We use our framework to analyze the problem of the identification of cancer pathways through mutual exclusivity analysis of mutations from large cancer sequencing studies. We analytically derive matching upper and lower bounds on the sample complexity of the problem, showing that sample sizes much larger than currently available may be required to identify all the cancer genes in a pathway. We also provide two algorithms to find a cancer pathway from a large genomic dataset. On simulated and cancer data, we show that our algorithms can be used to identify cancer pathways from large genomic datasets.

Fabio Vandin, Benjamin J. Raphael, Eli Upfal

A Novel Probabilistic Methodology for eQTL Analysis of Signaling Networks

Quantitative trait loci (QTL) studies of common diseases have been spectacularly successful in the last few years and became routine in medical genetics. These studies typically identify genetic variants (QTLs) that are associated with organismal phenotypes, such as susceptibility to disease. Genetic variants can underlie the abundance of gene transcripts, forming ‘expression QTLs’ (eQTLs; [1]). Despite the biomedical importance of understanding how such loci truly affect quantitative traits, several questions remain unsolved: What is the particular mechanism by which a genomic locus affects a quantitative trait? Which specific signaling pathways are responsible for propagating the inherited variation from an eQTL to the gene expression or physiological trait? On which component within such a pathway does the genetic variant act? While it is clear that genetic variants play a critical role in quantitative traits, it is still not fully understood how such variants lead to the inherited variation.

Roni Wilentzik, Irit Gat-Viks

Rapidly Registering Identity-by-Descent Across Ancestral Recombination Graphs

The genomes of remotely related individuals occasionally contain long segments that are Identical By Descent (IBD). Sharing of IBD segments has many applications in population and medical genetics, and it is thus desirable to study their properties in simulations. However, no current method provides a direct, efficient means to extract IBD segments from simulated genealogies. Here, we introduce computationally efficient approaches to extract ground-truth IBD segments from a sequence of genealogies, or equivalently, an ancestral recombination graph. Specifically, we use a two-step scheme, where we first identify putative shared segments by comparing the common ancestors of all pairs of individuals at some distance apart. This reduces the search space considerably, and we then proceed by determining the true IBD status of the candidate segments. Under some assumptions and when allowing a limited resolution of segment lengths, our run-time complexity is reduced from

$$O(n^3\log n)$$

for the naïve algorithm to

$$O(n\log n)$$

, where

$$n$$

is the number of individuals in the sample.

Shuo Yang, Shai Carmi, Itsik Pe’er

Computational Protein Design Using AND/OR Branch-and-Bound Search

The computation of the global minimum energy conformation (GMEC) is an important and challenging topic in structure-based computational protein design. In this paper, we propose a new protein design algorithm based on the AND/OR branch-and-bound (AOBB) search, which is a variant of the traditional branch-and-bound search algorithm, to solve this combinatorial optimization problem. By integrating with a powerful heuristic function, AOBB is able to fully exploit the graph structure of the underlying residue interaction network of a backbone template to significantly accelerate the design process. Tests on real protein data show that our new protein design algorithm is able to solve many problems that were previously unsolvable by the traditional exact search algorithms, and for the problems that can be solved with traditional provable algorithms, our new method can provide a large speedup by several orders of magnitude while still guaranteeing to find the global minimum energy conformation (GMEC) solution.

Yichao Zhou, Yuexin Wu, Jianyang Zeng

Springer Professional

About this book

Table of Contents

Frontmatter

Efficient Alignment Free Sequence Comparison with Bounded Mismatches

DockStar: A Novel ILP Based Integrative Method for Structural Modelling of Multimolecular Protein Complexes (Extended Abstract)

CRISPR Detection from Short Reads Using Partial Overlap Graphs

HapTree-X: An Integrative Bayesian Framework for Haplotype Reconstruction from Transcriptome and Genome Sequencing Data

Read Clouds Uncover Variation in Complex Regions of the Human Genome

Learning Microbial Interaction Networks from Metagenomic Count Data

Immunoglobulin Classification Using the Colored Antibody Graph

CIDANE: Comprehensive Isoform Discovery and Abundance Estimation

Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks

Fragmentation Trees Reloaded

KGSrna: Efficient 3D Kinematics-Based Sampling for Nucleic Acids

Locating a Tree in a Phylogenetic Network in Quadratic Time

Constructing Structure Ensembles of Intrinsically Disordered Proteins from Chemical Shift Data

Comets (Constrained Optimization of Multistate Energies by Tree Search): A Provable and Efficient Algorithm to Optimize Binding Affinity and Specificity with Respect to Sequence

Efficient and Accurate Multiple-Phenotypes Regression Method for High Dimensional Data Considering Population Structure

BWM*: A Novel, Provable, Ensemble-Based Dynamic Programming Algorithm for Sparse Approximations of Computational Protein Design

An Efficient Nonlinear Regression Approach for Genome-Wide Detection of Marginal and Interacting Genetic Variations

Exploration of Designability of Proteins Using Graph Features of Contact Maps: Beyond Lattice Models

CoMEt: A Statistical Approach to Identify Combinations of Mutually Exclusive Alterations in Cancer

Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters

Protein Contact Prediction by Integrating Joint Evolutionary Coupling Analysis and Supervised Learning

ScaffMatch: Scaffolding Algorithm Based on Maximum Weight Matching

A Symmetric Length-Aware Enrichment Test

Functional Alignment of Metabolic Networks

Joint Inference of Genome Structure and Content in Heterogeneous Tumor Samples

Ultra-Large Alignments Using Ensembles of Hidden Markov Models

Topological Signatures for Population Admixture

Haplotype Allele Frequency (HAF) Score: Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Adaptive Allele

Gap Filling as Exact Path Length Problem

Deconvolution of Ensemble Chromatin Interaction Data Reveals the Latent Mixing Structures in Cell Subpopulations

A Fast and Exact Algorithm for the Exemplar Breakpoint Distance

Deciding When to Stop: Efficient Experimentation to Learn to Predict Drug-Target Interactions (Extended Abstract)

On the Sample Complexity of Cancer Pathways Identification

A Novel Probabilistic Methodology for eQTL Analysis of Signaling Networks

Rapidly Registering Identity-by-Descent Across Ancestral Recombination Graphs

Computational Protein Design Using AND/OR Branch-and-Bound Search

Backmatter

Premium Partner