2013 | Book

Research in Computational Molecular Biology

17th Annual International Conference, RECOMB 2013, Beijing, China, April 7-10, 2013. Proceedings

Editors: Minghua Deng, Rui Jiang, Fengzhu Sun, Xuegong Zhang

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 17th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2013, held in Beijing, China, in April 2013. The 32 revised full papers were carefully reviewed and selected from 167 submissions. The papers cover a wide range of topics, including molecular sequence analysis; genes and regulatory elements; molecular evolution; gene expression; biological networks; sequencing and genotyping technologies; genomics; epigenomics; metagenomics; population and statistical genetics; systems biology; computational proteomics; computational structural biology; imaging; and large-scale data management.

Table of Contents

Frontmatter
Reconciliation Revisited: Handling Multiple Optima When Reconciling with Duplication, Transfer, and Loss

Phylogenetic tree reconciliation is a powerful approach for inferring evolutionary events like gene duplication, horizontal gene transfer, and gene loss, which are fundamental to our understanding of molecular evolution. While Duplication-Loss (DL) reconciliation leads to a unique maximum-parsimony solution, Duplication-Transfer-Loss (DTL) reconciliation yields a multitude of optimal solutions, making it difficult to infer the true evolutionary history of the gene family.

Here, we present an effective, efficient, and scalable method for dealing with this fundamental problem in DTL reconciliation. Our approach works by sampling the space of optimal reconciliations uniformly at random and aggregating the results. We present an algorithm to efficiently sample the space of optimal reconciliations uniformly at random in O(mn²) time, where m and n denote the number of genes and species, respectively. We use these samples to understand how different optimal reconciliations vary in their node mapping and event assignments, and to investigate the impact of varying event costs.
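The sampling strategy described above (count the optimal solutions of a dynamic program, then trace back choosing each predecessor with probability proportional to its count) can be illustrated on a toy problem. The sketch below is a generic illustration of uniform sampling over a DP's optima, not the authors' DTL reconciliation algorithm; it draws uniformly among all minimum-cost monotone paths in a grid:

```python
import random

def sample_uniform_optimum(costs):
    """Draw one minimum-cost monotone path through a cost grid, uniformly at
    random among all optimal paths: a forward pass counts optima per cell,
    a backward pass samples predecessors proportionally to those counts.
    (Toy illustration, not the DTL reconciliation sampler.)"""
    n, m = len(costs), len(costs[0])
    best = [[0] * m for _ in range(n)]  # minimum cost to reach each cell
    cnt = [[0] * m for _ in range(n)]   # number of optimal paths to each cell
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best[i][j], cnt[i][j] = costs[0][0], 1
                continue
            preds = ([(i - 1, j)] if i else []) + ([(i, j - 1)] if j else [])
            m_best = min(best[p][q] for p, q in preds)
            best[i][j] = costs[i][j] + m_best
            cnt[i][j] = sum(cnt[p][q] for p, q in preds if best[p][q] == m_best)
    path, (i, j) = [(n - 1, m - 1)], (n - 1, m - 1)
    while (i, j) != (0, 0):
        preds = ([(i - 1, j)] if i else []) + ([(i, j - 1)] if j else [])
        m_best = min(best[p][q] for p, q in preds)
        opts = [(p, q) for p, q in preds if best[p][q] == m_best]
        i, j = random.choices(opts, weights=[cnt[p][q] for p, q in opts])[0]
        path.append((i, j))
    return path[::-1]
```

Aggregating many such samples estimates how often each cell appears among the optima, which is the same idea as aggregating node mappings and event assignments across sampled reconciliations.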

Mukul S. Bansal, Eric J. Alm, Manolis Kellis
SEME: A Fast Mapper of Illumina Sequencing Reads with Statistical Evaluation

Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, the sequencing reads of the Illumina platform have grown longer and their quality scores higher. According to our calculations, this allows a perfect k-mer seed match for almost all reads when a close reference genome is available, subject to reasonable specificity. Another observation is that the majority of reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast mapping approach, referred to as “SEME”, which has two core steps: first it scans a read sequentially in a specific order for a k-mer exact-match seed; next it extends the alignment on both sides, allowing at most one short INDEL on each, using a novel “auto-match function” method. We decompose the evaluation of sensitivity and specificity into two parts corresponding to the seed and extension steps, and the composite result provides an approximate overall reliability estimate for each mapping. We compare SEME with existing mapping methods on several data sets, and SEME shows better performance in terms of both running time and mapping rates.
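The seed step can be sketched in a few lines: hash every k-mer of the reference, then scan the read for the first exact seed and derive candidate alignment start positions. Function names and the left-to-right scan order are invented for illustration; this is the generic seed-and-extend idea, not SEME's implementation:

```python
def build_index(ref, k):
    """Hash every k-mer of the reference to its list of start positions."""
    idx = {}
    for i in range(len(ref) - k + 1):
        idx.setdefault(ref[i:i + k], []).append(i)
    return idx

def seed_candidates(read, idx, k):
    """Scan the read for the first exact k-mer match and return the
    implied alignment start positions on the reference."""
    for off in range(len(read) - k + 1):
        hits = idx.get(read[off:off + k])
        if hits:
            return [p - off for p in hits]
    return []
```

Each candidate position would then be handed to the extension step, which allows at most one short INDEL on each side of the seed.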

Shijian Chen, Anqi Wang, Lei M. Li
Dissecting Cancer Heterogeneity with a Probabilistic Genotype-Phenotype Model

Developing an approach to model the heterogeneity of cancer has emerged as an urgent need in cancer studies. To address this challenge, we propose an approach for probabilistic modeling of cancer. Starting with the assumption that each cancer case should be considered as a mixture of cancer subtypes, our model links phenotypic similarities with putative causes. Specifically, building on the idea of a topic model [1], our approach is based on two components: (i) a measure of phenotypic similarity between patients, and (ii) a list of features, such as mutations, copy number variation, and microRNA levels, to be used as proposed explanations. The main idea is to define (probabilistic) disease subtypes and, for each patient, identify the mixture of subtypes that best explains the patient similarity network. Our approach does not assume predefined subtypes, nor does it assume that such subtypes have to be uniquely defined. That is, we do not assume that there exists “the” disease subtype model; rather, we consider a distribution of such models, providing a probabilistic context.

Dong-Yeon Cho, Teresa M. Przytycka
eALPS: Estimating Abundance Levels in Pooled Sequencing Using Available Genotyping Data

The recent advances in high-throughput sequencing technologies bring the potential of a better characterization of the genetic variation in humans and other organisms. On many occasions, either by design or by necessity, the sequencing procedure is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario occurs naturally in the case of metagenomics analysis, where a pool of bacteria is sequenced, or in the case of population studies involving DNA pools by design. In particular, various pooling designs were recently suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects.

A fundamental problem with such approaches for population studies is that the uncertainty of DNA proportions from different individuals in the pools might lead to spurious associations. Fortunately, it is often the case that the genotype data of at least some of the individuals in the pool is known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data in order to accurately estimate the proportions of the samples in the pool, even in cases where not all individuals in the pool were genotyped (eALPS-LD). Using real data from a sequencing pooling study of Non-Hodgkin’s Lymphoma, we demonstrate that the estimation of the proportions is crucial, since otherwise there is a risk of false discoveries. Additionally, we demonstrate that our approach is also applicable to the problem of quantification of species in metagenomics samples (eALPS-BCR), and is particularly suitable for metagenomic quantification of closely-related species.
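The estimation task has a simple least-squares analogue: given the genotypes of the pooled individuals and the allele frequencies observed in the pooled reads, find mixture proportions on the simplex that reconcile the two. The sketch below uses projected gradient descent and is only a toy analogue of this problem setup, not the eALPS estimator:

```python
import numpy as np

def estimate_proportions(G, f, iters=5000, lr=0.2):
    """Find simplex weights w such that the weighted genotype frequencies
    G.T @ w (G: individuals x SNPs, allele fractions in [0, 1]) match the
    observed pooled frequencies f, by projected gradient descent on the MSE.
    (Toy analogue, not the eALPS model.)"""
    n, m = G.shape
    w = np.full(n, 1.0 / n)                    # start from equal proportions
    for _ in range(iters):
        grad = 2.0 * G @ (G.T @ w - f) / m
        w = np.clip(w - lr * grad, 0.0, None)  # keep proportions nonnegative
        w /= w.sum()                           # and summing to one
    return w
```

With enough genotyped SNPs the weights are identifiable, which is the intuition behind why combining genotype and pooled sequence data pins down the proportions.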

Itamar Eskin, Farhad Hormozdiari, Lucia Conde, Jacques Riby, Chris Skibola, Eleazar Eskin, Eran Halperin
Analysis of Metabolic Evolution in Bacteria Using Whole-Genome Metabolic Models

Recent advances in the automation of metabolic model reconstruction have led to the availability of draft-quality metabolic models (predicted reaction complements) for multiple bacterial species. These reaction complements can be considered as trait representations and can be used for ancestral state reconstruction, to infer the most likely metabolic complements of common ancestors of all bacteria with generated metabolic models. We present here an ancestral state reconstruction for 141 extant bacteria and analyse the reaction gains and losses for these bacteria with respect to their lifestyles and pathogenic nature. A simulated annealing approach is used to look at coordinated metabolic gains and losses in two bacteria. The main losses of Onion yellows phytoplasma OY-M, an obligate intracellular pathogen, are shown (as expected) to be in cell wall biosynthesis. The metabolic gains made by Clostridium difficile CD196 in adapting to its current habitat in the human colon are also analysed. Our analysis shows that the capability to utilize N-Acetyl-neuraminic acid as a carbon source has been gained, rather than having been present in the Clostridium ancestor, as has the capability to synthesise phthiocerol dimycocerosate, which could potentially aid the evasion of the host immune response. We have shown that the availability of large numbers of metabolic models, along with conventional approaches, has enabled a systematic method to analyse metabolic evolution in the bacterial domain.

Ali A. Faruqi, William A. Bryant, John W. Pinney
Detecting Protein Conformational Changes in Interactions via Scaling Known Structures

Conformational changes frequently occur when proteins interact with other proteins. How to detect such changes in silico is a major problem. Existing methods for docking with conformational changes remain time-consuming, and they solve the problem accurately for only a small portion of protein-protein complexes. This work presents a more accurate method (FlexDoBi) for docking with conformational changes. FlexDoBi generates the possible conformational changes of the interface residues that transform the proteins from their unbound states to bound states. Based on the generated conformational changes, multi-dimensional scaling is performed to construct candidates for the bound structure. We develop new energy terms for determining the orientation of proteins and selecting plausible conformational changes. Experimental results illustrate that FlexDoBi achieves better results than other methods for the same purpose. On 20 complexes, we obtained an average iRMSD of 1.55Å, which compares favorably with the average iRMSD of 1.94Å in the predictions from FiberDock. Compared with ZDOCK, our results are 0.35Å lower in average iRMSD on the medium-difficulty group, and 0.81Å lower on the difficult group.
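The multidimensional scaling step mentioned above has a standard closed form: double-center the squared distance matrix and take the top eigenvectors. The sketch below shows classical MDS in general, not FlexDoBi's use of it:

```python
import numpy as np

def classical_mds(D, dim=3):
    """Recover point coordinates (up to rotation and translation) from a
    matrix of pairwise Euclidean distances D via classical MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dim]  # keep the largest eigenvalues
    scale = np.sqrt(np.clip(vals[order], 0.0, None))
    return vecs[:, order] * scale
```

Given candidate inter-residue distances, such an embedding yields 3D coordinates whose pairwise distances reproduce the inputs, which is why MDS is a natural tool for constructing candidate bound structures.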

Fei Guo, Shuai Cheng Li, Wenji Ma, Lusheng Wang
IPED: Inheritance Path Based Pedigree Reconstruction Algorithm Using Genotype Data

The problem of inference of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. Current methods, unfortunately, are very time consuming and inaccurate for complicated pedigrees such as pedigrees with inbreeding. In this work, we propose an efficient algorithm which is able to reconstruct large pedigrees with reasonable accuracy. Our algorithm reconstructs the pedigrees generation by generation backwards in time from the extant generation. We predict the relationships between individuals in the same generation using an inheritance path based approach implemented using an efficient dynamic programming algorithm. Experiments show that our algorithm runs in linear time with respect to the number of reconstructed generations and therefore it can reconstruct pedigrees which have a large number of generations. Indeed it is the first practical method for reconstruction of large pedigrees from genotype data.

Dan He, Zhanyong Wang, Buhm Han, Laxmi Parida, Eleazar Eskin
An Optimal Algorithm for Building the Majority Rule Consensus Tree

A deterministic algorithm for building the majority rule consensus tree of an input collection of conflicting phylogenetic trees with identical leaf labels is presented. Its worst-case running time is O(nk), where n is the size of the leaf label set and k is the number of input phylogenetic trees. This is optimal since the input size is Ω(nk). Experimental results show that the algorithm is fast in practice.
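The object being computed can be stated in a few lines: a cluster (the set of leaves under an internal node) belongs to the majority rule consensus tree exactly when it occurs in more than half of the input trees. The sketch below counts clusters with a hash table; it conveys the definition rather than the paper's optimal O(nk) algorithm:

```python
from collections import Counter

def majority_clusters(trees):
    """Return the clusters occurring in more than half of the input trees.
    Trees are nested tuples; leaves are strings. (Definition-level sketch,
    not the optimal algorithm.)"""
    def collect(tree, out):
        if isinstance(tree, str):
            return frozenset([tree])
        c = frozenset().union(*(collect(ch, out) for ch in tree))
        out.add(c)
        return c
    counts = Counter()
    for t in trees:
        clusters = set()
        collect(t, clusters)
        counts.update(clusters)
    return {c for c, k in counts.items() if k > len(trees) / 2}
```

A classical fact makes the definition well-formed: majority clusters are pairwise compatible, so they always assemble into a single tree.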

Jesper Jansson, Chuanqi Shen, Wing-Kin Sung
UniNovo: A Universal Tool for de Novo Peptide Sequencing

Mass spectrometry (MS) instruments and experimental protocols are rapidly advancing, but de novo peptide sequencing algorithms to analyze tandem mass (MS/MS) spectra are lagging behind. While existing de novo sequencing tools perform well on certain types of spectra (e.g., Collision Induced Dissociation (CID) spectra of tryptic peptides), their performance often deteriorates on other types of spectra, such as Electron Transfer Dissociation (ETD) or Higher-energy Collisional Dissociation (HCD) spectra, or spectra of non-tryptic digests. Thus, rather than developing a new algorithm for each type of spectra, we develop a universal de novo sequencing algorithm called UniNovo that works well for all types of spectra or even for spectral pairs (e.g., CID/ETD spectral pairs). The performance of UniNovo is compared with PepNovo+, PEAKS, and pNovo using various types of spectra. The results show that the performance of UniNovo is superior to the other tools for ETD spectra and superior or comparable for CID and HCD spectra. UniNovo also estimates the probability that each reported reconstruction is correct, using simple statistics that are readily obtained from a small training dataset. We demonstrate that the estimation is accurate for all tested types of spectra (including CID, HCD, ETD, CID/ETD, and HCD/ETD spectra of trypsin, LysC, or AspN digested peptides). The appendix is available online at http://proteomics.ucsd.edu/Software/UniNovo.html

Kyowon Jeong, Sangtae Kim, Pavel A. Pevzner
Efficiently Identifying Significant Associations in Genome-Wide Association Studies

Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome which harbor variation affecting gene expression, or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies tens of thousands of gene expression levels are measured and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the SNPs. In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions which may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to state of the art testing approaches by a factor of 75.
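The two-stage idea can be sketched generically: test a sparse set of proxy SNPs first, then test all SNPs only in the neighborhoods of promising proxies. The stride, window, and thresholds below are invented for illustration, and `pvalue` stands in for whatever association test is used; this is not the paper's exact procedure:

```python
def two_stage_scan(pvalue, n_snps, stride=10, window=10,
                   alpha1=0.05, alpha2=5e-8):
    """Stage 1: test every `stride`-th SNP as a proxy. Stage 2: for each
    proxy with p < alpha1, test all SNPs within `window` of it. Returns
    the genome-wide significant SNPs and the number of tests performed.
    (Generic illustration; parameters are invented.)"""
    tested = {i: pvalue(i) for i in range(0, n_snps, stride)}
    hits = set()
    for i in list(tested):               # iterate over the original proxies
        if tested[i] < alpha1:
            for j in range(max(0, i - window), min(n_snps, i + window + 1)):
                if j not in tested:
                    tested[j] = pvalue(j)
                if tested[j] < alpha2:
                    hits.add(j)
    return hits, len(tested)
```

The savings come from the second return value: only a fraction of the SNPs are ever tested, while significant SNPs are still recovered as long as nearby proxies show some signal.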

Emrah Kostem, Eleazar Eskin
Identification of Ultramodified Proteins Using Top-Down Spectra

Post-translational modifications (PTMs) play an important role in various biological processes through changing protein structure and function. Some ultramodified proteins (like histones) have multiple PTMs forming PTM patterns that define the functionality of a protein. While bottom-up mass spectrometry (MS) has been successful in identifying individual PTMs within short peptides, it is unable to identify PTM patterns spread along entire proteins in a coordinated fashion. In contrast, top-down MS analyzes intact proteins and reveals PTM patterns along entire proteins. However, while recent advances in instrumentation have made top-down MS accessible to many laboratories, most computational tools for top-down MS focus on proteins with few PTMs and are unable to identify complex PTM patterns. We propose a new algorithm, MS-Align-E, that identifies both expected and unexpected PTMs in ultramodified proteins. We demonstrate that MS-Align-E identifies many protein forms of histone H4 and benchmark it against currently accepted software tools.

Xiaowen Liu, Shawna Hengel, Si Wu, Nikola Tolić, Ljiljana Pasa-Tolić, Pavel A. Pevzner
Distinguishing between Genomic Regions Bound by Paralogous Transcription Factors

Transcription factors (TFs) regulate gene expression by binding to specific DNA sites in cis-regulatory regions of genes. Most eukaryotic TFs are members of protein families that share a common DNA binding domain and often recognize highly similar DNA sequences. Currently, it is not well understood why closely related TFs are able to bind different genomic regions in vivo, despite having the potential to interact with the same DNA sites. Here, we use the Myc/Max/Mad family as a model system to investigate whether interactions with additional proteins (co-factors) can explain why paralogous TFs with highly similar DNA binding preferences interact with different genomic sites in vivo. We use a classification approach to distinguish between targets of c-Myc versus Mad2, using features that reflect the DNA binding specificities of putative co-factors. When applied to c-Myc/Mad2 DNA binding data, our algorithm can distinguish between genomic regions bound uniquely by c-Myc versus Mad2 with 87% accuracy.

Alina Munteanu, Raluca Gordân
Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads

Recent advances in single-cell genomics provide an alternative to gene-centric metagenomics studies, enabling whole genome sequencing of uncultivated bacteria. However, single-cell assembly projects are challenging due to (i) the highly non-uniform read coverage, and (ii) a greatly elevated number of chimeric reads and read pairs. While recently developed single-cell assemblers have addressed the former challenge, methods for assembling highly chimeric reads remain poorly explored. We present algorithms for identifying chimeric edges and resolving complex bulges in de Bruijn graphs, which significantly improve single-cell assemblies. We further describe applications of the single-cell assembler SPAdes to a new approach for capturing and sequencing “dark matter of life” that forms small pools of randomly selected single cells (called a mini-metagenome) and further sequences all genomes from the mini-metagenome at once. We demonstrate that SPAdes enables sequencing mini-metagenomes and benchmark it against various assemblers. On single-cell bacterial datasets, SPAdes improves on the recently developed E+V-SC and IDBA-UD assemblers specifically designed for single-cell sequencing. For standard (multicell) datasets, SPAdes also improves on A5, ABySS, CLC, EULER-SR, Ray, SOAPdenovo, and Velvet.
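For readers unfamiliar with the data structure, a de Bruijn graph can be built from reads in a few lines, with each k-mer an edge weighted by its coverage; chimeric and erroneous edges tend to show up with low coverage. The threshold rule below is a deliberate caricature; the chimeric-edge and bulge-resolution procedures in SPAdes are far more involved:

```python
from collections import defaultdict

def kmer_coverage(reads, k):
    """Count each k-mer (an edge between two (k-1)-mer nodes) across reads."""
    cov = defaultdict(int)
    for r in reads:
        for i in range(len(r) - k + 1):
            cov[r[i:i + k]] += 1
    return dict(cov)

def drop_weak_edges(cov, min_cov=2):
    """Remove edges below a coverage threshold (naive chimera filter)."""
    return {kmer: c for kmer, c in cov.items() if c >= min_cov}
```

A plain coverage cutoff like this fails exactly in the single-cell setting the abstract describes, where coverage is highly non-uniform, which is why dedicated chimeric-edge detection is needed.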

Sergey Nurk, Anton Bankevich, Dmitry Antipov, Alexey Gurevich, Anton Korobeynikov, Alla Lapidus, Andrey Prjibelsky, Alexey Pyshkin, Alexander Sirotkin, Yakov Sirotkin, Ramunas Stepanauskas, Jeffrey McLean, Roger Lasken, Scott R. Clingenpeel, Tanja Woyke, Glenn Tesler, Max A. Alekseyev, Pavel A. Pevzner
Inferring Intra-tumor Heterogeneity from High-Throughput DNA Sequencing Data

Cancer is a disease driven in part by somatic mutations that accumulate during the lifetime of an individual. The clonal theory [1] posits that the cancerous cells in a tumor are descended from a single founder cell and that descendants of this cell acquired multiple mutations beneficial for tumor growth through rounds of selection and clonal expansion. A tumor is thus a heterogeneous population of cells, with different subpopulations of cells containing both clonal mutations from the founder cell or early rounds of clonal expansion, and subclonal mutations that occurred after the most recent clonal expansion. Most cancer sequencing projects sequence a mixture of cells from a tumor sample, including admixture by normal (non-cancerous) cells and different subpopulations of cancerous cells. In addition, most solid tumors exhibit extensive aneuploidy and copy number aberrations. Intra-tumor heterogeneity and aneuploidy conspire to complicate analysis of somatic mutations in sequenced tumor samples.

Layla Oesper, Ahmad Mahmoody, Benjamin J. Raphael
NP-MuScL: Unsupervised Global Prediction of Interaction Networks from Multiple Data Sources

Inference of gene interaction networks from expression data usually focuses on either supervised or unsupervised edge prediction from a single data source. However, in many real world applications, multiple data sources, such as microarray and ISH measurements of mRNA abundances, are available to offer multi-view information about the same set of genes. We propose NP-MuScL (nonparanormal multi-source learning) to estimate a gene interaction network that is consistent with such multiple data sources, which are expected to reflect the same underlying relationships between the genes. NP-MuScL casts the network estimation problem as estimating the structure of a sparse undirected graphical model. We use the semiparametric Gaussian copula to model the distribution of the different data sources, with the different copulas sharing the same precision (i.e., inverse covariance) matrix, and we present an efficient algorithm to estimate such a model in the high dimensional scenario. Results are reported on synthetic data, where NP-MuScL outperforms baseline algorithms significantly, even in the presence of noisy data sources. Experiments are also run on two real-world scenarios: two yeast microarray data sets, and three Drosophila embryonic gene expression data sets, where NP-MuScL predicts a higher number of known gene interactions than existing techniques.
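The copula idea above rests on a simple transform: replace each variable by the Gaussian scores of its empirical ranks, after which standard Gaussian graphical model machinery (such as estimating a shared precision matrix across sources) applies. A minimal sketch of that transform, using only NumPy plus the standard library's inverse normal CDF; this is not the NP-MuScL estimator itself:

```python
import numpy as np
from statistics import NormalDist

def nonparanormal_scores(X):
    """Map each column of X (samples x genes) to Gaussian scores via its
    empirical ranks: the semiparametric Gaussian copula transform.
    Assumes continuous data (no ties within a column)."""
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1  # 1..n per column
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(ranks / (n + 1.0))
```

Because the transform is monotone in each column, rank correlations between genes are preserved while arbitrary marginal distributions are mapped to Gaussian ones.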

Kriti Puniyani, Eric P. Xing
High Resolution Modeling of Chromatin Interactions

Sprout is a novel generative model for ChIA-PET data that characterizes physical chromatin interactions and points of contact at high spatial resolution. Sprout improves upon other methods by learning empirical distributions for pairs of reads that reflect ligation events between genomic locations that are bound by a protein of interest. Using these learned empirical distributions, Sprout is able to accurately position interaction anchors, infer whether read pairs were created by self-ligation or inter-ligation, and accurately assign read pairs to anchors, which allows for the identification of high-confidence interactions. When Sprout is run on CTCF ChIA-PET data, it identifies more interaction anchors that are supported by CTCF motif matches than other approaches, with competitive positional accuracy. Sprout rejects interaction events that are not supported by pairs of reads that fit the empirical model for inter-ligation read pairs, producing a set of interactions that are more consistent across CTCF biological replicates than established methods.

Christopher Reeder, David Gifford
A Linear Inside-Outside Algorithm for Correcting Sequencing Errors in Structured RNAs

Analysis of the sequence-structure relationship in RNA molecules is essential to evolutionary studies, but also to concrete applications such as error-correction methodologies in sequencing technologies. The prohibitive sizes of the mutational and conformational landscapes, combined with the volume of data to process, require efficient algorithms to compute sequence-structure properties. More specifically, here we aim to calculate which mutations most increase the likelihood of a sequence with respect to a given structure and RNA family.

In this paper, we introduce RNApyro, an efficient linear-time and space inside-outside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base pair energies with novel isostericity scales, and apply our techniques to correct point-wise errors in 5S rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS error-correction pipeline.

Vladimir Reinharz, Yann Ponty, Jérôme Waldispühl
An Accurate Method for Inferring Relatedness in Large Datasets of Unphased Genotypes via an Embedded Likelihood-Ratio Test

Studies that map disease genes rely on accurate annotations that indicate whether individuals in the studied cohorts are related to each other or not. For example, in genome-wide association studies, the cohort members are assumed to be unrelated to one another. Investigators can correct for individuals in a cohort with previously-unknown shared familial descent by detecting genomic segments that are shared between them, which are considered to be identical by descent (IBD). Alternatively, elevated frequencies of IBD segments near a particular locus among affected individuals can be indicative of a disease-associated gene. As genotyping studies grow to use increasingly large sample sizes and meta-analyses begin to include many data sets, accurate and efficient detection of hidden relatedness becomes a challenge. To enable disease-mapping studies of increasingly large cohorts, a fast and accurate method to detect IBD segments is required.

We present PARENTE, a novel method for detecting related pairs of individuals and shared haplotypic segments within these pairs. PARENTE is a computationally-efficient method based on an embedded likelihood ratio test. As demonstrated by the results of our simulations, our method exhibits better accuracy than the current state of the art, and can be used for the analysis of large genotyped cohorts. PARENTE’s higher accuracy becomes even more significant in more challenging scenarios, such as detecting shorter IBD segments or when an extremely low false-positive rate is required. PARENTE is publicly and freely available at http://parente.stanford.edu/.

Jesse M. Rodriguez, Serafim Batzoglou, Sivan Bercovici
Learning Natural Selection from the Site Frequency Spectrum

Genetic adaptation to external stimuli occurs through the combined action of mutation and selection. A central problem in genetics is to identify loci responsive to specific selective pressures. Over the last two decades, many tests have been proposed to identify genomic signatures of natural selection. However, the power of these tests changes unpredictably from one dataset to another, with no single dominant method. We build upon recent work that connects many of these tests in a common framework, by describing how positive selection strongly impacts the observed site frequency spectrum (SFS). Many of the proposed tests quantify the skew in the SFS to predict selection. Here, we show that the skew depends on many parameters, including the selection coefficient and the time since selection. Moreover, for each of the different regimes of positive selection, informative features of the scaled SFS can be learned from simulated data and applied to population-scale variation data. Using support vector machines, we develop a test that is effective over all selection regimes. On simulated datasets, our test outperforms existing ones over the entire parameter space. We apply our test to variation data from Drosophila melanogaster populations adapted to hypoxia, and identify new loci that were missed by previous approaches, but strengthen the role of the Notch pathway in hypoxia tolerance.
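The central statistic here, the site frequency spectrum, is straightforward to compute from a haplotype matrix, and vectors like the one below are the kind of feature a classifier over selection regimes would consume. A minimal sketch of the unfolded SFS (the paper's scaled features are richer):

```python
import numpy as np

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS from a 0/1 haplotype matrix (rows: chromosomes,
    columns: sites): entry i-1 counts the sites where exactly i
    chromosomes carry the derived allele (fixed classes dropped)."""
    n = haplotypes.shape[0]
    derived = haplotypes.sum(axis=0)
    return np.bincount(derived, minlength=n + 1)[1:n]
```

Under neutrality the expected SFS decays roughly as 1/i; a recent selective sweep depletes intermediate frequencies and inflates the extreme classes, and it is this skew that the tests quantify.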

Roy Ronen, Nitin Udpa, Eran Halperin, Vineet Bafna
Considering Unknown Unknowns - Reconstruction of Non-confoundable Causal Relations in Biological Networks

Our current understanding of cellular networks is rather incomplete. We miss important but so far unknown genes and mechanisms in the pathways. Moreover, we often only have a partial account of the molecular interactions and modifications of the known players. When analyzing the cell, we look through narrow windows, leaving potentially important events in blind spots. Network reconstruction is naturally confined to what we have observed. Little is known about how the incompleteness of our observations confounds our interpretation of the available data.

Here we ask the question, which features of a network can be confounded by incomplete observations and which cannot. In the context of nested effects models, we show that in the presence of missing observations or hidden factors a reliable reconstruction of the full network is not feasible. Nevertheless, we can show that certain characteristics of signaling networks like the existence of cross talk between certain branches of the network can be inferred in a non-confoundable way. We derive a test for inferring such non-confoundable characteristics of signaling networks. Next, we introduce a new data structure to represent partially reconstructed signaling networks. Finally, we evaluate our method both on simulated data and in the context of a study on early stem cell differentiation in mice.

Mohammad Javad Sadeh, Giusi Moffa, Rainer Spang
Inference of Tumor Phylogenies with Improved Somatic Mutation Discovery

Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple related samples over the accuracy of GATK’s Unified Genotyper, the state of the art multisample SNV caller.

Raheleh Salari, Syed Shayon Saleh, Dorna Kashef-Haghighi, David Khavari, Daniel E. Newburger, Robert B. West, Arend Sidow, Serafim Batzoglou
Abstract: Using the Fast Fourier Transform to Accelerate the Computational Search for RNA Conformational Switches

We describe the broad outline of a new thermodynamics-based algorithm, FFTbor, that uses the fast Fourier transform to perform polynomial interpolation to compute the Boltzmann probability that secondary structures differ by k base pairs from an arbitrary reference structure of a given RNA sequence. The algorithm, which runs in quartic time O(n⁴) and quadratic space O(n²), is used to determine the correlation between kinetic folding speed and the ruggedness of the energy landscape, and to predict the location of riboswitch expression platform candidates. The full paper appears in PLoS ONE (19 Dec 2012). A web server is available at http://bioinformatics.bc.edu/clotelab/FFTbor/.
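The FFT trick referenced above is standard polynomial interpolation: if a polynomial with fewer than N coefficients can be evaluated at the N-th roots of unity, one FFT recovers all of its coefficients exactly. A generic sketch (the polynomial FFTbor actually evaluates comes from its dynamic program; here `poly_at` is any user-supplied evaluator):

```python
import numpy as np

def coeffs_from_evaluations(poly_at, N):
    """Recover the coefficients of a polynomial of degree < N from its
    values at the N-th roots of unity, via a single FFT."""
    roots = np.exp(2j * np.pi * np.arange(N) / N)
    values = np.array([poly_at(x) for x in roots])
    # For values v_k = p(exp(2*pi*i*k/N)), FFT(v)[j] equals N * c_j.
    return np.fft.fft(values).real / N
```

In the FFTbor setting the coefficient of x^k is (proportional to) the Boltzmann probability of structures at base-pair distance k from the reference, so N evaluations of the partition-function polynomial yield the whole distribution at once.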

Evan Senter, Saad Sheikh, Ivan Dotu, Yann Ponty, Peter Clote
MethylCRF, an Algorithm for Estimating Absolute Methylation Levels at Single CpG Resolution from Methylation Enrichment and Restriction Enzyme Sequencing Methods

We introduce MethylCRF, a novel Conditional Random Fields-based algorithm to integrate methylated DNA immunoprecipitation (MeDIP-seq) and methylation-sensitive restriction enzyme (MRE-seq) sequencing data to predict DNA methylation levels at single-CpG resolution. MethylCRF was benchmarked for accuracy against Infinium arrays, RRBS, whole-genome shotgun bisulfite (WGBS) sequencing, and locus-specific bisulfite sequencing on the same DNA. MethylCRF transformation of MeDIP/MRE was equivalent to a biological replicate of WGBS in quantification, coverage, and resolution, providing a lower-cost and widely accessible strategy to create full methylomes.

Michael Stevens, Jeffrey B. Cheng, Mingchao Xie, Joseph F. Costello, Ting Wang
Counting Motifs in the Entire Biological Network from Noisy and Incomplete Data
(Extended Abstract)

Small over-represented motifs in biological networks are believed to represent essential functional units of biological processes. A natural question is to gauge whether a motif occurs abundantly or rarely in a biological network. Given that high-throughput biotechnology is only able to interrogate a portion of the entire biological network with non-negligible errors, we develop a powerful method to correct link errors in estimating undirected or directed motif counts in the entire network from noisy subnetwork data.

Ngoc Hieu Tran, Kwok Pui Choi, Louxin Zhang
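The idea of correcting motif counts for observation errors can be illustrated with a toy estimator (a minimal sketch, not the authors' method): under the simplifying assumption that each true edge is observed independently with probability p and there are no false positives, a triangle survives observation with probability p^3, so scaling the observed triangle count by 1/p^3 estimates the count in the entire network.

```python
from itertools import combinations

def count_triangles(edges):
    """Count triangles in an undirected graph given as a set of frozenset edges."""
    nodes = set().union(*edges) if edges else set()
    return sum(
        1
        for a, b, c in combinations(sorted(nodes), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edges
    )

def corrected_triangle_count(observed_edges, p):
    """Naive scale-up estimator: each true edge is seen with probability p,
    so the observed triangle count underestimates the truth by a factor p**3."""
    return count_triangles(observed_edges) / p ** 3
```

The paper's method handles both missing and spurious links in directed and undirected networks; this sketch only shows why subnetwork counts must be corrected at all.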
Extracting Structural Information from Residual Chemical Shift Anisotropy: Analytic Solutions for Peptide Plane Orientations and Applications to Determine Protein Structure

Residual dipolar coupling (RDC) and residual chemical shift anisotropy (RCSA) provide orientational restraints on internuclear vectors and the principal axes of chemical shift anisotropy (CSA) tensors, respectively. Mathematically, while an RDC represents a single sphero-conic, an RCSA can be interpreted as a linear combination of two sphero-conics. Since RDCs and RCSAs are described by a molecular alignment tensor, they contain inherent structural ambiguity due to the symmetry of the alignment tensor and the symmetry of the molecular fragment, which often leads to more than one orientation and conformation for the fragment consistent with the measured RDCs and RCSAs. While the orientational multiplicities have long been studied for RDCs, structural ambiguities arising from RCSAs have not been investigated. In this paper, we give exact and tight bounds on the number of peptide plane orientations consistent with multiple RDCs and/or RCSAs measured in one alignment medium. We prove that at most 16 orientations are possible for a peptide plane, which can be computed in closed form by solving merely a quadratic equation and applying symmetry operations. Furthermore, we show that RCSAs can be used in the initial stages of structure determination to obtain highly accurate protein backbone global folds. We exploit the mathematical interplay between sphero-conics derived from RCSA and RDC, and protein kinematics, to derive quartic equations, which can be solved in closed form to compute the protein backbone dihedral angles (φ, ψ). Building upon this, we designed a novel, sparse-data, polynomial-time divide-and-conquer algorithm to compute protein backbone conformations. Results on experimental NMR data for the protein human ubiquitin demonstrate that our algorithm computes backbone conformations with high accuracy from 13C′-RCSA or 15N-RCSA, and N-H^N RDC data. We show that the structural information present in 13C′-RCSA and 15N-RCSA can be extracted analytically, and used in a rigorous algorithmic framework to compute a high-quality protein backbone global fold from a limited amount of NMR data. This will benefit automated NOE assignment and high-resolution protein backbone structure determination from sparse NMR data.

Chittaranjan Tripathy, Anthony K. Yan, Pei Zhou, Bruce Randall Donald
Genome-Wide Survival Analysis of Somatic Mutations in Cancer

Motivation.

Next-generation DNA sequencing technologies now enable the measurement of exomes, genomes, and mRNA expression in many samples. The next challenge is to interpret these large quantities of DNA and RNA sequence data. In many human and cancer genomics studies, a major goal is to discover associations between an observed phenotype and a particular variable from genome-wide measurements of many such variables. In this work we consider the problem of testing the association between a DNA sequence variant and the survival time, or length of time that patients live following diagnosis or treatment. This problem is relevant to many cancer sequencing studies, in which one aims to discover somatic variants that distinguish patients with fast-growing tumors that require aggressive treatment from patients with better prognosis [1].

Fabio Vandin, Alexandra Papoutsaki, Benjamin J. Raphael, Eli Upfal
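The classical building block for testing whether a variant separates patients by survival time is the two-sample log-rank test. A minimal sketch (uncensored event times only, and not the authors' genome-wide procedure) is:

```python
def logrank_statistic(times_a, times_b):
    """Chi-squared log-rank statistic for two groups of uncensored event times.

    At each distinct event time, compare the observed number of events in
    group A against the number expected if the groups shared one hazard.
    """
    events = sorted(set(times_a) | set(times_b))
    o_minus_e = 0.0   # sum of observed-minus-expected events in group A
    variance = 0.0    # hypergeometric variance accumulated over event times
    for t in events:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)
        n = n_a + n_b
        d_a = times_a.count(t)                    # group-A events at time t
        d = d_a + times_b.count(t)                # total events at time t
        if n > 1:
            o_minus_e += d_a - d * n_a / n
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / variance if variance > 0 else 0.0
```

Under the null hypothesis the statistic is approximately chi-squared with one degree of freedom; a value above 3.84 is significant at the 0.05 level. Real survival data also require handling censored observations, which this sketch omits.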
Spectral Library Generating Function for Assessing Spectrum-Spectrum Match Significance

Tandem mass spectrometry (MS/MS) continues to be the technology of choice for high-throughput analysis of complex proteomics samples. While MS/MS spectra are commonly identified by matching against a database of known protein sequences, the complementary approach of spectral library searching [1,2,3] against collections of reference spectra consistently outperforms sequence-based searches, yielding significantly more identified spectra. Despite this demonstrated superior sensitivity, the development of methods to determine the statistical significance of Spectrum-Spectrum Matches (SSMs) in peptide spectral library searches is still in its early stages.

Mingxun Wang, Nuno Bandeira
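A spectrum-spectrum match is typically scored by the normalized dot product (cosine similarity) between binned peak intensities; the generating-function approach then assesses how significant such a score is. A generic scoring sketch (the scoring only, not the paper's significance computation, and with the binning assumed done upstream):

```python
import math

def cosine_ssm_score(spec_a, spec_b):
    """Normalized dot product between two spectra, each a dict mapping a
    binned m/z value to its peak intensity. Returns a value in [0, 1] for
    non-negative intensities; 1.0 means the spectra are proportional."""
    dot = sum(i * spec_b.get(mz, 0.0) for mz, i in spec_a.items())
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In practice intensities are often square-root transformed before scoring to reduce the dominance of the tallest peaks; that preprocessing choice is independent of the similarity measure itself.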
SPARSE: Quadratic Time Simultaneous Alignment and Folding of RNAs without Sequence-Based Heuristics

Motivation:

There is increasing evidence of pervasive transcription, resulting in hundreds of thousands of ncRNAs of unknown function. Standard computational analysis tasks for inferring functional annotations like clustering require fast and accurate RNA comparisons based on sequence and structure similarity. The gold standard for the latter is Sankoff’s algorithm [3], which simultaneously aligns and folds RNAs. Because of its extreme time complexity of O(n^6), numerous faster “Sankoff-style” approaches have been suggested. Several such approaches introduce heuristics based on sequence alignment, which compromises the alignment quality for RNAs with sequence identities below 60% [1]. Avoiding such heuristics, as e.g. in LocARNA [4], has been assumed to prohibit time complexities better than O(n^4), which strongly limits large-scale applications.

Sebastian Will, Christina Schmiedl, Milad Miladi, Mathias Möhl, Rolf Backofen
An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees

A phylogenetic network is a model for reticulate evolution. A hybridization network is one type of phylogenetic network for a set of discordant gene trees, and “displays” each gene tree. A central computational problem on hybridization networks is: given a set of gene trees, reconstruct the minimum (i.e. most parsimonious) hybridization network that displays each given gene tree. This problem is known to be NP-hard, and existing approaches for this problem are either heuristics or make simplifying assumptions (e.g. work with only two input trees or assume some topological properties). In this paper, we develop an exact algorithm (called PIRN_C) for inferring minimum hybridization networks from multiple gene trees. The PIRN_C algorithm does not rely on structural assumptions. To the best of our knowledge, PIRN_C is the first exact algorithm for this formulation. When the number of reticulation events is relatively small (say four or fewer), PIRN_C runs reasonably efficiently even for moderately large datasets. For building more complex networks, we also develop a heuristic version of PIRN_C called PIRN_CH. Simulations show that PIRN_CH usually produces networks with fewer reticulation events than those produced by an existing method.

Yufeng Wu
Fast and Accurate Calculation of Protein Depth by Euclidean Distance Transform

The depth of each atom/residue in a protein structure is a key attribute that has been widely used in protein structure modeling and function annotation. However, accurate calculation of depth is time consuming. Here, we propose to use the Euclidean distance transform (EDT) to calculate the depth, which conveniently converts the protein structure to a 3D gray-scale image with each voxel labeled by its minimum distance to the surface of the molecule (i.e. the depth). We tested the proposed EDT method on a set of 261 non-redundant protein structures. The data show that the EDT method is 2.6 times faster than the widely used method by Chakravarty and Varadarajan. The depth value computed by the EDT method is also highly accurate, almost identical to the depth calculated by exhaustive search (Pearson’s correlation coefficient ≈ 1). We believe the EDT-based depth calculation program can be used as an efficient tool to assist studies of protein fold recognition and structure-based function annotation.

Dong Xu, Hua Li, Yang Zhang
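The exhaustive-search baseline that the EDT method accelerates can be sketched directly: on a voxelized structure, the depth of a filled voxel is its minimum Euclidean distance to any empty (solvent) voxel. This toy version illustrates the quantity being computed, not the paper's fast transform; a real implementation would replace the all-pairs minimum with a linear-time distance transform.

```python
import math

def voxel_depths(grid):
    """Exhaustive-search depth for each filled voxel of a 3D boolean grid:
    the minimum Euclidean distance to any empty voxel. Quadratic in the
    number of voxels, which is exactly what the EDT avoids."""
    filled = [(x, y, z)
              for x, plane in enumerate(grid)
              for y, row in enumerate(plane)
              for z, v in enumerate(row) if v]
    empty = [(x, y, z)
             for x, plane in enumerate(grid)
             for y, row in enumerate(plane)
             for z, v in enumerate(row) if not v]
    return {p: min(math.dist(p, q) for q in empty) for p in filled}
```

For example, for a 3x3x3 cube of filled voxels centered in a 5x5x5 grid, the central voxel has depth 2 (two voxels from the nearest solvent voxel) while the cube's corner voxels have depth 1.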
Inference of Spatial Organizations of Chromosomes Using Semi-definite Embedding Approach and Hi-C Data

For a long period of time, scientists studied genomes assuming they are linear. Recently, chromosome conformation capture (3C) based technologies, such as Hi-C, have been developed that provide contact frequencies among loci pairs on a genome-wide scale. The technology unveiled that two far-apart loci can interact in the tested genome, indicating that the genome forms a 3D chromosomal structure within the nucleus. With the available Hi-C data, the next challenge is to computationally model the 3D chromosomal structure from the 3C-derived data. This paper presents a deterministic method called ChromSDE, which applies semi-definite programming techniques to find the best structure fitting the observed data and uses golden section search to find the correct parameter for converting contact frequency to spatial distance. To the best of our knowledge, ChromSDE is the only method that can guarantee recovering the correct structure in the noise-free case. In addition, we prove theoretically and show empirically that the parameter converting contact frequency to spatial distance changes under different resolutions. Using simulated data and real Hi-C data, we show that ChromSDE is much more accurate and robust than existing methods. Finally, we demonstrate that interesting biological findings can be uncovered from our predicted 3D structure.

ZhiZhuo Zhang, Guoliang Li, Kim-Chuan Toh, Wing-Kin Sung
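Golden section search, which ChromSDE uses to fit the frequency-to-distance conversion parameter, is a derivative-free 1-D minimizer for unimodal objectives. A generic sketch (the objective here is a placeholder, not the paper's model-fitting criterion):

```python
import math

def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Locate the minimizer of a unimodal function f on [lo, hi] by
    repeatedly shrinking the bracket at the golden ratio, so one interior
    evaluation point is reused at every iteration."""
    inv_phi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                 # minimum lies in [a, d]; reuse c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                 # minimum lies in [c, b]; reuse d
            d = a + inv_phi * (b - a)
    return (a + b) / 2
```

In the ChromSDE setting the searched parameter is the exponent converting contact frequency to spatial distance, with the objective measuring how well the resulting distances fit the embedded structure; the routine above only shows the 1-D search itself.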
Boosting Prediction Performance of Protein-Protein Interaction Hot Spots by Using Structural Neighborhood Properties
(Extended Abstract)

Binding of one protein to another in a highly specific manner to form stable complexes is critical in most biological processes, yet the mechanisms involved in protein interaction are not fully clear. The identification of hot spots, the small subset of binding-interface residues that account for the majority of binding free energy, is becoming increasingly important in understanding the principles of protein interactions. Although experiments such as alanine scanning mutagenesis and a variety of computational methods have been applied to this problem, comparative studies suggest that the development of accurate and reliable solutions is still in its infancy.

We developed PredHS (Prediction of Hot Spots), a computational method that can effectively identify hot spots on protein binding interfaces by using 38 optimally chosen properties. The optimal combination of features was selected from a set of 324 novel structural neighborhood properties by a two-step feature selection method consisting of a random forest algorithm and a sequential backward elimination method. We evaluated the performance of PredHS using a benchmark of 265 alanine-mutated interface residues (Dataset I) and a trimmed subset (Dataset II) with 10-fold cross validation. Compared with state-of-the-art approaches, PredHS achieves a significant improvement in prediction quality, which stems from the new structural neighborhood properties, the novel way of feature generation, as well as the selection power of the proposed two-step method. We further validated the capability of our method by an independent test and obtained promising results.

The PredHS web server and supplementary data are available at http://admis.tongji.edu.cn/predhs.

Lei Deng, Jihong Guan, Xiaoming Wei, Yuan Yi, Shuigeng Zhou
Backmatter
Metadata
Title
Research in Computational Molecular Biology
Editors
Minghua Deng
Rui Jiang
Fengzhu Sun
Xuegong Zhang
Copyright Year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-37195-0
Print ISBN
978-3-642-37194-3
DOI
https://doi.org/10.1007/978-3-642-37195-0
