
2017 | Book

Research in Computational Molecular Biology

21st Annual International Conference, RECOMB 2017, Hong Kong, China, May 3-7, 2017, Proceedings


About this book

This book constitutes the proceedings of the 21st Annual International Conference on Research in Computational Molecular Biology, RECOMB 2017, held in Hong Kong, China, in May 2017.
The 22 regular papers presented in this volume were carefully reviewed and selected from 184 submissions; 16 short abstracts are included in the back matter of the volume. They report on original research in all areas of computational molecular biology and bioinformatics.

Table of Contents

Frontmatter
Boosting Alignment Accuracy by Adaptive Local Realignment
Abstract
While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein’s entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. It builds upon a recent technique known as parameter advising, which finds global parameter settings for aligners, to adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy by as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment, implemented within the Opal aligner using the Facet accuracy estimator, is available at facet.cs.arizona.edu.
Dan DeBlasio, John Kececioglu
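For illustration, a minimal sketch of the realign-and-replace loop that the abstract above describes, with hypothetical stand-ins for the aligner and the accuracy estimator (the actual system uses Opal and Facet); the window size, accuracy threshold, and candidate parameter settings are illustrative assumptions.

```python
# Sketch of adaptive local realignment: scan an alignment in windows,
# re-align low-accuracy windows under candidate parameter settings, and
# keep a realignment only if its estimated accuracy improves.
# `estimate_accuracy` and `realign` are hypothetical placeholders, not the
# actual Facet estimator or Opal aligner.

def estimate_accuracy(window_columns):
    # Placeholder: a real estimator (e.g. Facet) scores alignment columns;
    # here we simply penalize gap characters.
    gaps = sum(col.count('-') for col in window_columns)
    cells = sum(len(col) for col in window_columns)
    return 1.0 - gaps / max(cells, 1)

def realign(window_columns, params):
    # Placeholder: a real aligner would recompute the window under `params`.
    return window_columns

def adaptive_local_realignment(columns, candidate_params,
                               window=30, low_accuracy=0.8):
    out = list(columns)
    for start in range(0, len(out), window):
        region = out[start:start + window]
        best, best_acc = region, estimate_accuracy(region)
        if best_acc >= low_accuracy:
            continue                      # region already looks accurate
        for params in candidate_params:   # try the advisor's parameter set
            cand = realign(region, params)
            acc = estimate_accuracy(cand)
            if acc > best_acc:
                best, best_acc = cand, acc
        out[start:start + window] = best  # splice in the best realignment
    return out
```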
A Concurrent Subtractive Assembly Approach for Identification of Disease Associated Sub-metagenomes
Abstract
Comparative analysis of metagenomes can be used to detect sub-metagenomes (species or gene sets) that are associated with specific phenotypes (e.g., host status). The typical workflow is to assemble and annotate metagenomic datasets individually or as a whole, followed by statistical tests to identify differentially abundant species/genes. We previously developed subtractive assembly (SA), a de novo assembly approach for comparative metagenomics that first detects differential reads that distinguish between two groups of metagenomes and then only assembles these reads. Application of SA to type 2 diabetes (T2D) microbiomes revealed new microbial genes associated with T2D. Here we further developed a Concurrent Subtractive Assembly (CoSA) approach, which uses a Wilcoxon rank-sum (WRS) test to detect k-mers that are differentially abundant between two groups of microbiomes (by contrast, SA only checks ratios of k-mer counts in one pooled sample versus the other). It then uses the identified differential k-mers to extract reads that are likely sequenced from the sub-metagenome with consistent abundance differences between the groups of microbiomes. Further, CoSA attempts to reduce the redundancy of reads (from abundant common species) by excluding reads containing abundant k-mers. Using simulated microbiome datasets and T2D datasets, we show that CoSA achieves strikingly better performance in detecting consistent changes than SA does, and it enables the detection and assembly of genomes and genes with minor abundance differences. An SVM classifier built upon the microbial genes detected by CoSA from the T2D datasets can accurately discriminate patients from healthy controls, with an AUC of 0.94 (10-fold cross-validation), and therefore these differential genes (207 genes) may serve as potential microbial marker genes for T2D.
Wontack Han, Mingjie Wang, Yuzhen Ye
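A minimal sketch of the differential k-mer screen and read extraction described above, assuming per-sample k-mer counts have already been computed; it uses SciPy's Wilcoxon rank-sum test as the abstract indicates, but CoSA's redundancy filtering and downstream assembly are not shown, and the cutoff and k are illustrative.

```python
# Screen k-mers for differential abundance between two microbiome groups
# with a Wilcoxon rank-sum test, then keep reads containing differential
# k-mers. Input: {kmer: ([counts in group A samples], [counts in group B samples])}.
from scipy.stats import ranksums

def differential_kmers(kmer_counts, alpha=0.01):
    hits = []
    for kmer, (counts_a, counts_b) in kmer_counts.items():
        stat, pval = ranksums(counts_a, counts_b)
        if pval < alpha:
            hits.append((kmer, stat, pval))
    return hits

def extract_reads(reads, differential, k=31):
    # Keep reads that contain at least one differential k-mer; these are
    # the reads that would be passed on to assembly.
    diff_set = {kmer for kmer, _, _ in differential}
    kept = []
    for read in reads:
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        if kmers & diff_set:
            kept.append(read)
    return kept
```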
A Flow Procedure for the Linearization of Genome Sequence Graphs
Abstract
Efforts to incorporate human genetic variation into the reference human genome have converged on the idea of a graph representation of genetic variation within a species, a genome sequence graph. A sequence graph represents a set of individual haploid reference genomes as paths in a single graph. When that set of reference genomes is sufficiently diverse, the sequence graph implicitly contains all frequent human genetic variations, including translocations, inversions, deletions, and insertions.
In representing a set of genomes as a sequence graph, one encounters certain challenges. One of the most important is the problem of graph linearization, essential both for efficiency of storage and access and for natural graph visualization and compatibility with other tools. The goal of graph linearization is to order the nodes of the graph in such a way that operations such as access, traversal, and visualization are as efficient and effective as possible.
A new algorithm for the linearization of sequence graphs, called the flow procedure, is proposed in this paper. Comparative experimental evaluation of the flow procedure against other algorithms shows that it outperforms its rivals in the metrics most relevant to sequence graphs.
David Haussler, Maciej Smuga-Otto, Benedict Paten, Adam M. Novak, Sergei Nikitin, Maria Zueva, Dmitrii Miagkov
Dynamic Alignment-Free and Reference-Free Read Compression
Abstract
The advent of High Throughput Sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pan-genomes. The ideal way to represent and transfer pan-genomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pan-genome. In this paper, we present DARRC, a new alignment-free and reference-free compression method. It addresses the problem of pan-genome compression by encoding the sequences of a pan-genome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large P. aeruginosa dataset, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared to the best performing state-of-the-art HTS-specific compression method in our experiments.
Availability. DARRC is available at https://github.com/GuillaumeHolley/DARRC.
Guillaume Holley, Roland Wittler, Jens Stoye, Faraz Hach
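For context, a minimal de Bruijn graph built from reads, the representation that DARRC's guided encoding operates on; the guided graph, incremental archive updates, and the actual compression layers are beyond this sketch, and the choice of k is illustrative.

```python
# Build a plain de Bruijn graph from reads: nodes are (k-1)-mers, and each
# k-mer contributes a directed edge from its prefix to its suffix.
# This is only the underlying representation; DARRC's guided encoding and
# compression are not shown.
from collections import defaultdict

def de_bruijn_graph(reads, k=31):
    edges = defaultdict(set)   # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return edges

reads = ["ACGTACGTGA", "CGTACGTGAT"]
graph = de_bruijn_graph(reads, k=5)
for node, succs in sorted(graph.items()):
    print(node, "->", ", ".join(sorted(succs)))
```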
A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases
Abstract
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this paper, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290x faster than BWA-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each \(\ge 5\) kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and \(> 60,000\) genomes.
Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, Adam M. Phillippy
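A minimal sketch of MinHash-based identity estimation in the spirit described above: a bottom-s sketch of k-mer hashes estimates the Jaccard index between two sequences, which is converted to an identity estimate via the Mash distance formula; sketch size and k are illustrative, and the minimizer-based mapping stage itself is not shown.

```python
# Estimate sequence identity from MinHash sketches: the fraction of shared
# hashes within the merged bottom-s sketch estimates the Jaccard index j,
# and identity is derived via the Mash distance d = -(1/k) * ln(2j/(1+j)).
import hashlib
import math

def kmer_hashes(seq, k):
    for i in range(len(seq) - k + 1):
        yield int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")

def bottom_sketch(seq, k=16, s=1000):
    # Bottom-s MinHash sketch: the s smallest distinct k-mer hash values.
    return sorted(set(kmer_hashes(seq, k)))[:s]

def estimate_identity(seq_a, seq_b, k=16, s=1000):
    sa, sb = bottom_sketch(seq_a, k, s), bottom_sketch(seq_b, k, s)
    merged = sorted(set(sa) | set(sb))[:s]
    if not merged:
        return 0.0
    shared = len(set(sa) & set(sb) & set(merged))
    j = shared / len(merged)                  # Jaccard index estimate
    if j == 0.0:
        return 0.0
    d = -math.log(2 * j / (1 + j)) / k        # Mash distance
    return max(0.0, 1.0 - d)                  # identity estimate
```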
Determining the Consistency of Resolved Triplets and Fan Triplets
Abstract
The \(\mathcal {R}^{+-} \mathcal {F}^{+-}\) Consistency problem takes as input two sets \(R^{+}\) and \(R^{-}\) of resolved triplets and two sets \(F^{+}\) and \(F^{-}\) of fan triplets, and asks for a distinctly leaf-labeled tree that contains all elements in \(R^{+} \cup F^{+}\) and no elements in \(R^{-} \cup F^{-}\) as embedded subtrees, if such a tree exists. This paper presents a detailed characterization of how the computational complexity of the problem changes under various restrictions. Our main result is an efficient algorithm for dense inputs satisfying \(R^{-} = \emptyset \) whose running time is linear in the size of the input and therefore optimal.
Jesper Jansson, Andrzej Lingas, Ramesh Rajaby, Wing-Kin Sung
Progressive Calibration and Averaging for Tandem Mass Spectrometry Statistical Confidence Estimation: Why Settle for a Single Decoy?
Abstract
Estimating the false discovery rate (FDR) among a list of tandem mass spectrum identifications is mostly done through target-decoy competition (TDC). Here we offer two new methods that can use an arbitrarily small number of additional randomly drawn decoy databases to improve TDC. Specifically, “Partial Calibration” utilizes a new meta-scoring scheme that allows us to gradually benefit from the increase in the number of identifications that calibration yields, and “Averaged TDC” (a-TDC) reduces the liberal bias of TDC for small FDR values and its variability throughout. Combining a-TDC with “Progressive Calibration” (PC), which attempts to find the “right” number of decoys required for calibration, we see a substantial impact in real datasets: when analyzing the Plasmodium falciparum data it typically yields almost the entire 17% increase in discoveries that “full calibration” yields (at FDR level 0.05) while using 60 times fewer decoys. Our methods are further validated using a novel, realistic simulation scheme and, importantly, they apply more generally to the problem of controlling the FDR among discoveries from searching an incomplete database.
Uri Keich, William Stafford Noble
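For reference, a minimal sketch of standard target-decoy competition, the baseline that the methods above improve: each spectrum keeps the better of its target and decoy match, and the FDR at a score threshold is estimated from the number of decoy wins. The averaging, meta-scoring, and calibration contributions of the paper are not shown, and the example scores are purely illustrative.

```python
# Standard target-decoy competition (TDC): for each spectrum keep the
# higher-scoring of its best target and best decoy match, then estimate
# FDR at threshold t as (#decoy wins >= t + 1) / (#target wins >= t).
def tdc_fdr(target_scores, decoy_scores, threshold):
    target_wins, decoy_wins = 0, 0
    for t, d in zip(target_scores, decoy_scores):
        if max(t, d) < threshold:
            continue
        if t >= d:
            target_wins += 1
        else:
            decoy_wins += 1
    if target_wins == 0:
        return 1.0
    return min(1.0, (decoy_wins + 1) / target_wins)

# Example: per-spectrum best target and decoy scores.
targets = [12.0, 9.5, 15.2, 7.1, 11.3]
decoys  = [ 8.0, 9.9,  6.5, 7.4,  5.0]
print(tdc_fdr(targets, decoys, threshold=9.0))
```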
Resolving Multicopy Duplications de novo Using Polyploid Phasing
Abstract
While the rise of single-molecule sequencing systems has enabled an unprecedented ability to assemble complex regions of the genome, long segmental duplications still remain a challenging frontier for assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes, using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. On both performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct less than one copy on average.
Mark J. Chaisson, Sudipto Mukherjee, Sreeram Kannan, Evan E. Eichler
A Bayesian Active Learning Experimental Design for Inferring Signaling Networks
Abstract
Machine learning methods for learning network structure, applied to quantitative proteomics experiments, reverse-engineer intracellular signal transduction networks. They provide insight into the rewiring of signaling within the context of a disease or a phenotype. To learn the causal patterns of influence between proteins in the network, the methods require experiments that include targeted interventions that fix the activity of specific proteins. However, the interventions are costly and add experimental complexity.
We describe an active learning strategy for selecting optimal interventions. Our approach takes as inputs pathway databases and historic datasets, expresses them in the form of prior probability distributions on network structures, and selects interventions that maximize their expected contribution to structure learning. Evaluations on simulated and real data show that the strategy reduces the detection error of validated edges as compared to an unguided choice of interventions, and avoids redundant interventions, thereby increasing the effectiveness of the experiment.
Robert Osazuwa Ness, Karen Sachs, Parag Mallick, Olga Vitek
\(BBK^*\) (Branch and Bound over \(K^*\)): A Provable and Efficient Ensemble-Based Algorithm to Optimize Stability and Binding Affinity over Large Sequence Spaces
Abstract
Protein design algorithms that compute binding affinity search for sequences with an energetically favorable free energy of binding. Recent work shows that the following design principles improve the biological accuracy of protein design: ensemble-based design and continuous conformational flexibility. Ensemble-based algorithms capture a measure of entropic contributions to binding affinity, \(K_a\). Designs using backbone flexibility and continuous side-chain flexibility better model conformational flexibility. A third design principle, provable guarantees of accuracy, ensures that an algorithm computes the best sequences defined by the input model (i.e. input structures, energy function, and allowed protein flexibility). However, previous provable methods that model ensembles and continuous flexibility are single-sequence algorithms, which are very costly: linear in the number of sequences and thus exponential in the number of mutable residues. To address these computational challenges, we introduce a new protein design algorithm, \(BBK^*\), that retains all aforementioned design principles yet provably and efficiently computes the tightest-binding sequences. A key innovation of \(BBK^*\) is the multi-sequence (MS) bound: \(BBK^*\) efficiently computes a single provable upper bound to approximate \(K_a\) for a combinatorial number of sequences, and entirely avoids single-sequence computation for all provably suboptimal sequences. Thus, to our knowledge, \(BBK^*\) is the first provable, ensemble-based \(K_a\) algorithm to run in time sublinear in the number of sequences. Computational experiments on 204 protein design problems show that \(BBK^*\) finds the tightest binding sequences while approximating \(K_a\) for up to \(10^5\)-fold fewer sequences than exhaustive enumeration. Furthermore, for 51 protein-ligand design problems, \(BBK^*\) provably approximates \(K_a\) up to 1982-fold faster than the previous state-of-the-art iMinDEE/\(A^*\)/\(K^*\) algorithm. Therefore, \(BBK^*\) not only accelerates protein designs that are possible with previous provable algorithms, but also efficiently performs designs that are too large for previous methods.
Adegoke A. Ojewole, Jonathan D. Jou, Vance G. Fowler, Bruce R. Donald
Superbubbles, Ultrabubbles and Cacti
Abstract
A superbubble is a type of directed acyclic subgraph with single distinct source and sink vertices. In genome assembly and genetics, the possible paths through a superbubble can be considered to represent the set of possible sequences at a location in a genome. Bidirected and biedged graphs are a generalization of digraphs that are increasingly being used to more fully represent genome assembly and variation problems. Here we define snarls and ultrabubbles, generalizations of superbubbles for bidirected and biedged graphs, and give an efficient algorithm for the detection of these more general structures. Key to this algorithm is the cactus graph, which we show encodes the nested decomposition of a graph into snarls and ultrabubbles within its structure. We propose and demonstrate empirically that this decomposition on bidirected and biedged graphs solves a fundamental problem by defining genetic sites for any collection of genomic variations, including complex structural variations, without need for any single reference genome coordinate system. Furthermore, the nesting of the decomposition gives a natural way to describe and model variations contained within large variations, a case not currently dealt with by existing formats, e.g. VCF.
Benedict Paten, Adam M. Novak, Erik Garrison, Glenn Hickey
EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices
Abstract
The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma \) is the size of the alphabet, then the method of Lam et al. can conduct one step in time \(\mathcal {O}(\sigma )\) while needing space \(\mathcal {O}(\sigma \cdot n)\), using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to \(\mathcal {O}(\log \sigma )\) while using \(\mathcal {O}(\log \sigma \cdot n)\) bits of space for both the FM and 2FM indices. This is achieved by the use of binary wavelet trees.
In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in \(\mathcal {O}(1)\) time per step while using \(\mathcal {O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)\) bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary).
We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that we are between \(\approx \)2.2 and 4.2 times faster. This will have a large impact on many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g., for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices.
Christopher Pockrandt, Marcel Ehrhardt, Knut Reinert
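For intuition, a minimal constant-time rank structure over a BWT string, the query that dominates each FM-index search step: storing per-symbol occurrence counts for every prefix gives O(1) rank at a cost of O(σ·n) words of space. The EPR-dictionary of the paper achieves constant time within the much smaller space bound stated above; that data structure itself is not sketched here, and the alphabet and example string are illustrative.

```python
# Naive O(1)-time rank structure over a BWT string: for every prefix we
# store, per symbol, how many occurrences it contains. This costs
# O(sigma * n) words of space; the EPR-dictionary reaches constant-time
# rank in far less space, which this sketch does not attempt.
class RankTable:
    def __init__(self, bwt, alphabet="ACGT$"):
        counts = {c: 0 for c in alphabet}
        self.prefix = [dict(counts)]
        for ch in bwt:
            counts[ch] += 1
            self.prefix.append(dict(counts))

    def rank(self, symbol, i):
        # Number of occurrences of `symbol` in bwt[0:i], answered in O(1).
        return self.prefix[i][symbol]

table = RankTable("G$TACCA")
print(table.rank("A", 6))   # one 'A' in "G$TACC"
```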
A Bayesian Framework for Estimating Cell Type Composition from DNA Methylation Without the Need for Methylation Reference
Abstract
Genome-wide DNA methylation levels measured from a target tissue across a population have become ubiquitous over the last few years, as methylation status is suggested to hold great potential for better understanding the role of epigenetics. Different cell types are known to have different methylation profiles. Therefore, in the common scenario where methylation levels are collected from heterogeneous sources such as blood, convoluted signals are formed according to the cell type composition of the samples. Knowledge of the cell type proportions is important for statistical analysis, and it may provide novel biological insights and contribute to our understanding of disease biology. Since high resolution cell counting is costly and often logistically impractical to obtain in large studies, targeted methods that are inexpensive and practical for estimating cell proportions are needed. Although a supervised approach has been shown to provide reasonable estimates of cell proportions, this approach leverages scarce reference methylation data from sorted cells which are not available for most tissues and are not appropriate for any target population. Here, we introduce BayesCCE, a Bayesian semi-supervised method that leverages prior knowledge on the cell type composition distribution in the studied tissue. As we demonstrate, such prior information is substantially easier to obtain compared to appropriate reference methylation levels from sorted cells. Using real and simulated data, we show that our proposed method is able to construct a set of components, each corresponding to a single cell type, and together providing up to 50% improvement in correlation when compared with existing reference-free methods. We further make a design suggestion for future data collection efforts by showing that results can be further improved using cell count measurements for a small subset of individuals in the study sample or by incorporating external data of individuals with measured cell counts. Our approach provides a new opportunity to investigate cell compositions in genomic studies of tissues for which it was not possible before.
Elior Rahmani, Regev Schweiger, Liat Shenhav, Eleazar Eskin, Eran Halperin
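For context, a minimal sketch of the reference-based (supervised) baseline the abstract contrasts with: given reference methylation profiles of sorted cell types, per-sample proportions can be estimated by least squares constrained to the probability simplex. BayesCCE itself is reference-free and Bayesian and is not implemented here; the matrix shapes, synthetic data, and SciPy solver are illustrative assumptions.

```python
# Reference-based cell-composition estimation (the supervised baseline the
# abstract contrasts with): solve min ||R p - m||^2 subject to p >= 0 and
# sum(p) = 1, where R (sites x cell types) holds reference methylation
# profiles and m is one sample's observed methylation vector.
import numpy as np
from scipy.optimize import minimize

def estimate_proportions(R, m):
    k = R.shape[1]
    objective = lambda p: np.sum((R @ p - m) ** 2)
    constraints = ({"type": "eq", "fun": lambda p: np.sum(p) - 1.0},)
    bounds = [(0.0, 1.0)] * k
    res = minimize(objective, np.full(k, 1.0 / k),
                   bounds=bounds, constraints=constraints)
    return res.x

# Synthetic illustration: 200 CpG sites, 4 cell types.
rng = np.random.default_rng(0)
R = rng.uniform(0, 1, size=(200, 4))
true_p = np.array([0.5, 0.3, 0.15, 0.05])
m = R @ true_p + rng.normal(0, 0.01, size=200)
print(np.round(estimate_proportions(R, m), 3))
```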
Towards Recovering Allele-Specific Cancer Genome Graphs
Abstract
Integrated analysis of structural variants (SVs) and copy number alterations (CNAs) in aneuploid cancer genomes is key to understanding the tumor genome complexity. A recently developed algorithm, Weaver, can estimate, for the first time, allele-specific copy number of SVs and their interconnectivity in aneuploid cancer genomes. However, one major limitation is that not all SVs identified by Weaver are phased. In this paper, we develop a general convex programming framework that predicts the interconnectivity of unphased SVs with possibly noisy allele-specific copy number estimations as input. We demonstrated through applications to both simulated data and the HeLa whole-genome sequencing data that our method is robust to the noise in the input copy numbers and can predict SV phasings with high specificity. We found that our method can make consistent predictions with Weaver even if a large proportion of the input variants are unphased. We also applied our method to TCGA ovarian cancer whole-genome sequencing samples to phase unphased SVs obtained by Weaver. Our work provides an important new algorithmic framework for recovering more complete allele-specific cancer genome graphs.
Ashok Rajaraman, Jian Ma
Using Stochastic Approximation Techniques to Efficiently Construct Confidence Intervals for Heritability
Abstract
Estimation of heritability is an important task in genetics. The use of linear mixed models (LMMs) to determine narrow-sense SNP-heritability and related quantities has received much recent attention, due to its ability to account for variants with small effect sizes. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. The common way to report the uncertainty in REML estimation uses standard errors (SE), which rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. In addition, for larger datasets (e.g., tens of thousands of individuals), the construction of SEs itself may require considerable time, as it requires expensive matrix inversions and multiplications.
Here, we present FIESTA (Fast confidence IntErvals using STochastic Approximation), a method for constructing accurate confidence intervals (CIs). FIESTA is based on parametric bootstrap sampling, and therefore avoids unjustified assumptions on the distribution of the heritability estimator. FIESTA uses stochastic approximation techniques, which accelerate the construction of CIs by several orders of magnitude, compared to previous approaches as well as to the analytical approximation used by SEs. FIESTA builds accurate CIs rapidly, e.g., requiring only several seconds for datasets of tens of thousands of individuals, making FIESTA a very fast solution to the problem of building accurate CIs for heritability for all dataset sizes.
Regev Schweiger, Eyal Fisher, Elior Rahmani, Liat Shenhav, Saharon Rosset, Eran Halperin
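A minimal parametric-bootstrap confidence interval for a generic estimator, in the spirit of the approach described above: simulate data under the fitted model, re-estimate on each replicate, and take percentile bounds. The stochastic-approximation acceleration and the LMM/REML machinery that FIESTA actually uses are not shown; the toy normal-mean example is purely an illustrative stand-in.

```python
# Generic parametric bootstrap CI: repeatedly simulate data from the fitted
# model, re-estimate the parameter on each replicate, and report percentile
# bounds of the replicate estimates.
import numpy as np

def parametric_bootstrap_ci(estimate, simulate, estimator,
                            n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    replicates = np.array([estimator(simulate(estimate, rng))
                           for _ in range(n_boot)])
    return (float(np.quantile(replicates, alpha / 2)),
            float(np.quantile(replicates, 1 - alpha / 2)))

# Toy stand-in for a heritability estimator: the mean of simulated data.
simulate = lambda mu, rng: rng.normal(mu, 1.0, size=500)
estimator = lambda x: float(np.mean(x))
print(parametric_bootstrap_ci(0.4, simulate, estimator))
```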
Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees
Abstract
Enormous databases of short-read RNA-seq sequencing experiments, such as the NIH Sequence Read Archive (SRA), are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. While some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called Split Sequence Bloom Tree (SSBT) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the SBT [1] data structure for the same task. We apply SSBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2,652 publicly available RNA-seq experiments from the SRA for breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in under 4 min using a single thread and can be stored in just 39 GB, a five-fold improvement in search and storage costs compared to SBT.
Brad Solomon, Carl Kingsford
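For illustration, a minimal Bloom-filter k-mer membership test together with the θ-fraction query rule used by Sequence Bloom Trees: a query sequence is reported for an experiment if at least a fraction θ of its k-mers appear in that experiment's filter. The split-filter layout and tree traversal that make SSBT faster and smaller are not sketched; filter size, hash count, k, and θ are illustrative.

```python
# Minimal Bloom filter and SBT-style query rule: a query sequence "hits"
# an experiment if at least a fraction theta of its k-mers are (probably)
# present in that experiment's filter.
import hashlib

class BloomFilter:
    def __init__(self, size=1 << 20, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

def query_hits(bloom, query, k=20, theta=0.9):
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    present = sum(1 for km in kmers if km in bloom)
    return present >= theta * len(kmers)
```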
AllSome Sequence Bloom Trees
Abstract
The ubiquity of next generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2,652 human RNA-seq experiments uploaded to the Sequence Read Archive. Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this paper, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39–85%, at the price of up to 3x memory consumption during queries. Notably, it can process a batch of 198,074 queries in under 8 h (compared to around two days previously) and a whole set of \(k\)-mers from a sequencing experiment (about 27 million \(k\)-mers) in under 11 min.
Chen Sun, Robert S. Harris, Rayan Chikhi, Paul Medvedev
Longitudinal Genotype-Phenotype Association Study via Temporal Structure Auto-learning Predictive Model
Abstract
With rapid progress in high-throughput genotyping and neuroimaging, imaging genetics has gained significant attention in the research of complex brain disorders, such as Alzheimer’s Disease (AD). Genotype-phenotype association studies using imaging genetic data have the potential to reveal the genetic basis and biological mechanisms of brain structure and function. AD is a progressive neurodegenerative disease; thus, it is crucial to look into the relations between SNPs and longitudinal variations of neuroimaging phenotypes. Although some machine learning models have recently been presented to capture longitudinal patterns in genotype-phenotype association studies, most of them require fixed longitudinal structures of prediction tasks and cannot automatically learn the interrelations among longitudinal prediction tasks. To address this challenge, we proposed a novel temporal structure auto-learning model to automatically uncover longitudinal genotype-phenotype interrelations and utilized such interrelated structures to enhance phenotype prediction at the same time. We conducted longitudinal phenotype prediction experiments on the ADNI cohort including 3,123 SNPs and two types of imaging markers, VBM and FreeSurfer. Empirical results demonstrated the advantages of our proposed model over its counterparts. Moreover, supporting literature was identified for our top selected SNPs, which demonstrates the plausibility of our prediction results. An executable program is available online at https://github.com/littleq1991/sparse_lowRank_regression.
Xiaoqian Wang, Jingwen Yan, Xiaohui Yao, Sungeun Kim, Kwangsik Nho, Shannon L. Risacher, Andrew J. Saykin, Li Shen, Heng Huang, for the ADNI
Improving Imputation Accuracy by Inferring Causal Variants in Genetic Studies
Abstract
Genotype imputation has been widely utilized for two reasons in the analysis of Genome-Wide Association Studies (GWAS). One reason is to increase the power of association studies when causal SNPs are not collected in the GWAS. The second reason is to aid the interpretation of a GWAS result by predicting the association statistics at untyped variants. In this paper, we show that prediction of association statistics at untyped variants that have an influence on the trait produces overly conservative results. Current imputation methods assume that none of the variants in a region (a locus consisting of multiple variants) affect the trait, which is often inconsistent with the observed data. We therefore propose a new method, CAUSAL-Imp, which can impute the association statistics at untyped variants while taking into account variants in the region that may affect the trait. Our method builds on recent methods that impute the marginal statistics for GWAS by utilizing the fact that marginal statistics follow a multivariate normal distribution. We utilize both simulated and real data sets to assess the performance of our method. We show that traditional imputation approaches underestimate the association statistics for variants involved in the trait, and our results demonstrate that our approach provides less biased estimates of these association statistics.
Yue Wu, Farhad Hormozdiari, Jong Wha J. Joo, Eleazar Eskin
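A minimal sketch of the multivariate-normal imputation of marginal statistics that the abstract says the method builds on: the z-score at an untyped variant is imputed as its conditional mean given the typed z-scores and the LD matrix. The adjustment for variants that affect the trait, which is the paper's actual contribution, is not included; the LD matrices and ridge term are illustrative.

```python
# Impute association z-scores at untyped variants from typed ones under a
# multivariate normal model: E[z_u | z_t] = S_ut S_tt^{-1} z_t, where S is
# the LD (correlation) matrix between variants.
import numpy as np

def impute_z(z_typed, ld_tt, ld_ut, ridge=0.1):
    # ld_tt: LD among typed variants; ld_ut: LD of untyped vs typed.
    # A small ridge term keeps the LD matrix invertible.
    reg = ld_tt + ridge * np.eye(ld_tt.shape[0])
    return ld_ut @ np.linalg.solve(reg, z_typed)

z_typed = np.array([2.1, 1.8, 0.3])
ld_tt = np.array([[1.0, 0.6, 0.1],
                  [0.6, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
ld_ut = np.array([[0.7, 0.5, 0.1]])      # one untyped variant
print(impute_z(z_typed, ld_tt, ld_ut))
```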
The Copy-Number Tree Mixture Deconvolution Problem and Applications to Multi-sample Bulk Sequencing Tumor Data
Abstract
Cancer is an evolutionary process driven by somatic mutation. This process can be represented as a phylogenetic tree. Constructing such a phylogenetic tree from genome sequencing data is a challenging task due to the mutational complexity of cancer and the fact that nearly all cancer sequencing is of bulk tissue, measuring a superposition of somatic mutations present in different cells. We study the problem of reconstructing tumor phylogenies from copy number aberrations (CNAs) measured in bulk-sequencing data. We introduce the Copy-Number Tree Mixture Deconvolution (CNTMD) problem, which aims to find the phylogenetic tree with the minimum number of CNAs that explains the copy number data from multiple samples of a tumor. CNTMD generalizes two approaches that have been researched intensively in recent years: deconvolution/factorization algorithms that aim to infer the number and proportions of clones in a mixed tumor sample; and phylogenetic models of copy number evolution that model the dependencies between copy number events that affect the same genomic loci. We design an algorithm for solving the CNTMD problem and apply the algorithm to both simulated and real data. On simulated data, we find that our algorithm outperforms existing approaches that perform either deconvolution or phylogenetic tree construction under the assumption of a single tumor clone per sample. On real data, we analyze multiple samples from a prostate cancer patient, identifying clones within these samples and a phylogenetic tree that relates these clones and their differing proportions across samples. This phylogenetic tree provides a higher-resolution view of copy number evolution of this cancer than published analyses.
Simone Zaccaria, Mohammed El-Kebir, Gunnar W. Klau, Benjamin J. Raphael
Quantifying the Impact of Non-coding Variants on Transcription Factor-DNA Binding
Abstract
Many recent studies have emphasized the importance of genetic variants and mutations in cancer and other complex human diseases. The overwhelming majority of these variants occur in non-coding portions of the genome, where they can have a functional impact by disrupting regulatory interactions between transcription factors (TFs) and DNA. Here, we present a method for assessing the impact of non-coding mutations on TF-DNA interactions, based on regression models of DNA-binding specificity trained on high-throughput in vitro data. We use ordinary least squares (OLS) to estimate the parameters of the binding model for each TF, and we show that our predictions of TF binding changes due to DNA mutations correlate well with measured changes in gene expression. In addition, by leveraging distributional results associated with OLS estimation, for each predicted change in TF binding we also compute a normalized score (z-score) and a significance value (p-value) reflecting our confidence that the mutation affects TF binding. We use this approach to analyze a large set of pathogenic non-coding variants, and we show that these variants lead to significant differences in TF binding between alleles, compared to a control set of common variants. Thus, our results indicate that there is a strong regulatory component to the pathogenic non-coding variants identified thus far.
Jingkang Zhao, Dongshunyi Li, Jungkyun Seo, Andrew S. Allen, Raluca Gordân
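A minimal sketch of OLS-based scoring in the spirit described above: a linear binding model over one-hot encoded sequences predicts the binding change between reference and alternate alleles, and the OLS parameter covariance yields a z-score and p-value for that change. The sequence encoding, model form, and any training data are illustrative stand-ins rather than the paper's exact model.

```python
# Score a non-coding variant's effect on TF binding with an OLS model:
# delta = x_alt'b - x_ref'b, with variance (x_alt - x_ref)' Cov(b) (x_alt - x_ref)
# taken from the OLS fit, giving a z-score and two-sided p-value.
import numpy as np
from scipy.stats import norm

def fit_ols(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, sigma2 * XtX_inv          # coefficients and Cov(beta)

def one_hot(seq):
    # One-hot encode a DNA sequence (A, C, G, T) into a flat feature vector.
    order = "ACGT"
    return np.array([1.0 if b == n else 0.0 for b in seq for n in order])

def binding_change_z(seq_ref, seq_alt, beta, cov_beta):
    d = one_hot(seq_alt) - one_hot(seq_ref)
    delta = d @ beta
    z = delta / np.sqrt(d @ cov_beta @ d)
    return delta, z, 2 * norm.sf(abs(z))   # change, z-score, p-value
```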
aBayesQR: A Bayesian Method for Reconstruction of Viral Populations Characterized by Low Diversity
Abstract
RNA viruses replicate with high mutation rates, creating closely related viral populations. The heterogeneous virus populations, referred to as viral quasispecies, rapidly adapt to environmental changes thus adversely affecting efficiency of antiviral drugs and vaccines. Therefore, studying the underlying genetic heterogeneity of viral populations plays a significant role in the development of effective therapeutic treatments. Recent high-throughput sequencing technologies have provided invaluable opportunity for uncovering the structure of quasispecies populations. However, accurate reconstruction of viral quasispecies remains difficult due to limited read-lengths and presence of sequencing errors. The problem is particularly challenging when the strains in a population are highly similar, i.e., the sequences are characterized by low mutual genetic distances, and further exacerbated if some of those strains are relatively rare; this is the setting where state-of-the-art methods struggle. In this paper, we present a novel viral quasispecies reconstruction algorithm, aBayesQR, that employs a maximum-likelihood framework to infer individual sequences in a mixture from high-throughput sequencing data. The search for the most likely quasispecies is conducted on long contigs that our method constructs from the set of short reads via agglomerative hierarchical clustering; operating on contigs rather than short reads enables identification of close strains in a population and provides computational tractability of the Bayesian method. Results on both simulated and real HIV-1 data demonstrate that the proposed algorithm generally outperforms state-of-the-art methods; aBayesQR particularly stands out when reconstructing a set of closely related viral strains (e.g., quasispecies characterized by low diversity).
Soyeon Ahn, Haris Vikalo
Backmatter
Metadata
Title
Research in Computational Molecular Biology
Edited by
S. Cenk Sahinalp
Copyright Year
2017
Electronic ISBN
978-3-319-56970-3
Print ISBN
978-3-319-56969-7
DOI
https://doi.org/10.1007/978-3-319-56970-3