
2013 | Book

Pattern Recognition in Bioinformatics

8th IAPR International Conference, PRIB 2013, Nice, France, June 17-20, 2013. Proceedings

Editors: Alioune Ngom, Enrico Formenti, Jin-Kao Hao, Xing-Ming Zhao, Twan van Laarhoven

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 8th IAPR International Conference on Pattern Recognition in Bioinformatics, PRIB 2013, held in Nice, France, in June 2013. The 25 revised full papers presented were carefully reviewed and selected from 43 submissions. The papers are organized in topical sections on bio-molecular networks and pathway analysis; learning, classification, and clustering; data mining and knowledge discovery; protein: structure, function, and interaction; motifs, sites, and sequence analysis.

Table of Contents

Frontmatter

Bio-molecular Networks and Pathway Analysis

A Fast Agglomerative Community Detection Method for Protein Complex Discovery in Protein Interaction Networks
Abstract
Proteins are known to interact with each other by forming protein complexes in order to perform specific biological functions. Many community detection methods have been devised for the discovery of protein complexes in protein interaction networks. One common problem in current agglomerative community detection approaches is that vertices with just one neighbor are often classified as separate clusters, which does not make sense for complex identification. Also, a major limitation of agglomerative techniques is that their computational efficiency does not scale well to large protein interaction networks (PINs). In this paper, we propose a new agglomerative algorithm, FAC-PIN, which is based on a local premetric of relative vertex-to-vertex clustering value and addresses the above two issues. Our proposed FAC-PIN method is applied to eight PINs from different species, and the identified complexes are validated using experimentally verified complexes. The preliminary computational results show that FAC-PIN can discover protein complexes from PINs more accurately and faster than the HC-PIN and CNM algorithms, the current state-of-the-art agglomerative approaches to complex prediction.
Mohammad S. Rahman, Alioune Ngom
Inferring Gene Regulatory Networks from Time-Series Expressions Using Random Forests Ensemble
Abstract
Reconstructing gene regulatory networks (GRNs) from time-series expression data has become increasingly popular since time course data contain temporal information about gene regulation. A typical microarray gene expression dataset contains expressions of thousands of genes, but the number of time samples is usually very small. Therefore, inferring a GRN from such high-dimensional expression data poses a major challenge. This paper proposes a tree-based ensemble of random forests in a multivariate auto-regression framework to tackle this problem. The efficacy of the proposed approach is demonstrated on synthetic time-series datasets and Saccharomyces cerevisiae (Yeast) microarray gene expression data with 9 genes. The performance is comparable to or better than that of GRNs generated using dynamic Bayesian networks and ordinary differential equation (ODE) models.
D. A. K. Maduranga, Jie Zheng, Piyushkumar A. Mundra, Jagath C. Rajapakse
Local Topological Signatures for Network-Based Prediction of Biological Function
Abstract
In biology, similarity in structure or sequence between molecules is often used as evidence of functional similarity. In protein interaction networks, structural similarity of nodes (i.e., proteins) is often captured by comparing node signatures (vectors of topological properties of neighborhoods surrounding the nodes).
In this paper, we ask how well such topological signatures predict protein function, using protein interaction networks of the organism Saccharomyces cerevisiae. To this end, we compare two node signatures from the literature – the graphlet degree vector and a signature based on the graph spectrum – and our own simple node signature based on basic topological properties.
We find the connection between topology and protein function to be weak but statistically significant. Surprisingly, our node signature, despite its simplicity, performs on par with the other more sophisticated node signatures. In fact, we show that just two metrics, the link count and transitivity, are enough to classify protein function at a level on par with the other signatures suggesting that detailed topological characteristics are unlikely to aid in protein function prediction based on protein interaction networks.
Wynand Winterbach, Piet Van Mieghem, Marcel J. T. Reinders, Huijuan Wang, Dick de Ridder
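As a rough illustration of the two metrics named in the abstract above, the following sketch computes a node's link count and a local transitivity value on a toy undirected network. It assumes that "transitivity" means the local clustering coefficient (the fraction of neighbor pairs that are themselves linked); the function name `node_signature` and the toy network are hypothetical and not taken from the paper.

```python
from itertools import combinations


def node_signature(adj, node):
    """Return (link count, local transitivity) for `node`.

    `adj` maps each node to the set of its neighbours in an
    undirected network without self-loops (an assumed input format).
    """
    neighbours = adj[node]
    degree = len(neighbours)
    if degree < 2:
        # With fewer than two neighbours no triangle is possible.
        return degree, 0.0
    # Count neighbour pairs that are themselves connected.
    closed = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    possible = degree * (degree - 1) / 2
    return degree, closed / possible


# Toy network: A-B, A-C, A-D, B-C.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

for name in sorted(toy):
    print(name, node_signature(toy, name))
```

Such a two-number signature could then be fed to any standard classifier; the abstract does not say which one the authors used.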
Mutational Genomics for Cancer Pathway Discovery
Abstract
We propose mutational genomics as an approach for identifying putative cancer pathways. This approach relies on expression profiling tumors that are induced by retroviral insertional mutagenesis. Akin to genetical genomics, this provides the opportunity to search for associations between tumor-initiating events (the viral insertion sites) and the consequent transcription changes, thus revealing putative regulatory interactions. An important advantage is that in mutational genomics the selective pressure exerted by the tumor growth is exploited to yield a relatively small number of loci that are likely to be causal for tumor formation. This is unlike genetical genomics, which relies on the naturally occurring genetic variation between samples to reveal the effects of a locus on gene expression.
We performed mutational genomics using a set of 97 lymphomas from mice presenting with splenomegaly. This identified several known as well as novel interactions, including many known targets of Notch1 and Gfi1. In addition to direct one-to-one associations, many multilocus networks of association were found. This indicates that a cell has many parallel routes by which it can reach a state of uncontrolled proliferation. One of the identified networks suggests that Zmiz1 functions upstream of Notch1. Taken together, our results illustrate the potential of mutational genomics as a powerful approach to dissect the regulatory pathways of cancer.
Jeroen de Ridder, Jaap Kool, Anthony G. Uren, Jan Bot, Johann de Jong, Alistair G. Rust, Anton Berns, Maarten van Lohuizen, David J. Adams, Lodewyk Wessels, Marcel Reinders
Outlier Gene Set Analysis Combined with Top Scoring Pair Provides Robust Biomarkers of Pathway Activity
Abstract
Cancer is a disease driven by pathway activity, while useful biomarkers to predict outcome (prognostic markers) or determine treatment (treatment markers) rely on individual genes, proteins, or metabolites. We provide a novel approach that isolates pathways of interest by integrating outlier analysis and gene set analysis and couple it to the top-scoring pair algorithm to identify robust biomarkers. We demonstrate this methodology on pediatric acute myeloid leukemia (AML) data. We develop a biomarker in primary AML tumors, demonstrate robustness with an independent primary tumor data set, and show that the identified biomarkers also function well in relapsed AML tumors.
Michael F. Ochs, Jason E. Farrar, Michael Considine, Yingying Wei, Soheil Meschinchi, Robert J. Arceci
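The top-scoring pair step mentioned in the abstract above can be sketched as follows. This is a minimal version of the general two-class top-scoring pair idea (pick the gene pair whose relative ordering differs most between the classes), not the authors' pipeline; the outlier and gene set analysis stages are omitted, and the function name and toy expression values are hypothetical.

```python
from itertools import combinations


def top_scoring_pair(samples, labels):
    """Return ((i, j), score) for the gene pair whose ordering best
    separates two classes.

    `samples` is a list of expression vectors, `labels` holds 0 or 1
    per sample. The score is |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.
    """
    n_genes = len(samples[0])
    groups = {
        0: [s for s, y in zip(samples, labels) if y == 0],
        1: [s for s, y in zip(samples, labels) if y == 1],
    }
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(n_genes), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p0 = sum(s[i] < s[j] for s in groups[0]) / len(groups[0])
        p1 = sum(s[i] < s[j] for s in groups[1]) / len(groups[1])
        score = abs(p0 - p1)
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Toy data: gene 0 is below gene 1 in class 0 and above it in class 1.
samples = [[1, 2, 5], [1, 3, 5], [2, 4, 5], [3, 1, 5], [4, 2, 5], [5, 1, 6]]
labels = [0, 0, 0, 1, 1, 1]

print(top_scoring_pair(samples, labels))
```

Because the rule depends only on the ordering of two measurements within a sample, it is insensitive to monotone normalization, which is one commonly cited reason for the robustness of pair-based markers.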
Restricted Neighborhood Search Clustering Revisited: An Evolutionary Computation Perspective
Abstract
Protein-protein interaction networks have been broadly studied in the last few years, in order to understand the behavior of proteins inside the cell. Proteins interacting with each other often share common biological functions or participate in the same biological process. Thus, discovering protein complexes made of groups of strictly related proteins can be useful to predict protein functions. Clustering techniques have been widely employed to detect significant biological complexes. In this paper, we integrate one of the most popular network clustering techniques, namely the Restricted Neighborhood Search Clustering (RNSC), with evolutionary computation. The two cost functions introduced by RNSC, besides a new one that combines them, are used by a Genetic Algorithm as fitness functions to be optimized. Experimental evaluations performed on two different groups of interactions of the budding yeast Saccharomyces cerevisiae show that the clusters obtained by the genetic approach are more accurate than those found by RNSC, though this method predicts more true complexes.
Clara Pizzuti, Simona E. Rombo

Learning, Classification, and Clustering

Class Dependent Feature Weighting and K-Nearest Neighbor Classification
Abstract
Feature weighting in supervised learning concerns the development of methods for quantifying the capability of features to discriminate instances from different classes. A popular method for this task, called RELIEF, generates a feature weight vector from a given training set, one weight for each feature. This is achieved by maximizing in a greedy way the sample margin defined on the nearest neighbor classifier. The contribution from each class to the sample margin maximization defines a set of class dependent feature weight vectors, one for each class. This provides a tool to unravel interesting properties of features relevant to a single class of interest.
In this paper we analyze such class dependent feature weight vectors. For instance, we show that in a machine learning dataset describing instances of recurrence and non-recurrence events in breast cancer, the features have different relevance in the two types of events, with size of the tumor estimated to be highly relevant in the recurrence class but not in the non-recurrence one. Furthermore, results of experiments show that a high correlation between feature weights of one class and those generated by RELIEF corresponds to an easier classification task.
In general, results of this investigation indicate that class dependent feature weights are useful to unravel interesting properties of features with respect to a class of interest, and they provide information on the relative difficulty of classification tasks.
Elena Marchiori
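To make the idea of class dependent feature weights concrete, here is a minimal sketch of a simplified Relief-style update. It assumes Manhattan distance, a single pass over all instances, and that the per-class weight vector is obtained by accumulating the updates contributed by instances of that class; these choices and all names are illustrative assumptions, not the paper's exact formulation.

```python
def relief_weights(X, y):
    """Return (overall_weights, per_class_weights).

    For each instance x the update for feature f is
    |x_f - nearest_miss_f| - |x_f - nearest_hit_f|.
    Each class must contain at least two instances.
    """
    n_features = len(X[0])

    def dist(a, b):
        # Manhattan distance (an assumption made for this sketch).
        return sum(abs(p - q) for p, q in zip(a, b))

    per_class = {c: [0.0] * n_features for c in set(y)}
    for idx, (x, c) in enumerate(zip(X, y)):
        hits = [X[k] for k in range(len(X)) if k != idx and y[k] == c]
        misses = [X[k] for k in range(len(X)) if y[k] != c]
        hit = min(hits, key=lambda h: dist(x, h))
        miss = min(misses, key=lambda m: dist(x, m))
        for f in range(n_features):
            per_class[c][f] += abs(x[f] - miss[f]) - abs(x[f] - hit[f])
    # The overall vector is the sum of the class contributions.
    overall = [sum(w[f] for w in per_class.values()) for f in range(n_features)]
    return overall, per_class


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]

overall, per_class = relief_weights(X, y)
print(overall)
print(per_class)
```

In this toy case the discriminative feature receives a positive weight and the noisy one a negative weight, both overall and within each class.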
Simultaneous Sample and Gene Selection Using T-score and Approximate Support Vectors
Abstract
T-score, based on t-statistics between samples and disease classes, is a widely used filter criterion for gene selection from microarray data. However, the classical T-score uses all the training samples, whereas for both biological and computational reasons, selecting relevant samples for training is an important step in classification. Using a modified logistic regression approach, we propose a sample selection criterion based on T-score and develop a backward elimination approach for gene selection. The method is more stable and computationally less costly than support vector machine recursive feature elimination (SVM-RFE) methods.
Piyushkumar A. Mundra, Jagath C. Rajapakse, D. A. K. Maduranga
Versatile Sparse Matrix Factorization and Its Applications in High-Dimensional Biological Data Analysis
Abstract
Non-negative matrix factorization and sparse representation models have been successfully applied in high-throughput biological data analysis. In this paper, we propose our versatile sparse matrix factorization (VSMF) model for biological data mining. We show that many well-known sparse models are specific cases of VSMF. Through tuning parameters, sparsity, smoothness, and non-negativity can be easily controlled in VSMF. Our computational experiments corroborate the advantages of VSMF.
Yifeng Li, Alioune Ngom

Data Mining and Knowledge Discovery

A Local Structural Prediction Algorithm for RNA Triple Helix Structure
Abstract
Secondary structure prediction (with or without pseudoknots) of an RNA molecule is a well-known problem in computational biology. Most of the existing algorithms have an assumption that each nucleotide can interact with at most one other nucleotide. This assumption is not valid for triple helix structure (a pseudoknotted structure with tertiary interactions). As these structures are found to be important in many biological processes, it is desirable to develop a prediction tool for these structures. We provide the first structural prediction algorithm to handle triple helix structures. Our algorithm runs in O(n^3) time where n is the length of input RNA sequence. The accuracy of the prediction is reasonably high, with average sensitivity and specificity over 80% for base pairs, and over 70% for tertiary interactions.
Bay-Yuan Hsu, Thomas K. F. Wong, Wing-Kai Hon, Xinyi Liu, Tak-Wah Lam, Siu-Ming Yiu
Combining Protein Fragment Feature-Based Resampling and Local Optimisation
Abstract
Protein structure prediction (PSP) suites can predict ‘near-native’ protein models. However, these predicted models are not always close enough to the native structure to be useful for biologists. The literature to date demonstrates that one of the best techniques to predict ‘near-native’ protein models is to use a fragment-based search strategy. Another technique that can help refine protein models is local optimisation. Local optimisation algorithms use the gradient of the function being optimised to suggest which move will bring the function value closer to its local minimum. In this work we combine the concepts of structural refinement through feature-based resampling, fragment-based PSP, and local optimisation to create an algorithm that can create protein models that are closer to their native states. In experiments we demonstrated that our new method generates models that are close to their native conformations. For structures in the test set, it obtained an average RMSD of 5.09 Å and an average best TM-Score of 0.47 when no local optimisation was applied. However, by applying local optimisation to our algorithm, additional improvements were achieved.
Trent Higgs, Lukas Folkman, Bela Stantic
Experimental Determination of Intrinsic Drosophila Embryo Coordinates by Evolutionary Computation
Abstract
Early fruit fly embryo development begins with the formation of a chemical blueprint that guides cellular movements and the development of organs and tissues. This blueprint sets the intrinsic spatial coordinates of the embryo. The coordinates are curvilinear from the start, becoming more curvilinear as cells start coherent movements several hours into development. This dynamic aspect of the curvature is an important characteristic of early embryogenesis: characterizing it is crucial for quantitative analysis and dynamic modeling of development. This presents a number of methodological problems for the elastic deformation of 3D and 4D data from confocal microscopy, to standardize images and follow temporal changes. The parameter searches for these deformations present hard optimization problems. Here we describe our evolutionary computation approaches to these problems. We outline some of the immediate applications of these techniques to crucial problems in Drosophila developmental biology.
Alexander V. Spirov, Carlos E. Vanario-Alonso, Ekaterina N. Spirova, David M. Holloway
Identifying Informative Genes for Prediction of Breast Cancer Subtypes
Abstract
It is known that breast cancer is not just one disease, but rather a collection of many different diseases occurring in one site that can be distinguished based in part on characteristic gene expression signatures. Appropriate diagnosis of the specific subtypes of this disease is critical for ensuring the best possible patient response to therapy. Currently, therapeutic direction is determined based on the expression of characteristic receptors; while cost effective, this method is not robust and is limited to predicting a small number of subtypes reliably. Using the original 5 subtypes of breast cancer we hypothesized that machine learning techniques would offer many benefits for feature selection. Unlike existing gene selection approaches, we propose a tree-based approach that conducts gene selection and builds the classifier simultaneously. We conducted computational experiments to select the minimal number of genes that would reliably predict a given subtype. Our results support that this modified approach to gene selection yields a small subset of genes that can predict subtypes with greater than 95% overall accuracy. In addition to providing a valuable list of targets for diagnostic purposes, the gene ontologies of selected genes suggest that these methods have isolated a number of potential genes involved in breast cancer biology, etiology and potentially novel therapeutics.
Iman Rezaeian, Yifeng Li, Martin Crozier, Eran Andrechek, Alioune Ngom, Luis Rueda, Lisa Porter
Predicting Therapeutic Targets with Integration of Heterogeneous Data Sources
Abstract
Drug targets are of great importance for designing new drugs and understanding the molecular mechanism of drug actions. In general, a drug may bind to multiple proteins, some of which are not related to disease treatment or even lead to side effects. Therefore, it is necessary to discriminate the effect-mediating drug targets, i.e. therapeutic targets, from other proteins. Although many computational approaches have been developed to predict drug targets and have achieved partial success, little attention has been paid to predicting therapeutic targets. In this work, we present a new framework to predict drug therapeutic targets based on the integration of heterogeneous data sources. In particular, we develop an ensemble classifier, PTEC (Predicting Therapeutic targets with Ensemble Classifier), that can efficiently integrate both drug and protein properties described from distinct perspectives, thereby improving prediction accuracy. The results on benchmark datasets demonstrate that our approach outperforms other popular approaches significantly, implying the effectiveness of our proposed approach. Furthermore, the results indicate that the integration of different data sources can improve not only the coverage of predicted targets but also the prediction precision. In other words, distinct data sources indeed complement each other, and the integration of these heterogeneous data sources can improve the prediction accuracy.
Yan-Fen Dai, Yin-Ying Wang, Xing-Ming Zhao
Using Predictive Models to Engineer Biology: A Case Study in Codon Optimization
Abstract
Given recent advances in synthetic biology and DNA synthesis, there is an increasing need for carefully engineered biological parts (e.g. genes, promoter sequences or enzymes) and circuits. However, forward engineering approaches are thus far rarely used in biology due to lack of detailed knowledge of the biological mechanisms. We describe a framework that enables forward engineering in biology by constructing models predictive of properties of interest, then inverting and using these models to design biological parts.
We demonstrate the applicability of the proposed framework on the problem of codon optimization, concerned with optimizing gene coding sequences for efficient translation. Results suggest that our data-driven codon optimization (DECODON) method simultaneously considers the effects of multiple translation mechanisms to produce optimal sequences, in contrast to existing codon optimization techniques.
Alexey A. Gritsenko, Marcel J. T. Reinders, Dick de Ridder

Protein: Structure, Function, and Interaction

Active Learning for Protein Function Prediction in Protein-Protein Interaction Networks
Abstract
The high-throughput technologies have led to vast amounts of protein-protein interaction (PPI) data, and a number of approaches based on PPI networks have been proposed for protein function prediction. However, these approaches do not work well if annotated proteins are scarce in the networks. To address this issue, we propose an active learning based approach that uses graph-based centrality metrics to select proper candidates for labeling. We first cluster a PPI network by using the spectral clustering algorithm and select some proper candidates for labeling within each cluster, and then apply a collective classification algorithm to predict protein function based on these annotated proteins. Experiments over two real datasets demonstrate that the active learning based approach achieves better prediction performance by choosing more informative proteins for labeling. Experimental results also validate that betweenness centrality is more effective than degree centrality and closeness centrality in most cases.
Wei Xiong, Luyu Xie, Jihong Guan, Shuigeng Zhou
Conditional Random Fields for Protein Function Prediction
Abstract
Markov Random Fields (MRF) have been shown to be good predictors of functional annotation, using protein-protein interaction data. Many other sources of data can also be used in this prediction task, but they are typically not integrated. In this study, we extend a method using MRFs in order to allow the use of additional data.
A conditional random field (CRF) model is proposed as an alternative to an MRF model in order to remove the requirement of modeling relationships between the sources of data. We observe that a substantial performance improvement is possible using additional data, such as genetic interaction networks. The improvement gained from each source of evidence is not the same for each protein function, indicating that each source supplies different information. We demonstrate that CRFs can be used to efficiently integrate various sources of data to predict functional annotations.
Thies Gehrmann, Marco Loog, Marcel J. T. Reinders, Dick de Ridder
Enhancing Protein Fold Prediction Accuracy Using Evolutionary and Structural Features
Abstract
Protein fold recognition (PFR) is considered as an important step towards the protein structure prediction problem. It also provides crucial information about the functionality of the proteins. Despite all the efforts that have been made during the past two decades, finding an accurate and fast computational approach to solve PFR still remains a challenging problem for bioinformatics and computational biology. It has been shown that extracting features which contain significant local and global discriminatory information plays a key role in addressing this problem. In this study, we propose the concept of segmented-based feature extraction technique to provide local evolutionary information embedded in Position Specific Scoring Matrix (PSSM) and structural information embedded in the predicted secondary structure of proteins using SPINE-X. We also employ the concept of occurrence feature to extract global discriminatory information from PSSM and SPINE-X. By applying a Support Vector Machine (SVM) to our extracted features, we enhance the protein fold prediction accuracy to 7.4% over the best results reported in the literature.
Abdollah Dehzangi, Kuldip Paliwal, James Lyons, Alok Sharma, Abdul Sattar
Exploring Potential Discriminatory Information Embedded in PSSM to Enhance Protein Structural Class Prediction Accuracy
Abstract
Determining the structural class of a given protein can provide important information about its functionality and its general tertiary structure. In the last two decades, the protein structural class prediction problem has attracted tremendous attention and its prediction accuracy has been significantly improved. Features extracted from the Position Specific Scoring Matrix (PSSM) have played an important role in achieving this enhancement. However, this information has not been adequately explored since the protein structural class prediction accuracy relying on PSSM for feature extraction still remains limited. In this study, to explore this potential, we propose a segmentation-based feature extraction technique based on the concepts of amino acids’ distribution and auto covariance. By applying a Support Vector Machine (SVM) to our extracted features, we enhance protein structural class prediction accuracy up to 16% over similar studies found in the literature. We achieve over 90% and 80% prediction accuracies for 25PDB and 1189 benchmarks respectively by solely relying on the PSSM for feature extraction.
Abdollah Dehzangi, Kuldip Paliwal, James Lyons, Alok Sharma, Abdul Sattar
Inferring the Association Network from p53 Sequence Alignment Using Granular Evaluations
Abstract
The relationship connecting the biomolecular sequence, the molecular structure, and the biological function is of extreme importance in nanostructure analysis such as drug discovery. Previous studies involving multiple sequence alignment of biomolecules have demonstrated that associated sites are indicative of the structural and functional characteristics of biomolecules, comparable to methods such as consensus sequences analysis. In this paper, a new method to detect associated sites in aligned sequence ensembles is proposed. It involves the use of multiple sub-tables (or levels) of two-dimensional contingency table analysis. The idea is to incorporate analysis by using a concept known as granular computing, which represents information at different levels of granularity. The analysis involves two phases. The first phase includes labeling of the molecular sites in the p53 protein multiple sequence alignment according to the detected associated patterns. The sites are consequently labeled into three different types based on their site characteristics: 1) conserved sites, 2) associated sites and 3) hypervariate sites. In the second phase, the significance of the extracted site patterns is evaluated with respect to targeted structural and functional characteristics of the p53 protein. The results indicate that the extracted site patterns are significantly associated with some of the known functionalities of p53, a cancer suppressor. Furthermore, when these sites are aligned with p63 and p73, the homologs of p53 without the same cancer suppressing property, based on the common domains, the sites significantly discriminate between the human sequences of the p53 family. Therefore, the study confirms the importance of these detected sites that could indicate their differences in cancer suppressing property.
David K. Y. Chiu, Ramya Manjunath
Prediction of Non-genotoxic Hepatocarcinogenicity Using Chemical-Protein Interactions
Abstract
The assessment of non-genotoxic hepatocarcinogenicity of chemicals is currently based on 2-year rodent bioassays. It is desirable to develop a fast and effective method to accelerate the identification of potential hepatocarcinogenicity of non-genotoxic chemicals. In this study, a novel method CPI is proposed to predict potential hepatocarcinogenicity of non-genotoxic chemicals. The CPI method is based on chemical-protein interactions and interpretable decision tree classifiers. The interpretable rules generated by the CPI method are analyzed to provide insights into the mechanism and biomarkers of non-genotoxic hepatocarcinogenicity. The CPI method with an independent test accuracy of 86% using only 1 protein biomarker outperforms the state-of-the-art methods of gene expression profile-based toxicogenomics using 90 gene biomarkers. A protein ABCC3 was identified as a potential protein biomarker for further exploration. This study presents the potential application of CPI method for assessing non-genotoxic hepatocarcinogenicity of chemicals.
Chun-Wei Tung

Motifs, Sites, and Sequences Analysis

A Structure Based Algorithm for Improving Motifs Prediction
Abstract
Minimotifs are short contiguous peptide sequences in proteins that are known to have functions. There are many repositories for experimentally validated minimotifs. MnM is one of them. Predicting minimotifs (in unknown sequences) is a challenging and interesting problem in biology. Minimotifs stored in the MnM database range in length from 5 to 15. Any algorithm for predicting minimotifs in an unknown query sequence is likely to have many false positives owing to the short lengths of the motifs looked for. Our team has developed a series of algorithms (called filters) in the past to reduce the false positives and improve the prediction accuracy. All of these algorithms are based on sequence information. In a recent paper we have demonstrated the power of structural information in characterizing motifs. In this paper we present an algorithm that exploits structural information for reducing false positives in motifs prediction. We test the validity of our algorithm using the minimotifs stored in the MnM database. MnM is a web system for minimotif search that our team has built. It houses more than 300,000 minimotifs. Our new algorithm is a learning algorithm that will be trained in the first phase and in the second phase its accuracy will be measured. For any input query protein sequence, MnM identifies a list of putative minimotifs in the query sequence. We currently employ a series of sequence based algorithms to reduce the false positives in the predictions of MnM. For every minimotif stored in MnM, we also store a number of attributes pertinent to the motif. One such attribute is the source of the minimotif. The source is the protein in which the minimotif is present. For the analysis of our new algorithm we only employ those minimotifs that have multiple sources for positive control. Random data is used as negative data.
The basic idea of our algorithm is the hypothesis that a putative minimotif is likely to be valid if its structure in the query sequence is very similar to its structure in its source protein. Another important feature of our algorithm is that it is specific to individual minimotifs. In other words, a unique set of parameters is learnt for every minimotif. We feel that this is a better approach than learning a common set of parameters for all the minimotifs together. Our findings reveal that in most of the cases the occurrences of the minimotifs in their source proteins are structurally similar. Also, typically, the occurrences of a minimotif in its source protein and a random protein are dissimilar. Our experimental results show that the parameters learnt by our algorithm can significantly reduce false positives.
Sudipta Pathak, Vamsi Krishna Kundeti, Martin R. Schiller, Sanguthevar Rajasekaran
A Workflow for the Prediction of the Effects of Residue Substitution on Protein Stability
Abstract
The effects of residue substitution in protein can be dramatic and predicting its impact may benefit scientists greatly. As in many scientific domains, there are various methods and tools available to address the potential impact of a mutation on the structure of a protein. The identification of these methods, their availability, the time needed to gain enough familiarity with them and their interface, and the difficulty of integrating their results in a global view where all viewpoints can be visualized often limit their use. In this paper, we present the Structural Prediction for pRotein fOlding UTility System (SPROUTS) workflow and describe our method for designing, documenting, and maintaining the workflow. The focus of the workflow is the thermodynamic contribution to stability, which can be considered as acceptable for small proteins. It compiles the predictions from various sources calculating the ΔΔG upon point mutation, together with a consensus from eight distinct algorithms, with a prediction of the mean number of interacting residues during the process of folding, and a sub domain structural analysis into fragments that may potentially be considered as autonomous folding units, i.e., with similar conformations alone and in the protein body. The workflow is implemented and available online. We illustrate its use with the analysis of the engrailed homeodomain (PDB code 1enh).
Ruben Acuña, Zoé Lacroix, Jacques Chomilier
Estimating Viral Haplotypes in a Population Using k-mer Counting
Abstract
Viral haplotype estimation in a population is an important problem in virology. Viruses undergo a high number of mutations and recombinations during replication for their survival in host cells and exist as a population of closely related genetic variants. Due to this, estimating the number of haplotypes and their relative frequencies in the population becomes a challenging task. The usage of a sequenced reference genome has its limitations due to the high mutational rates in viruses. We propose a method for estimating viral haplotypes based only on the counts of k-mers present in the viral population without using the reference genome. We compute k-mer pairs that are related to each other by one mutation, and compute a minimal set of viral haplotypes that explain the whole population based on these k-mer pairs. We compare our method to the software ShoRAH (which uses a reference genome) on a simulated dataset and obtain comparable results, even without using a reference genome.
Raunaq Malhotra, Shruthi Prabhakara, Mary Poss, Raj Acharya
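The first two steps described in the abstract above (counting k-mers, then pairing k-mers that differ by one mutation) can be sketched as follows. The sketch assumes that "one mutation" means a single substitution, uses a simple quadratic pairwise comparison rather than whatever indexing the authors employ, and omits the haplotype reconstruction step entirely; the function names and toy reads are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def one_mutation_pairs(counts):
    """Return sorted pairs of observed k-mers at Hamming distance 1."""
    pairs = []
    for a, b in combinations(sorted(counts), 2):
        if sum(x != y for x, y in zip(a, b)) == 1:
            pairs.append((a, b))
    return pairs


# Two toy reads that differ only in their last base.
reads = ["ACGT", "ACGA"]
counts = count_kmers(reads, 3)

print(dict(counts))
print(one_mutation_pairs(counts))
```

For realistic data the all-pairs comparison would be too slow; generating the 3k single-substitution neighbors of each k-mer and looking them up in the count table is a common alternative.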
Fast Computation of Entropic Profiles for the Detection of Conservation in Genomes
Abstract
Information theory has been used for quite some time in the area of computational biology. In this paper we discuss and improve the function Entropic Profile, introduced by Vinga and Almeida in [23]. The Entropic Profiler is a function of the genomic location that captures the importance of that region with respect to the whole genome. We provide a linear-time, linear-space algorithm called Fast Entropic Profile, as opposed to the original quadratic implementation. Moreover, we propose an alternative normalization that can also be efficiently implemented. We show that Fast EP is suitable for large genomes and for the discovery of motifs with unbounded length.
Matteo Comin, Morris Antonello
Backmatter
Metadata
Title
Pattern Recognition in Bioinformatics
Editors
Alioune Ngom
Enrico Formenti
Jin-Kao Hao
Xing-Ming Zhao
Twan van Laarhoven
Copyright Year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-39159-0
Print ISBN
978-3-642-39158-3
DOI
https://doi.org/10.1007/978-3-642-39159-0
