Skip to main content



Classification of Biological Sequences


Sequence-Based Prediction of Protein Secretion Success in Aspergillus niger

The cell-factory

Aspergillus niger

is widely used for industrial enzyme production. To select potential proteins for large-scale production, we developed a sequence-based classifier that predicts if an over-expressed homologous protein will successfully be produced and secreted. A dataset of 638 proteins was used to train and validate a classifier, using a 10-fold cross-validation protocol. Using a linear discriminant classifier, an average accuracy of 0.85 was achieved. Feature selection results indicate what features are mostly defining for successful protein production, which could be an interesting lead to couple sequence characteristics to biological processes involved in protein production and secretion

Bastiaan A. van den Berg, Jurgen F. Nijkamp, Marcel J. T. Reinders, Liang Wu, Herman J. Pel, Johannes A. Roubos, Dick de Ridder

Machine Learning Study of DNA Binding by Transcription Factors from the LacI Family

We studied 1372 LacI-family transcription factors and their 4484 DNA binding sites using machine learning algorithms and feature selection techniques. The Naive Bayes classifier and Logistic Regression were used to predict binding sites given transcription factor sequences. Prediction accuracy was estimated using 10-fold cross-validation. Experiments showed that the best prediction of nucleotide densities at selected site positions is obtained using only a few key protein sequence positions. These positions are stably selected by the forward feature selection based on the mutual information of factor-site position pairs.

Gennady G. Fedonin, Mikhail S. Gelfand

Joint Loop End Modeling Improves Covariance Model Based Non-coding RNA Gene Search

The effect of more detailed modeling of the interface between stem and loop in non-coding RNA hairpin structures on efficacy of covariance-model-based non-coding RNA gene search is examined. Currently, the prior probabilities of the two stem nucleotides and two loop-end nucleotides at the interface are treated the same as any other stem and loop nucleotides respectively. Laboratory thermodynamic studies show that hairpin stability is dependent on the identities of these four nucleotides, but this is not taken into account in current covariance models. It is shown that separate estimation of emission priors for these nucleotides and joint treatment of substitution probabilities for the two loop-end nucleotides leads to improved non-coding RNA gene search.

Jennifer Smith

Structured Output Prediction of Anti-cancer Drug Activity

We present a structured output prediction approach for classifying potential anti-cancer drugs. Our QSAR model takes as input a description of a molecule and predicts the activity against a set of cancer cell lines in one shot. Statistical dependencies between the cell lines are encoded by a Markov network that has cell lines as nodes and edges represent similarity according to an auxiliary dataset. Molecules are represented via kernels based on molecular graphs. Margin-based learning is applied to separate correct multilabels from incorrect ones. The performance of the multilabel classification method is shown in our experiments with NCI-Cancer data containing the cancer inhibition potential of drug-like molecules against 59 cancer cell lines. In the experiments, our method outperforms the state-of-the-art SVM method.

Hongyu Su, Markus Heinonen, Juho Rousu

SLiMSearch: A Webserver for Finding Novel Occurrences of Short Linear Motifs in Proteins, Incorporating Sequence Context

Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch (Short, Linear Motif Search) webserver is a flexible tool that enables researchers to identify novel occurrences of pre-defined SLiMs in sets of proteins. Numerous masking options give the user great control over the contextual information to be included in the analyses, including evolutionary filtering and protein structural disorder. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. Users can search motifs against the human proteome, or submit their own datasets of UniProt proteins, in which case motif support within the dataset is statistically assessed for over- and under-representation, accounting for evolutionary relationships between input proteins. SLiMSearch is freely available as open source Python modules and all webserver results are available for download. The SLiMSearch server is available at:


Norman E. Davey, Niall J. Haslam, Denis C. Shields, Richard J. Edwards

Towards 3D Modeling of Interacting TM Helix Pairs Based on Classification of Helix Pair Sequence

Spatial structures of transmembrane proteins are difficult to obtain either experimentally or by computational methods. Recognition of helix-helix contacts conformations, which provide structural skeleton of many transmembrane proteins, is essential in the modeling. Majority of helix-helix interactions in transmembrane proteins can be accurately clustered into a few classes on the basis of their 3D shape. We propose a Stochastic Context Free Grammars framework, combined with evolutionary algorithm, to represent sequence level features of these classes. The descriptors were tested using independent test sets and typically achieved the areas under ROC curves 0.60-0.70; some reached 0.77.

Witold Dyrka, Jean-Christophe Nebel, Malgorzata Kotulska

Optimization Algorithms for Identification and Genotyping of Copy Number Polymorphisms in Human Populations

Recent studies show that copy number polymorphisms (CNPs), defined as genome segments that are polymorphic with regard to genomic copy number and segregate at greater than 1% frequency in the populations, are associated with various diseases. Since rare copy number variations (CNVs) and CNPs bear different characteristics, the problem of discovering CNPs presents opportunities beyond what is available to algorithms that are designed to identify rare CNVs. We present a method for identifying and genotyping common CNPs. The proposed method, POLYGON, produces copy number genotypes of the samples at each CNP and fine-tunes its boundaries by framing CNP identification and genotyping as an optimization problem with an explicitly formulated objective function. We apply POLYGON to data from hundreds of samples and demonstrate that it significantly improves the performance of existing single-sample CNV identification methods. We also demonstrate its superior performance as compared to two other CNP identification/genotyping methods.

Gökhan Yavaş, Mehmet Koyutürk, Thomas LaFramboise

Preservation of Statistically Significant Patterns in Multiresolution 0-1 Data

Measurements in biology are made with high throughput and high resolution techniques often resulting in data in multiple resolutions. Currently, available standard algorithms can only handle data in one resolution. Generative models such as mixture models are often used to model such data. However, significance of the patterns generated by generative models has so far received inadequate attention. This paper analyses the statistical significance of the patterns preserved in sampling between different resolutions and when sampling from a generative model. Furthermore, we study the effect of noise on the likelihood with respect to the changing resolutions and sample size. Finite mixture of multivariate Bernoulli distribution is used to model amplification patterns in cancer in multiple resolutions. Statistically significant itemsets are identified in original data and data sampled from the generative models using randomization and their relationships are studied. The results showed that statistically significant itemsets are effectively preserved by mixture models. The preservation is more accurate in coarse resolution compared to the finer resolution. Furthermore, the effect of noise on data on higher resolution and with smaller number of sample size is higher than the data in lower resolution and with higher number of sample size.

Prem Raj Adhikari, Jaakko Hollmén

Novel Machine Learning Methods for MHC Class I Binding Prediction

MHC class I molecules are key players in the human immune system. They bind small peptides derived from intracellular proteins and present them on the cell surface for surveillance by the immune system. Prediction of such MHC class I binding peptides is a vital step in the design of peptide-based vaccines and therefore one of the major problems in computational immunology. Thousands of different types of MHC class I molecules exist, each displaying a distinct binding specificity. The lack of sufficient training data for the majority of these molecules hinders the application of Machine Learning to this problem.

We propose two approaches to improve the predictive power of kernel-based Machine Learning methods for MHC class I binding prediction: First, a modification of the Weighted Degree string kernel that allows for the incorporation of amino acid properties. Second, we propose an enhanced Multitask kernel and an optimization procedure to fine-tune the kernel parameters. The combination of both approaches yields improved performance, which we demonstrate on the IEDB benchmark data set.

Christian Widmer, Nora C. Toussaint, Yasemin Altun, Oliver Kohlbacher, Gunnar Rätsch

Unsupervised Learning Methods for Biological Sequences


SIMCOMP: A Hybrid Soft Clustering of Metagenome Reads

A major challenge facing metagenomics is the development of tools for the characterization of functional and taxonomic content of vast amounts of short metagenome reads. In this paper, we present a two pass semi-supervised algorithm, SimComp, for soft clustering of short metagenome reads, that is a hybrid of comparative and composition based methods. In the first pass, a comparative analysis of the metagenome reads against BLASTx extracts the reference sequences from within the metagenome to form an initial set of seeded clusters. Those reads that have a significant match to the database are clustered by their phylogenetic provenance. In the second pass, the remaining fraction of reads are characterized by their species-specific composition based characteristics. SimComp groups the reads into overlapping clusters, each with its read leader. We make no assumptions about the taxonomic distribution of the dataset. The overlap between the clusters elegantly handles the challenges posed by the nature of the metagenomic data. The resulting cluster leaders can be used as an accurate estimate of the phylogenetic composition of the metagenomic dataset. Our method enriches the dataset into a small number of clusters, while accurately assigning fragments as small as 100 base pairs.

Shruthi Prabhakara, Raj Acharya

The Complexity and Application of Syntactic Pattern Recognition Using Finite Inductive Strings

We describe herein the results of implementing an algorithm for syntactic pattern recognition using the concept of Finite Inductive Sequences (FI). We discuss this idea, and then provide a big O estimate of the time to execute for the algorithms. We then provide some empirical data to support the analysis of the timing. This timing is critical if one wants to process millions of symbols from multiple sequences simultaneously. Lastly, we provide an example of the two FI algorithms applied to actual data taken from a gene and then describe some results as well as the associated data derived from this example.

Elijah Myers, Paul S. Fisher, Keith Irwin, Jinsuk Baek, Joao Setubal

An Algorithm to Find All Identical Motifs in Multiple Biological Sequences

Sequence motifs are of greater biological importance in nucleotide and protein sequences. The conserved occurrence of identical motifs represents the functional significance and helps to classify the biological sequences. In this paper, a new algorithm is proposed to find all identical motifs in multiple nucleotide or protein sequences. The proposed algorithm uses the concept of dynamic programming. The application of this algorithm includes the identification of (a) conserved identical sequence motifs and (b) identical or direct repeat sequence motifs across multiple biological sequences (nucleotide or protein sequences). Further, the proposed algorithm facilitates the analysis of comparative internal sequence repeats for the evolutionary studies which helps to derive the phylogenetic relationships from the distribution of repeats.

Ashish Kishor Bindal, R. Sabarinathan, J. Sridhar, D. Sherlin, K. Sekar

Discovery of Non-induced Patterns from Sequences

Discovering patterns from sequence data has significant impact in genomics, proteomics and business. A problem commonly encountered is that the patterns discovered often contain many redundancies resulted from fake significant patterns induced by their strong statistically significant subpatterns. The concept of statistically induced patterns is proposed to capture these redundancies. An algorithm is then developed to efficiently discover non-induced significant patterns from a large sequence dataset. For performance evaluation, two experiments were conducted to demonstrate a) the seriousness of the problem using synthetic data and b) top non-induced significant patterns discovered from Saccharomyces cerevisiae (Yeast) do correspond to the transcription factor binding sites found by the biologists. The experiments confirm the effectiveness of our method in generating a relatively small set of patterns revealing interesting, unknown information inherent in the sequences.

Andrew K. C. Wong, Dennis Zhuang, Gary C. L. Li, En-Shiun Annie Lee

Exploring Homology Using the Concept of Three-State Entropy Vector

The three-base periodicity usually found in exons has been used for several purposes, as for example the prediction of potential genes. In this paper, we use a data model, previously proposed for encoding protein-coding regions of DNA sequences, to build signatures capable of supporting the construction of meaningful dendograms. The model relies on the three-base periodicity and provides an estimate of the entropy associated with each of the three bases of the codons. We observe that the three entropy values vary among themselves and also from species to species. Moreover, we provide evidence that this makes it possible to associate a three-state entropy vector with each species, and we show that similar species are characterized by similar three-state entropy vectors.

Armando J. Pinho, Sara P. Garcia, Paulo J. S. G. Ferreira, Vera Afreixo, Carlos A. C. Bastos, António J. R. Neves, João M. O. S. Rodrigues

A Maximum-Likelihood Formulation and EM Algorithm for the Protein Multiple Alignment Problem

A given group of protein sequences of different lengths is considered as resulting from random transformations of independent random ancestor sequences of the same preset smaller length, each produced in accordance with an unknown common probabilistic profile. We describe the process of transformation by a Hidden Markov Model (HMM) which is a direct generalization of the PAM model for amino acids. We formulate the problem of finding the maximum likelihood probabilistic ancestor profile and demonstrate its practicality. The proposed method of solving this problem allows for obtaining simultaneously the ancestor profile and the posterior distribution of its HMM, which permits efficient determination of the most probable multiple alignment of all the sequences. Results obtained on the BAliBASE 3.0 protein alignment benchmark indicate that the proposed method is generally more accurate than popular methods of multiple alignment such as CLUSTALW, DIALIGN and ProbAlign.

Valentina Sulimova, Nikolay Razin, Vadim Mottl, Ilya Muchnik, Casimir Kulikowski

Polynomial Supertree Methods Revisited

Supertree methods allow to reconstruct large phylogenetic trees by combining smaller trees with overlapping leaf sets, into one, more comprehensive supertree. The most commonly used supertree method, matrix representation with parsimony (MRP), produces accurate supertrees but is rather slow due to the underlying hard optimization problem. In this paper, we present an extensive simulation study comparing the performance of MRP and the polynomial supertree methods

MinCut Supertree


Modified MinCut Supertree





, and


. We consider both quality and resolution of the reconstructed supertrees. Our findings illustrate the trade-off between accuracy and running time in supertree construction, as well as the pros and cons of voting- and veto-based supertree approaches.

Malte Brinkmeyer, Thasso Griebel, Sebastian Böcker

Enhancing Graph Database Indexing by Suffix Tree Structure

Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, scientists require systems that search for all occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed.

This paper presents GraphGrepSX. The system implements efficient graph searching algorithms together with an advanced filtering technique.

GraphGrepSX is compared with SING, GraphFind, CTree and GCoding. Experiments show that GraphGrepSX outperforms the compared systems on a very large collection of molecular data. In particular, it reduces the size and the time for the construction of large database index and outperforms the most popular systems.

Vincenzo Bonnici, Alfredo Ferro, Rosalba Giugno, Alfredo Pulvirenti, Dennis Shasha

Learning Methods for Gene Expression and Mass Spectrometry Data


Semi-Supervised Graph Embedding Scheme with Active Learning (SSGEAL): Classifying High Dimensional Biomedical Data

In this paper, we present a new dimensionality reduction (DR) method (SSGEAL) which integrates Graph Embedding (GE) with semi-supervised and active learning to provide a low dimensional data representation that allows for better class separation. Unsupervised DR methods such as Principal Component Analysis and GE have previously been applied to the classification of high dimensional biomedical datasets (e.g. DNA microarrays and digitized histopathology) in the reduced dimensional space. However, these methods do not incorporate class label information, often leading to embeddings with significant overlap between the data classes. Semi-supervised dimensionality reduction (SSDR) methods have recently been proposed which utilize both labeled and unlabeled instances for learning the optimal low dimensional embedding. However, in several problems involving biomedical data, obtaining class labels may be difficult and/or expensive. SSGEAL utilizes labels from instances, identified as “hard to classify” by a support vector machine based active learning algorithm, to drive an updated SSDR scheme while reducing labeling cost. Real world biomedical data from 7 gene expression studies and 3900 digitized images of prostate cancer needle biopsies were used to show the superior performance of SSGEAL compared to both GE and SSAGE (a recently popular SSDR method) in terms of both the Silhouette Index (SI) (SI = 0.35 for GE, SI = 0.31 for SSAGE, and SI = 0.50 for SSGEAL) and the Area Under the Receiver Operating Characteristic Curve (AUC) for a Random Forest classifier (AUC = 0.85 for GE, AUC = 0.93 for SSAGE, AUC = 0.94 for SSGEAL).

George Lee, Anant Madabhushi

Iterated Local Search for Biclustering of Microarray Data

In the context of microarray data analysis, biclustering aims to identify simultaneously a group of genes that are highly correlated across a group of experimental conditions. This paper presents a Biclustering Iterative Local Search (BILS) algorithm to the problem of biclustering of microarray data. The proposed algorithm is highlighted by the use of some original features including a new evaluation function, a dedicated neighborhood relation and a tailored perturbation strategy. The BILS algorithm is assessed on the well-known yeast cell-cycle dataset and compared with two most popular algorithms.

Wassim Ayadi, Mourad Elloumi, Jin-Kao Hao

Biologically-aware Latent Dirichlet Allocation (BaLDA) for the Classification of Expression Microarray

Topic models have recently shown to be really useful tools for the analysis of microarray experiments. In particular they have been successfully applied to gene clustering and, very recently, also to samples classification. In this latter case, nevertheless, the basic assumption of functional independence between genes is limiting, since many other a priori information about genes’ interactions may be available (co-regulation, spatial proximity or other a priori knowledge). In this paper a novel topic model is proposed, which enriches and extends the Latent Dirichlet Allocation (LDA) model by integrating such dependencies, encoded in a categorization of genes. The proposed topic model is used to derive a highly informative and discriminant representation for microarray experiments. Its usefulness, in comparison with standard topic models, has been demonstrated in two different classification tests.

Alessandro Perina, Pietro Lovato, Vittorio Murino, Manuele Bicego

Measuring the Quality of Shifting and Scaling Patterns in Biclusters

The most widespread biclustering algorithms use the Mean Squared Residue (MSR) as measure for assessing the quality of biclusters. MSR can identify correctly shifting patterns, but fails at discovering biclusters presenting scaling patterns. Virtual Error (VE) is a measure which improves the performance of MSR in this sense, since it is effective at recognizing biclusters containing shifting patters or scaling patterns as quality biclusters. However, VE presents some drawbacks when the biclusters present both kind of patterns simultaneously. In this paper, we propose a improvement of VE that can be integrated in any heuristic to discover biclusters with shifting and scaling patterns simultaneously.

Beatriz Pontes, Raúl Giráldez, Jesús S. Aguilar-Ruiz

Frequent Episode Mining to Support Pattern Analysis in Developmental Biology

We introduce a new method for the analysis of heterochrony in developmental biology. Our method is based on methods used in data mining and intelligent data analysis and applied in, e.g., shopping basket analysis, alarm network analysis and click stream analysis. We have transferred, so called, frequent episode mining to operate in the analysis of developmental timing of different (model) species. This is accomplished by extracting small temporal patterns, i.e. episodes, and subsequently comparing the species based on extracted patterns. The method allows relating the development of different species based on different types of data. In examples we show that the method can reconstruct a phylogenetic tree based on gene-expression data as well as using strict morphological characters. The method can deal with incomplete and/or missing data. Moreover, the method is flexible and not restricted to one particular type of data: i.e., our method allows comparison of species and genes as well as morphological characters based on developmental patterns by simply transposing the dataset accordingly. We illustrate a range of applications.

Ronnie Bathoorn, Monique Welten, Michael Richardson, Arno Siebes, Fons J. Verbeek

Time Series Gene Expression Data Classification via L 1-norm Temporal SVM

Machine learning methods have been successfully applied to the phenotype classification of many diseases based on static gene expression measurements. More recently microarray data have been collected over time, making available datasets composed by time series of expression gene profiles. In this paper we propose a new method for time series classification, based on a temporal extension of



-norm support vector machines, that uses dynamic time warping distance for measuring time series similarity. This results in a mixed-integer optimization model which is solved by a sequential approximation algorithm. Computational tests performed on two benchmark datasets indicate the effectiveness of the proposed method compared to other techniques, and the general usefulness of the approaches based on dynamic time warping for labeling time series gene expression data.

Carlotta Orsenigo, Carlo Vercellis



Sub-grid and Spot Detection in DNA Microarray Images Using Optimal Multi-level Thresholding

The analysis of DNA microarray images is a crucial step in gene expression analysis, since any errors in early stages are propagated in future steps in the analysis. When processing the underlying images, accurately separating the sub-grids and spots is of extreme importance for subsequent steps that include segmentation, quantification, normalization and clustering. We propose a fully automatic approach that first detects the sub-grids given the entire microarray image, and then detects the locations of the spots in each sub-grid. The approach first detects and corrects rotations in the images by an affine transformation, followed by a polynomial-time optimal multi-level thresholding algorithm to find the positions of the sub-grids and spots. Additionally, a new validity index is proposed in order to find the correct number of sub-grids in the microarray image, and the correct number of spots in each sub-grid. Extensive experiments on real-life microarray images show that the method performs these tasks automatically and with a high degree of accuracy.

Iman Rezaeian, Luis Rueda

Quantification of Cytoskeletal Protein Localization from High-Content Images

Cytoskeletal proteins function as dynamic and complex components in many aspects of cell physiology and the maintenance of cell structure. However, very little is known about the coordinated system of these proteins. The knowledge of subcellular localization of proteins is crucial for understanding how proteins function within a cell. We present a framework for quantification of cytoskeletal protein localization from high-content microscopic images. Analyses of high content images of cells transfected by cytoskeleton genes involve individual cell segmentation, intensity transformation of subcellular compartments, protein segmentation based on correlation coefficients, and colocalization quantification of proteins in subcellular components. By quantifying the abundance of proteins in different compartments, we generate colocalization profiles that give insights into the functions of different cytoskeletal proteins.

Shiwen Zhu, Paul Matsudaira, Roy Welsch, Jagath C. Rajapakse

Pattern Recognition for High Throughput Zebrafish Imaging Using Genetic Algorithm Optimization

In this paper we present a novel approach for image based high–throughput analysis of zebrafish embryos. Zebrafish embryos can be made available in high numbers; specifically in groups that have been exposed to different treatments. Preferably, the embryos are processed in batches. However, this complicates an automated processing as individual embryos need to be recognized. We present an approach in which the individual embryos are recognized and counted in an image with multiple instances and in multiple orientations. The recognition results in a mask that is used in the analysis of the images; multichannel images with bright–field and fluorescence are used.

The pattern recognition is based on a genetic algorithm which is the base of an optimization procedure through which the pattern is found. The optimization is accomplished by a deformable template that is incorporated in the genetic algorithm. We show that this approach is very robust and produces result fast so that it becomes very useful in a high–throughput environment. The method is fully automated and does not require any human intervention. We have tested our approach on both synthetic and real life images (zebrafish embryos). The results indicate that the method can be applied to a broad range of pattern recognition problems that require a high–throughput approach.

Alexander E. Nezhinsky, Fons J. Verbeek

Consensus of Ambiguity: Theory and Application of Active Learning for Biomedical Image Analysis

Supervised classifiers require manually labeled training samples to classify unlabeled objects. Active Learning (AL) can be used to selectively label only “ambiguous” samples, ensuring that each labeled sample is maximally informative. This is invaluable in applications where manual labeling is expensive, as in medical images where annotation of specific pathologies or anatomical structures is usually only possible by an expert physician. Existing AL methods use a single definition of ambiguity, but there can be significant variation among individual methods. In this paper we present a consensus of ambiguity (CoA) approach to AL, where only samples which are consistently labeled as ambiguous across multiple AL schemes are selected for annotation. CoA-based AL uses fewer samples than Random Learning (RL) while exploiting the variance between individual AL schemes to efficiently label training sets for classifier training. We use a consensus ratio to determine the variance between AL methods, and the CoA approach is used to train classifiers for three different medical image datasets: 100 prostate histopathology images, 18 prostate DCE-MRI patient studies, and 9,000 breast histopathology regions of interest from 2 patients. We use a Probabilistic Boosting Tree (PBT) to classify each dataset as either cancer or non-cancer (prostate), or high or low grade cancer (breast). Trained is done using CoA-based AL, and is evaluated in terms of accuracy and area under the receiver operating characteristic curve (AUC). CoA training yielded between 0.01-0.05% greater performance than RL for the same training set size; approximately 5-10 more samples were required for RL to match the performance of CoA, suggesting that CoA is a more efficient training strategy.

Scott Doyle, Anant Madabhushi

Semi-supervised Learning of Sparse Linear Models in Mass Spectral Imaging

We present an approach to learn predictive models and perform variable selection by incorporating structural information from Mass Spectral Imaging (MSI) data. We explore the use of a smooth quadratic penalty to model the natural ordering of the physical variables, that is the mass-to-charge (




) ratios. Thereby, estimated model parameters for nearby variables are enforced to smoothly vary. Similarly, to overcome the lack of labeled data we model the spatial proximity among spectra by means of a connectivity graph over the set of predicted labels. We explore the usefulness of this approach in a mouse brain MSI data set.

Fabian Ojeda, Marco Signoretto, Raf Van de Plas, Etienne Waelkens, Bart De Moor, Johan A. K. Suykens

Molecular Structure Prediction


A Matrix Algorithm for RNA Secondary Structure Prediction

In this paper we propose a novel high-performance algorithm, referred to as MARSs (Matrix Algorithm for RNA Secondary Structure Prediction), for predicting RNA Secondary Structures with or without pseudoknots. The algorithm is capable of operating in both serial and parallel modes. The algorithm will take complete advantage of the explicit hardware parallelism increasingly available in todayś multi-core processors resulting in execution speedups. Unlike Dynamic Programming based algorithms, MARSs is non-recursive by design and therefore eliminates some of the disadvantages of Dynamic Programming based algorithms. We performed a large-scale experiment on a multi-core hardware using real sequences with verified structures. We detail and discuss the results from these experiments using metrics such as performance gains, run-times and prediction accuracy. This is one of the first attempts of its kind to provide a complete flexibility in evolving a RNA secondary structure with or without pseudoknots using a matrix-based approach.

S. P. T. Krishnan, Mushfique Junayed Khurshid, Bharadwaj Veeravalli

Exploiting Long-Range Dependencies in Protein β-Sheet Secondary Structure Prediction

We investigate if interactions of longer range than typically considered in local protein secondary structure prediction methods can be captured in a simple machine learning framework to improve the prediction of


sheets. We use support vector machines and recursive feature elimination to show that the small signals available in long range interactions can indeed be exploited. The improvement is small but statistically significant on the benchmark datasets we used. We also show that feature selection within a long window and over amino acids at specific positions typically selects amino acids that are shown to be more relevant in the initiation and termination of


-sheet formation.

Yizhao Ni, Mahesan Niranjan

Alpha Helix Prediction Based on Evolutionary Computation

Multiple approaches have been developed in order to predict the protein secondary structure. In this paper, we propose an approach to such a problem based on evolutionary computation. The proposed approach considers various amino acids properties in order to predict the secondary structure of a protein. In particular, we will consider the hydrophobicity, the polarity and the charge of amino acids. In this study, we focus on predicting a particular kind of secondary structure:


-helices. The results of our proposal will be a set of rules that will identify the beginning or the end of such a structure.

Alfonso E. Márquez Chamorro, Federico Divina, Jesús S. Aguilar Ruiz, Gualberto Asencio Cortés

An On/Off Lattice Approach to Protein Structure Prediction from Contact Maps

An important unsolved problem in structural bioinformatics is that of protein structure prediction (PSP), the reconstruction of a biologically plausible three-dimensional structure for a given protein given only its amino acid sequence. The PSP problem is of enormous interest, because the function of proteins is a direct consequence of their three-dimensional structure. Approaches to solve the PSP use protein models that range from very realistic (all-atom) to very simple (on a lattice). Finer representations usually generate better candidate structures, but are computationally more costly than the simpler on-lattice ones. In this work we propose a combined approach that makes use of a simple and fast lattice protein structure prediction algorithm, REMC-HPPFP, to compute a number of coarse candidate structures. These are later refined by 3Distill, an off-lattice, residue-level protein structure predictor. We prove that the lattice algorithm is able to bootstrap 3Distill, which consequently converges much faster, allowing for shorter execution times without noticeably degrading the quality of the predictions. This novel method allows us to generate a large set of decoys of quality comparable to those computed by the off-lattice method alone, but using a fraction of the computations. As a result, our method could be used to build large databases of predicted decoys for analysis, or for selecting the best candidate structures through reranking techniques. Furthermore our method is generic, in that it can be applied to other algorithms than 3Distill.

Stefano Teso, Cristina Di Risio, Andrea Passerini, Roberto Battiti

Protein Protein Interaction and Network Inference


Biological Protein-Protein Interaction Prediction Using Binding Free Energies and Linear Dimensionality Reduction

An important issue in understanding and classifying protein-protein interactions (PPI) is to characterize their interfaces in order to discriminate between transient and obligate complexes. We propose a classification approach to discriminate between these two types of complexes. Our approach uses contact and binding free energies of the residues present in the interaction, which are the input features for the classifiers. A total of 282 features are extracted for each complex, and the classification is performed via recently proposed dimensionality reduction (LDR) methods, including the well-know Fisher’s discriminant analysis and two heteroscedastic approaches. The results on a standard benchmark of transient and obligate protein complexes show that LDR approaches achieve a very high classification accuracy (over 78%), outperforming various support vector machines and nearest-neighbor classifiers. An additional insight on the proposed approach and experiments on different subsets of features shows that solvation energies can be used in the classification, leading to a performance comparable to using the full binding free energies of the interaction.

L. Rueda, Carolina Garate, Sridip Banerjee, Md. Mominul Aziz

Employing Publically Available Biological Expert Knowledge from Protein-Protein Interaction Information

Genome wide association studies (GWAS) are now allowing researchers to probe the depths of common complex human diseases, yet few have identified single sequence variants that confer disease susceptibility. As hypothesized, this is due the fact that multiple interacting factors influence clinical endpoint. Given the number of single nucleotide polymorphisms (SNPs) combinations grows exponentially with the number of SNPs being analyzed, computational methods designed to detect these interactions in smaller datasets are thus not applicable. Providing statistical expert knowledge has exhibited an improvement in their performance, and we believe biological expert knowledge to be as capable. Since one of the strongest demonstrations of the functional relationship between genes is protein-protein interactions, we present a method that exploits this information in genetic analyses. This study provides a step towards utilizing expert knowledge derived from public biological sources to assist computational intelligence algorithms in the search for epistasis.

Kristine A. Pattin, Jiang Gui, Jason H. Moore

SFFS-MR: A Floating Search Strategy for GRNs Inference

An important problem in the bioinformatics field is the inference of gene regulatory networks (GRN) from temporal expression profiles. In general, the main limitations faced by GRN inference methods is the small number of samples with huge dimensionalities and the noisy nature of the expression measurements. In face of these limitations, alternatives are needed to get better accuracy on the GRNs inference problem. In this context, this work addresses this problem by presenting an alternative feature selection method that applies prior knowledge on its search strategy, called SFFS-MR. The proposed search strategy is based on SFFS algorithm, with the inclusion of multiple roots at the beginning of the search, which are defined by the best and worst single results of the SFS algorithm. In this way, the search space traversed is guided by these roots in order to find the predictor genes for a given target gene, specially to better identify genes presenting intrinsically multivariate prediction, without worsening the asymptotical computational cost of the SFFS. Experimental results show that the SFFS-MR provides a better inference accuracy than SFS and SFFS, maintaining a similar robustness of the SFS and SFFS methods. In addition, the SFFS-MR was able to achieve 60% of accuracy on network recovery after only 20 observations from a state space of size 2


, thus presenting very good results.

Fabrício M. Lopes, David C. Martins, Junior Barrera, Roberto M. Cesar

Revisiting the Voronoi Description of Protein-Protein Interfaces: Algorithms

Describing macro-molecular interfaces is key to improve our understanding of the specificity and of the stability of macro-molecular interactions, and also to predict complexes when little structural information is known. Ideally, an interface model should provide easy-to-compute geometric and topological parameters exhibiting a good correlation with important bio-physical quantities. It should also be parametric and amenable to comparisons. In this spirit, we recently developed an interface model based on Voronoi diagrams, which proved instrumental to refine state-of-the-art conclusions and provide new insights.

This paper formally presents this Voronoi interface model. First, we discuss its connexion to classical interface models based on distance cut-offs and solvent accessibility. Second, we develop the geometric and topological constructions underlying the Voronoi interface, and design efficient algorithms based on the Delaunay triangulation and the



We conclude with perspectives. In particular, we expect the Voronoi interface model to be particularly well suited for the problem of comparing interfaces in the context of large-scale structural studies.

Frederic Cazals

MC4: A Tempering Algorithm for Large-Sample Network Inference

Bayesian networks and their variants are widely used for modelling gene regulatory and protein signalling networks. In many settings, it is the underlying network structure itself that is the object of inference. Within a Bayesian framework inferences regarding network structure are made via a posterior probability distribution over graphs. However, in practical problems, the space of graphs is usually too large to permit exact inference, motivating the use of approximate approaches. An MCMC-based algorithm known as MC


is widely used for network inference in this setting. We argue that recent trends towards larger sample size datasets, while otherwise advantageous, can, for reasons related to concentration of posterior mass, render inference by MC



. We therefore exploit an approach known as parallel tempering to put forward an algorithm for network inference which we call MC


. We show empirical results on both synthetic and proteomic data which highlight the ability of MC


to converge faster and thereby yield demonstrably accurate results, even in challenging settings where MC



Daniel James Barker, Steven M. Hill, Sach Mukherjee

Flow-Based Bayesian Estimation of Nonlinear Differential Equations for Modeling Biological Networks

We consider the problem of estimating parameters and unobserved trajectories in nonlinear ordinary differential equations (ODEs) from noisy and partially observed data. We focus on a class of state-space models defined from the integration of the differential equation in the evolution equation. Within a Bayesian framework, we derive a non-sequential estimation procedure that infers the parameters and the initial condition of the ODE, taking into account that both are required to fully characterize the solution of the ODE. This point of view, new in the context of state-space models, modifies the learning problem. To evaluate the relevance of this approach, we use an Adaptive Importance Sampling in a population Monte Carlo scheme to approximate the posterior probability distribution. We compare this approach to recursive estimation via Unscented Kalman Filtering on two reverse-modeling problems in systems biology. On both problems, our method improves on classical smoothing methods used in state space models for the estimation of unobserved trajectories.

Nicolas J. -B. Brunel, Florence d’Alché-Buc


Weitere Informationen

Premium Partner