2014 | Book

Bioinformatics Research and Applications

10th International Symposium, ISBRA 2014, Zhangjiajie, China, June 28-30, 2014. Proceedings

Edited by: Mitra Basu, Yi Pan, Jianxin Wang

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

This book constitutes the refereed proceedings of the 10th International Symposium on Bioinformatics Research and Applications, ISBRA 2014, held in Zhangjiajie, China, in June 2014. The 33 revised full papers and 31 one-page abstracts included in this volume were carefully reviewed and selected from 119 submissions. The papers cover a wide range of topics in bioinformatics and computational biology and their applications including the development of experimental or commercial systems.

Table of contents

Frontmatter

Full Papers

Predicting Disease Risks Using Feature Selection Based on Random Forest and Support Vector Machine

Disease risk prediction is an important task in biomedicine and bioinformatics. To address the problems of a high-dimensional feature space and high feature redundancy, and to improve the intelligibility of data mining results, a new wrapper method for feature selection based on random forest variable importance measures and a support vector machine is proposed. The proposed method combines sequential backward search and sequential forward search. Feature selection starts with the entire set of features in the dataset. At every iteration, two feature subsets are obtained. One subset removes the least important features and, at the same time, the most important feature; it is used to train the random forest and to compute feature importance for the next round of selection. The other subset removes only the least important features while retaining the most important feature; it is used as the candidate optimal feature subset to train the SVM classifier. Finally, the feature subset with the highest SVM classification accuracy is taken as the optimal feature subset. Experimental results on 11 UCI datasets, a real clinical dataset and a gene expression dataset show that the proposed algorithm can produce a smaller feature subset while improving classification accuracy.

Jing Yang, Dengju Yao, Xiaojuan Zhan, Xiaorong Zhan
Phylogenetic Bias in the Likelihood Method Caused by Missing Data Coupled with Among-Site Rate Variation: An Analytical Approach

More and more researchers in phylogenetics are concatenating gene sequences to produce supermatrices in the hope that larger data sets will lead to better phylogenetic resolution. Almost all of these supermatrices contain a high proportion of missing data, which could potentially cause phylogenetic bias. Previous studies aiming to identify the missing-data-mediated bias in the maximum likelihood method have noted a bias associated with among-site rate variation. However, this finding was obtained by sequence simulation and has been challenged by other simulation studies, with the controversy still unresolved. Here I illustrate analytically this bias caused by missing data coupled with among-site rate variation. This approach allows one to see how much the bias can contribute to likelihood differences among different topologies. The study highlights the point that, while supermatrices may lead to “robust” trees, such “robust” trees may be purchased with illegal phylogenetic currency.

Xuhua Xia
An Eigendecomposition Method for Protein Structure Alignment

The alignment of two protein structures is a fundamental problem in structural bioinformatics. Their structural similarity carries with it the connotation of similar functional behavior that could be exploited in various applications. In this paper, we model a protein as a polygonal chain of α-carbon residues in three dimensions and investigate the application of an eigendecomposition method due to Umeyama to the protein structure alignment problem. This method allows us to reduce the structural alignment problem to an approximate weighted graph matching problem.

The paper introduces two new algorithms, EDAlign_res and EDAlign_sse, for pairwise protein structure alignment. EDAlign_res identifies the best structural alignment of two equal-length proteins by refining the correspondence obtained from eigendecomposition so as to maximize a similarity measure, the TM-score, for the refined correspondence. EDAlign_sse, on the other hand, does not require the input proteins to be of equal length. It works in three stages: (1) it identifies a correspondence between secondary structure elements (i.e., SSE-pairs); (2) it identifies a correspondence between residues within SSE-pairs; (3) it applies a rigid transformation to report the structural alignment in space. The latter two steps are repeated until there is no further improvement in the alignment. We report the TM-score and cRMSD as measures of structural similarity. These new methods are able to report sequence- and topology-independent alignments, with similarity scores comparable to those of state-of-the-art algorithms such as TM-align and SuperPose.
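
The abstract reports the TM-score and cRMSD as similarity measures. As a rough illustration (not the EDAlign implementation), the sketch below evaluates both for a fixed residue correspondence, given the distances between aligned α-carbon pairs after superposition. It uses the standard TM-score definition with d0 = 1.24 (L - 15)^(1/3) - 1.8, where L is the length of the target protein. The function names and the decision to reject very short targets are assumptions made for this example, and the maximization over superpositions that the full TM-score requires is omitted.

```python
import math


def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms) of the TM-score."""
    # Assumption for this sketch: reject targets so short that d0 would be
    # non-positive (the formula only gives d0 > 0 from 19 residues upward).
    if target_length < 19:
        raise ValueError("target_length must be at least 19 for this sketch")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: distance in angstroms for each aligned residue pair.
    target_length: number of residues in the target (normalizing) protein.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def crmsd(distances):
    """Coordinate RMSD over the aligned residue pairs."""
    if not distances:
        raise ValueError("at least one aligned pair is required")
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

Because the sum is divided by the target length rather than by the number of aligned pairs, aligning only half of the target perfectly gives a score of 0.5, whereas the cRMSD of the same alignment would be 0.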

Satish Chandra Panigrahi, Asish Mukhopadhyay
Functional Interplay between Hemagglutinin and Neuraminidase of Pandemic 2009 H1N1 from the Perspective of Virus Evolution

Influenza type A viruses are classified into subtypes based on their two surface proteins, hemagglutinin (HA) and neuraminidase (NA). Our time series analysis of the strains of pandemic 2009 H1N1 collected from 2009 to 2013 demonstrated that the HA receptor binding preference of this virus in the USA, Europe, and Asia has been characteristic of swine H1N1 virus since 2009. However, its binding characteristics of seasonal human H1N1 and avian H1N1 have both been on a steady rise, with American strains showing the sharpest surge in 2013. The first increase could enhance viral transmission and replication in humans, and the second could boost its ability to cause infection deep in the lungs, which might explain the recent human deaths caused by this virus in Texas in December 2013. We further explored the corresponding NA activity of this virus to reveal the functional interdependence between HA and NA during the evolution and adaptation of this virus from 2009 to 2013. To understand the real causality, the amino acid substitutions in HA and NA that actually produced the mutations were also identified.

Wei Hu
Predicting Protein Submitochondrial Locations Using a K-Nearest Neighbors Method Based on the Bit-Score Weighted Euclidean Distance

Mitochondria are essential subcellular organelles found in eukaryotic cells. Knowing a protein’s subcellular or sub-subcellular location provides in-depth insight into the microenvironment where it interacts with other molecules and is crucial for inferring the protein’s function. Therefore, it is important to predict the submitochondrial localization of mitochondrial proteins. In this study, we introduced MitoBSKnn, a K-nearest neighbor method based on a bit-score weighted Euclidean distance, which is calculated from an extended version of pseudo-amino acid composition. We then improved the method by applying a heuristic feature selection process. Using the selected features, the final method achieved a 93% overall accuracy on the benchmarking dataset, which is higher than or comparable to other state-of-the-art methods. On a larger recently curated dataset, the method also achieved a consistent performance of 90% overall accuracy. MitoBSKnn is available at http://edisk.fandm.edu/jing.hu/mitobsknn/mitobsknn.html.
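
The abstract does not say how the bit score enters the distance, so the following is not MitoBSKnn. It is a minimal sketch of the general idea of K-nearest-neighbor classification under a weighted Euclidean distance, assuming one non-negative weight per feature supplied by the caller. The function names, the weight scheme and the toy labels in the usage note are all hypothetical.

```python
import math
from collections import Counter


def weighted_euclidean(x, y, weights):
    """Euclidean distance in which each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def knn_predict(query, training, weights, k=3):
    """Predict a label by majority vote among the k nearest training examples.

    training: list of (feature_vector, label) pairs.
    weights: one non-negative weight per feature (an assumption of this sketch).
    """
    neighbors = sorted(
        training,
        key=lambda item: weighted_euclidean(query, item[0], weights),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    # Counter.most_common keeps first-seen order among equal counts, so a tie
    # is resolved in favor of the label of the nearer neighbor.
    return votes.most_common(1)[0][0]
```

With training examples ([0, 10], "a") and ([3, 0], "b"), the query [0, 0] and k=1, uniform weights [1, 1] select "b", while weights [1, 0] ignore the second feature and select "a". This shows how the weighting alone can change the nearest neighbor.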

Jing Hu, Xianghe Yan
Algorithms Implemented for Cancer Gene Searching and Classifications

Understanding gene expression is an important factor in cancer diagnosis. One goal of this understanding is to implement cancer gene search and classification methods. However, cancer gene search and classification is challenging in that there is no obvious exact algorithm that can be applied individually to various cancer cells. This paper surveys the most common top-ranked algorithms implemented for cancer gene search and classification, and how they are implemented to reach better performance. The paper distinguishes algorithms implemented for bio-image analysis of cancer cells from algorithms based on DNA array data. The main purpose of this paper is to explore a road map towards presenting the most current algorithms implemented for cancer gene search and classification.

Murad M. Al-Rajab, Joan Lu
Dysregulated microRNA Profile in HeLa Cell Lines Induced by Lupeol

Lupeol has attracted considerable research attention because of its anticancer activity. This work presents the complete microRNA profile of lupeol-treated HeLa cell lines and a control group, and investigates the complete small RNA sequencing data analysis process, including microRNA annotation, novel microRNA prediction, dysregulated microRNA identification and microRNA target prediction. Based on single-replicate data, we applied the generalized fold change (GFOLD) algorithm to detect significantly regulated microRNAs. Furthermore, we adopted GOmir to predict targets of some microRNAs that have received considerable attention and to perform ontology analysis. The experimental results indicate that the predicted microRNAs are highly correlated with carcinogenesis.

Xiyuan Lu, Cuihong Dai, Aiju Hou, Jie Cui, Dayou Cheng, Dechang Xu
A Simulation for Proportional Biological Operational Mu-Circuit

Quantitatively controlling the expression of a target gene is challenging but highly desired in practice. We design a device, the Biological Proportional Operational Mu-circuit (P-BOM), incorporating an AND/OR gate and an operational amplifier into one circuit, and explore its behavior through simulation. The results imply that we can control the input-output relationship proportionally by manipulating the RBS of hrpR, hrpS, tetR and the output gene.

Dechang Xu, Zhipeng Cai, Ke Liu, Xiangmiao Zeng, Yujing Ouyang, Cuihong Dai, Aiju Hou, Dayou Cheng, Jianzhong Li
Computational Prediction of Human Saliva-Secreted Proteins

Using proteins in saliva as biomarkers has great advantages in early diagnosis and prognosis evaluation of health conditions or diseases. In this article, we present a computational method for predicting secreted proteins in human saliva. First, we collected currently known saliva-secreted proteins and representative proteins deemed not to be secreted extracellularly into saliva. Second, we pruned the negative data to address class imbalance, and then extracted relevant features from the physicochemical and sequence properties of all remaining proteins. After that, a support vector machine classifier was built, which achieved average sensitivity, specificity, precision, accuracy and Matthews correlation coefficient of 80.67%, 90.56%, 90.09%, 85.53% and 0.7168, respectively. These results indicate that the selected features and the model are effective. Finally, a screening test was applied to all human proteins in UniProt and yielded 5811 predicted saliva-secreted proteins, which may be used as biomarker candidates for further salivary diagnosis.
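
For readers unfamiliar with the five reported measures, the sketch below computes them from the counts of a binary confusion matrix using their standard definitions. The function name is an assumption, the code assumes that both classes and both predicted labels occur at least once, and the counts in the usage note are invented for illustration rather than taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.

    Assumes tp + fn > 0, tn + fp > 0 and tp + fp > 0 (no empty class and at
    least one positive prediction); otherwise a ZeroDivisionError is raised.
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the Matthews correlation coefficient is set to 0 when the
    # denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=8, fp=1, tn=9, fn=2)` gives a sensitivity of 0.8, a specificity of 0.9 and an accuracy of 0.85. Unlike accuracy, the Matthews correlation coefficient stays informative when the positive and negative classes differ in size, which is relevant to the class imbalance mentioned in the abstract.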

Ying Sun, Chunguang Zhou, Jiaxin Wang, Zhongbo Cao, Wei Du, Yan Wang
A Parallel Scheme for Three-Dimensional Reconstruction in Large-Field Electron Tomography

Large-field high-resolution electron tomography enables visualizing detailed mechanisms under global structure. As the field enlarges, the processing time increases and the distortions in reconstruction become more critical. Adopting a nonlinear projection model instead of a linear one can compensate for curvilinear trajectories, nonlinear electron optics and sample warping, but the processing time for reconstruction with a nonlinear projection model is considerable. In this work, we propose a new parallel strategy for block iterative reconstruction algorithms. We also adopt a page-based data transfer in this strategy so as to dramatically reduce the time spent on data transfer and communication. We have tested this parallel strategy, and it yields speedups of approximately 40 times in our experiments.

Jingrong Zhang, Xiaohua Wan, Fa Zhang, Fei Ren, Xuan Wang, Zhiyong Liu
An Improved Correlation Method Based on Rotation Invariant Feature for Automatic Particle Selection

Particle selection from cryo-electron microscopy (cryo-EM) images is very important for high-resolution reconstruction of macromolecular structure. However, the accuracy of existing selection methods is normally limited by the noise and low contrast of cryo-EM images. In this paper, we present an improved correlation method based on rotation-invariant features for automatic, fast particle selection. We first select a preliminary particle set using rotation-invariant features, then filter the preliminary set using correlation to reduce the interference of the high-noise background and improve the precision of the correlation method. We use a divide-and-conquer technique and a cascade strategy to improve the recognition ability of the features and reduce processing time. Experimental results on a benchmark of cryo-EM images show that our method can improve the accuracy of particle selection significantly.

Yu Chen, Fei Ren, Xiaohua Wan, Xuan Wang, Fa Zhang
An Effective Algorithm for Peptide de novo Sequencing from Mixture MS/MS Spectra

In the past decade, extensive research has been conducted on the computational analysis of mass spectrometry based proteomics data. Yet challenges remain, one of which is that the identification rate of the collected MS/MS spectra is rather low. One significant reason for this is the concurrent fragmentation of multiple precursors in a single MS/MS spectrum. Nearly all mainstream computational methods assume that the acquired spectra come from a single precursor, so they are not suitable for the identification of mixture spectra. In this research, we formulated the mixture spectra de novo sequencing problem mathematically, and proposed a dynamic programming algorithm for the problem. Experiments show that our proposed algorithm can serve as a complementary method for the identification of mixture spectra.

Yi Liu, Bin Ma, Kaizhong Zhang, Gilles Lajoie
Identifying Spurious Interactions in the Protein-Protein Interaction Networks Using Local Similarity Preserving Embedding

Over the last decade, the development of high-throughput techniques has resulted in a rapid accumulation of protein-protein interaction (PPI) data. However, high-throughput experimental interaction data are prone to a high level of noise. In this paper, we propose a new approach called Local Similarity Preserving Embedding (LSPE) for assessing the reliability of interactions. Unlike previous approaches which seek to preserve a global predefined distance matrix in the embedding space, LSPE tries to adaptively and locally learn a Euclidean embedding under the simple geometric assumption of PPI networks. The experimental results show that our approach substantially outperforms previous methods on PPI assessment problems. LSPE could thus facilitate further graph-based studies of PPIs and may help infer their hidden underlying biological knowledge.

Lin Zhu, Zhu-Hong You, De-Shuang Huang
Multiple RNA Interaction with Sub-optimal Solutions

The interaction of two RNA molecules involves a complex interplay between folding and binding that warranted recent developments in RNA-RNA interaction algorithms. However, biological mechanisms exist in which more than two RNAs take part in an interaction. It is reasonable to believe that interactions involving multiple RNAs are generally too complex to be treated pairwise. In addition, given a pool of RNAs, it is not trivial to predict which RNAs are interacting without sufficient biological knowledge. Therefore, structures resulting from multiple RNA interactions often cannot be predicted by the existing algorithms.

We recently proposed a system for multiple RNA interaction that overcomes the difficulties mentioned above by formulating a combinatorial optimization problem called Pegs and Rubber Bands. A solution to this problem encodes a structure of interacting RNAs. In general, however, the optimal solution obtained does not necessarily correspond to the actual structure observed experimentally. Moreover, a structure produced by interacting RNAs may not be unique. In this work, we extend our previous approach to generate multiple sub-optimal solutions. By clustering these solutions, we are able to reveal representatives that correspond to realistic structures. Specifically, our results on the U2-U6 complex in the spliceosome of yeast and the CopA-CopT complex in E. coli are consistent with published biological structures.

Syed Ali Ahmed, Saad Mneimneh
Application of Consensus String Matching in the Diagnosis of Allelic Heterogeneity (Extended Abstract)

In this paper, an algorithm is proposed that detects the existence of a common ancestor gene sequence under the non-overlapping inversion (reversed complement) metric, given two input DNA sequences. The theoretical average and worst-case time complexities of the algorithm are proven to be O(n^3) and O(n^4) respectively, where n is the length of the input sequences. In practice, however, they are found to be O(n^2) and O(n^3) respectively, where the worst case occurs when the two input sequences have a similarity of around 90%. Similarly, the theoretical worst-case space complexity is O(n^3), whereas it is O(n^2) in practice. The work is motivated by the goal of diagnosing an unknown genetic disease that shows allelic heterogeneity, a case where a normal gene mutates in different orders, resulting in two different gene sequences that cause two different genetic diseases. The algorithm can also be useful in the study of breed-related hereditary conditions to determine the genetic spread of a defective gene in the population.
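
The proposed algorithm itself is not described in the abstract. The sketch below only illustrates the underlying operation, an inversion that replaces a substring with its reversed complement, and how a set of non-overlapping inversions is applied to a DNA sequence. The function names and the half-open interval convention are assumptions made for this example.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reversed complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversions(seq, intervals):
    """Apply non-overlapping inversions to seq.

    intervals: iterable of half-open (start, end) index pairs; each selected
    substring is replaced by its reversed complement.
    """
    ordered = sorted(intervals)
    # Reject overlapping intervals, since the metric only allows inversions
    # that do not overlap.
    for (_, end1), (start2, _) in zip(ordered, ordered[1:]):
        if start2 < end1:
            raise ValueError("inversion intervals must not overlap")
    pieces = []
    position = 0
    for start, end in ordered:
        pieces.append(seq[position:start])                # untouched stretch
        pieces.append(reverse_complement(seq[start:end]))  # inverted stretch
        position = end
    pieces.append(seq[position:])
    return "".join(pieces)
```

Applying the same set of inversions twice restores the original sequence. For instance, `apply_inversions("AACGTT", [(0, 2), (4, 6)])` yields `"TTCGAA"`, and applying the same intervals to that result gives `"AACGTT"` again.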

Fatema Tuz Zohora, M. Sohel Rahman
Continuous Time Bayesian Networks for Gene Network Reconstruction: A Comparative Study on Time Course Data

Dynamic aspects of regulatory networks are typically investigated by measuring relevant variables at multiple points in time. Current state-of-the-art approaches for gene network reconstruction directly build on such data, making the strong assumption that the system evolves in a synchronous fashion and in discrete time. However, omics data generated with increasing time-course granularity allow gene networks to be modeled as systems whose state evolves in continuous time, thus improving the model’s expressiveness. In this work, continuous time Bayesian networks are proposed as a new approach for regulatory network reconstruction from time-course expression data. Their performance is compared to that of two state-of-the-art methods: dynamic Bayesian networks and Granger causality. The comparison is carried out using both simulated and experimental data. Continuous time Bayesian networks achieve the highest F-measure on both datasets. Furthermore, their precision, recall and F-measure degrade more smoothly than those of dynamic Bayesian networks and Granger causality as the complexity of the gene regulatory network increases.

Enzo Acerbi, Fabio Stella
Drug Target Identification Based on Structural Output Controllability of Complex Networks

Identifying drug targets is one of the most important tasks in systems biology. In this paper, we develop a method to identify drug targets in biomolecular networks based on the structural output controllability of complex networks. Drug target identification is formulated as the problem of finding steering nodes in networks. By applying control signals to these nodes, the biomolecular networks can be driven from one state to another. Based on control theory, a graph-theoretic algorithm is proposed to find a minimum set of steering nodes in biomolecular networks, which can serve as a potential set of drug targets. An illustrative example shows how the proposed method works. Results of applying the method to real metabolic networks are supported by existing research.

Lin Wu, Yichao Shen, Min Li, Fang-Xiang Wu
NovoGMET: De Novo Peptide Sequencing Using Graphs with Multiple Edge Types (GMET) for ETD/ECD Spectra

De novo peptide sequencing using tandem mass spectrometry (MS/MS) data has become a major computational method for sequence identification in recent years. With the development of new instruments and technology, novel computational methods have emerged with enhanced performance. However, there are only a few methods focusing on ECD/ETD spectra, which mainly contain variants of c-ions and z-ions. A de novo sequencing method for ECD/ETD spectra, NovoGMET, is presented here and compared with another successful de novo sequencing method, pNovo+, which has an option for ECD/ETD spectra. The proposed method applies a new spectrum graph with multiple edge types (GMET), considers multiple peptide tags, and integrates amino acid combination (AAC) and fragment ion charge information. Experiments conducted on three different datasets show that the average full length peptide identification accuracy of NovoGMET is as high as 88.70%, and that NovoGMET’s average accuracy is more than 20% greater on all datasets as compared to pNovo+.

Yan Yan, Anthony J. Kusalik, Fang-Xiang Wu
Duplication Cost Diameters

The gene duplication problem seeks a species tree that reconciles given gene trees with the minimum number of gene duplication events, called gene duplication cost. To better assess species trees inferred by the gene duplication problem we study diameters of the gene duplication cost, which describe fundamental mathematical properties of this cost. The gene duplication cost is defined for a gene tree, a species tree, and a leaf labeling function that maps the leaf-genes of the gene tree to the leaf-species. The diameters of this cost are its maximal values when one topology or both topologies of the trees involved are fixed under all possible leaf labelings, and are fundamental in understanding how gene trees and species trees relate. We describe the properties and formulas for these diameters for bijective and general leaf labelings, and present efficient algorithms to compute the diameters and their corresponding leaf labelings. Moreover, we provide experimental evaluations demonstrating applications of diameters for the gene duplication problem.
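
To make the cost concrete, the sketch below counts duplications for one gene tree, one species tree and one leaf labeling. It uses the usual least-common-ancestor mapping, under which an internal gene-tree node counts as a duplication when it maps to the same species-tree node as one of its children. Representing trees as nested tuples with string leaves, and all function names, are assumptions of this example. It does not compute the diameters studied in the paper.

```python
def species_clusters(species_tree):
    """Return the leaf set (cluster) of every node of the species tree."""
    clusters = []

    def walk(node):
        if isinstance(node, str):
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(*(walk(child) for child in node))
        clusters.append(cluster)
        return cluster

    walk(species_tree)
    return clusters


def lca_cluster(species, clusters):
    """Smallest species-tree cluster containing all given species (the LCA)."""
    return min((c for c in clusters if species <= c), key=len)


def duplication_cost(gene_tree, species_tree, labeling):
    """Number of gene duplications needed to reconcile the two trees.

    labeling: dict mapping each gene-tree leaf name to a species name.
    """
    clusters = species_clusters(species_tree)
    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca_cluster(frozenset([labeling[node]]), clusters)
        child_maps = [walk(child) for child in node]
        here = lca_cluster(frozenset().union(*child_maps), clusters)
        # A node mapped to the same place as one of its children is a duplication.
        if any(here == child_map for child_map in child_maps):
            duplications += 1
        return here

    walk(gene_tree)
    return duplications
```

With the species tree `(("a", "b"), "c")`, a gene tree of the same shape costs 0, while the conflicting gene tree `(("a1", "c1"), "b1")` costs 1 because its root and its inner node both map to the species-tree root.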

Paweł Górecki, Jarosław Paszek, Oliver Eulenstein
Computational Identification of De-Centric Genetic Regulatory Relationships from Functional Genomic Data

We developed a new computational technique to identify de-centric genetic regulatory relationship candidates. Our technique takes advantage of functional genomics data for the same species under different perturbation conditions, making it complementary to current computational techniques including database search, clustering of gene expression profiles, motif matching, structural modeling, and network effect simulation methods. It is fast and addresses the need of biologists to determine activation/inhibition relationship details often missing in synthetic lethality or ChIP-seq experiments. We used the GEO microarray data set GSE25644 with 158 different mutant genes in S. cerevisiae. We screened out 83 targets with 610 activation pairs and 93 targets with 494 inhibition pairs. In the Yeast Fitness database, 33 targets (40%) with 126 activation pairs and 31 targets (33%) with 97 inhibition pairs were identified. To be identified further are 50 targets with 484 activation pairs and 62 targets with 397 inhibition pairs. The aggregation test confirmed that all discovered de-centric regulatory relationships are significantly different from random discovery at a p-value of 0.002; therefore, this method is highly complementary to others that tend to discover hub-related regulatory relationships. We also developed criteria for rejecting a gene x as a candidate regulator and for assessing the ranking of the regulator-target relationships identified. The top 10 most highly suspected regulators determined by our criteria were found to be significant, pending future experimental verification.

Zongliang Yue, Ping Wan, Zhan Xie, Jake Y. Chen
Classification of Mutations by Functional Impact Type: Gain of Function, Loss of Function, and Switch of Function

Genomic variations have been intensively studied since the development of high-throughput sequencing technologies. There are numerous tools and databases predicting and annotating the functional impact of genetic variants, such as determining whether a variant is neutral or deleterious to the functions of the corresponding protein. However, there is a need for methods that not only identify neutral or deleterious mutations but also provide fine grained prediction on the outcome resulting from mutations, such as gain, loss, or switch of function. This paper proposes the deployment of multiple hidden Markov models to computationally classify mutations by functional impact type.

Mingming Liu, Layne T. Watson, Liqing Zhang
Network Analysis of Human Disease Comorbidity Patterns Based on Large-Scale Data Mining

Disease comorbidity is an important aspect of phenotype associations and reflects overlapping pathogenesis between diseases. Existing comorbidity studies have usually focused on specific diseases and patient populations. In this study, we systematically mined and analyzed disease comorbidity patterns without restricting disease types and patient populations. We present a data mining approach and extract comorbidity patterns from a patient-disease database in the drug adverse event reporting system. The database contains records of 3,354,043 patients. We first demonstrated that the data are not severely biased towards specific patient populations and are valuable for comorbidity mining. Then we developed an automatic pipeline to process the data, and applied an association rule mining algorithm to mine comorbidity relationships among multiple diseases. Our approach extracted 8,576 comorbidity patterns for 613 diseases. We constructed a disease comorbidity network from these patterns and demonstrated that the comorbidity clusters reflect genetic associations between diseases. Unlike previous studies based on relative risk, which tend to identify comorbidities for rare diseases, our approach extracted many patterns for common diseases. We applied the approach to colorectal cancer, and found interesting relationships between colorectal cancer and metabolic disorders, which may lead to promising pathogenesis discoveries.
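
The abstract does not name the association rule mining algorithm or its thresholds. As a simplified illustration of the two quantities such algorithms rely on, support and confidence, the sketch below mines rules between pairs of diseases only, whereas the paper mines relationships among multiple diseases. The function name, the thresholds and the disease names in the usage note are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(records, min_support, min_confidence):
    """Mine rules 'lhs -> rhs' between pairs of diseases.

    records: list of per-patient collections of disease names.
    Returns sorted (lhs, rhs, support, confidence) tuples, where support is
    the fraction of patients with both diseases and confidence is the fraction
    of patients with lhs who also have rhs.
    """
    total = len(records)
    single_counts = Counter()
    pair_counts = Counter()
    for diseases in records:
        unique = set(diseases)  # count each disease once per patient
        single_counts.update(unique)
        pair_counts.update(
            frozenset(pair) for pair in combinations(sorted(unique), 2)
        )
    rules = []
    for pair, count in pair_counts.items():
        support = count / total
        if support < min_support:
            continue
        first, second = sorted(pair)
        for lhs, rhs in ((first, second), (second, first)):
            confidence = count / single_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)
```

Given four toy patient records, two with both diabetes and hypertension, one with diabetes only and one with asthma only, the pair has support 0.5. The rule hypertension -> diabetes has confidence 1.0, while diabetes -> hypertension has confidence 2/3, so a confidence threshold of 0.9 keeps only the first rule.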

Yang Chen, Rong Xu
Identification of Essential Proteins by Using Complexes and Interaction Network

Essential proteins are indispensable for maintaining cellular life. Identification of essential proteins can provide a basis for drug target design, disease treatment and minimal-genome design in synthetic biology. However, identifying essential proteins with experimental approaches is still time-consuming and expensive. With the development of high-throughput experimental techniques in the post-genome era, a large amount of PPI data and gene expression data can be obtained, which provides an unprecedented opportunity to study essential proteins at the network level. So far, many network topological methods have been proposed to identify essential proteins. In this paper, we propose a new method, United complex Centrality (UC), to identify essential proteins by integrating protein complex information and topological features of the PPI network. By analyzing the relationship between protein complexes and essential proteins, we find that proteins appearing in multiple complexes are more likely to be essential than those appearing in only a single complex. The experimental results show that protein complex information can help identify essential proteins more accurately. Our method UC clearly outperforms traditional centrality methods (DC, IC, EC, SC, BC, CC, NC) for identifying essential proteins. In addition, even compared with Harmonic Centricity, which also uses protein complex information, it still has a great advantage.

Min Li, Yu Lu, Zhibei Niu, Fang-Xiang Wu, Yi Pan
GenoScan: Genomic Scanner for Putative miRNA Precursors

The significance of miRNAs has been clarified over the last decade as thousands of these small non-coding RNAs have been found in a wide variety of species. By binding to specific target mRNAs, miRNAs act as negative regulators of gene expression in many different biological processes. Computational approaches for discovery of miRNAs in genomes usually take the form of an algorithm that scans sequences for miRNA-characteristic hairpins, followed by classification of those hairpins as miRNAs or non-miRNAs. In this study, two new approaches to genome-scale miRNA discovery are presented and evaluated. These methods, one ensemble-based and one using logistic regression, have been designed to detect miRNA candidates without relying on conservation or transcriptome data, and to achieve high-confidence predictions in reasonable computational time. GenoScan achieves high accuracy with a good balance between sensitivity and specificity. In a benchmark evaluation including 15 previously published methods, the regression-based approach in GenoScan achieved the highest classification accuracy.

Benjamin Ulfenborg, Karin Klinga-Levan, Björn Olsson
Searching SNP Combinations Related to Evolutionary Information of Human Populations on HapMap Data

The International HapMap Project is a partnership of scientists and funding agencies from different countries to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals. The project has collected large amounts of SNP (single-nucleotide polymorphism) data from individuals of different human populations. Many researchers have uncovered evolutionary information from the SNP data, but finding all the SNPs related to human evolution is still hard. In most cases, these SNPs act together, which leads to the differences between human populations. The number of SNP combinations is very large, so it is impossible to check all of them. In this paper, a novel algorithm is proposed to find the SNP combinatorial patterns whose frequencies are quite different between two populations. The numbers of multi-SNP combinations are regarded as the differences between each pair of human populations; a hierarchical clustering algorithm is then used to construct evolutionary trees for the human populations. The trees from 4 chromosomes are consistent and the result can be validated against other literature, which indicates that the evolutionary information is well mined. The multi-SNP combinations found by our method can be studied further in many respects.

Xiaojun Ding, Haihua Gu, Zhen Zhang, Min Li, Fangxiang Wu
2D Pharmacophore Query Generation

Using pharmacophores in virtual screening of large chemical compound libraries proved to be a valuable concept in computer-aided drug design. Traditionally, pharmacophore-based screening is performed in 3D space where crystallized or predicted structures of ligands are superposed and where pharmacophore features are identified and compiled into a 3D pharmacophore model. However, in many cases the structures of the ligands are not known which results in using a 2D pharmacophore model.

We introduce a method capable of automatically generating 2D pharmacophore models given previous knowledge about the biological target of interest. The knowledge comprises a set of molecules known to be active or inactive with respect to the target. From this set, 2D pharmacophore features are extracted using pharmacophore fingerprints. Then a statistical procedure is applied to identify features separating the active from the inactive molecules, and these features are used to build a pharmacophore model. Finally, a similarity measure utilizing the model is applied for virtual screening.

The method was tested on multiple state-of-the-art datasets and compared to several virtual screening methods. Our approach appears to outperform the existing methods in most cases. We believe that the presented methodology forms a valuable addition to the set of tools available for the early-stage drug discovery process.
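
The pipeline outlined in this abstract (extract features, select discriminating ones, score candidates) might look roughly like the sketch below. Everything specific here is an assumption, since the abstract does not describe the fingerprint, the statistical procedure, or the similarity measure: molecules are represented as sets of made-up feature identifiers, a simple frequency gap stands in for the statistical test, and `build_model` and `model_score` are hypothetical names.

```python
def build_model(actives, inactives, min_gap=0.5):
    """Keep features that are much more frequent among actives than inactives."""
    features = set().union(*actives, *inactives)

    def freq(feature, molecules):
        # Fraction of molecules that contain the feature.
        return sum(feature in m for m in molecules) / len(molecules)

    return {
        f for f in features
        if freq(f, actives) - freq(f, inactives) >= min_gap
    }


def model_score(molecule, model):
    """Fraction of model features present in the molecule (0.0 to 1.0)."""
    if not model:
        return 0.0
    return len(molecule & model) / len(model)


# Hypothetical feature identifiers; a real fingerprint would encode pairs or
# triplets of pharmacophore feature types with their topological distances.
actives = [
    {"donor-acceptor:3", "aromatic-donor:4", "hydrophobe-hydrophobe:2"},
    {"donor-acceptor:3", "aromatic-donor:4"},
    {"donor-acceptor:3", "aromatic-acceptor:5"},
]
inactives = [
    {"hydrophobe-hydrophobe:2", "aromatic-acceptor:5"},
    {"hydrophobe-hydrophobe:2"},
]

model = build_model(actives, inactives)
library = {
    "mol_a": {"donor-acceptor:3", "aromatic-donor:4"},
    "mol_b": {"hydrophobe-hydrophobe:2"},
}
# Rank library molecules by how well they match the model.
ranking = sorted(library, key=lambda name: model_score(library[name], model), reverse=True)
print(sorted(model), ranking)
```

The frequency gap is only a placeholder; any test that separates actives from inactives per feature could be substituted without changing the overall structure.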

David Hoksza, Petr Škoda
Structure-Based Analysis of Protein Binding Pockets Using Von Neumann Entropy

Protein binding sites are regions where interactions between a protein and a ligand take place. Identification of binding sites is a functional issue, especially in structure-based drug design. This paper presents a novel feature of protein binding pockets based on the complexity of the corresponding weighted Delaunay triangulation. The results demonstrate that candidate binding pockets have lower relative Von Neumann entropy, which means more random scattering of voids inside them.

Negin Forouzesh, Mohammad Reza Kazemi, Ali Mohades
A New Mathematical Model for Inbreeding Depression in Large Populations

It has been widely recognized that inbreeding results in increased homozygosity, which generally leads to decreased fitness of the population. This conclusion is supported by a large number of experimental observations in natural populations. However, a theoretical analysis of this phenomenon is still lacking. Here we present a theoretical proof showing that, for most natural populations, inbreeding does reduce the mean fitness of the population. It also suggests that inbreeding depression depends not only on the mating system but also on the structure of the population. As a consequence, we conclude that, for a natural inbreeding population without any inbreeding depression, most genotypes should be additive or co-dominant. This result explains why hermaphroditic populations do not show severe inbreeding depression. Another major result of this research is that, for a large inbreeding population with directional relative genotype fitnesses, the mean fitness increases monotonically for any value of the inbreeding coefficient. This result may explain the frequent occurrence of self-fertilizing populations. We also characterize pseudo-overdominance for a single locus, which suggests that there are many pseudo-overdominance populations among the class of over-dominance populations.
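
The link between additive genotype fitnesses and the absence of inbreeding depression can be illustrated with the textbook one-locus, two-allele model. This is an assumption made for illustration; the paper's own model may be more general, and the fitness values and function names below are invented.

```python
def genotype_frequencies(p, F):
    """Frequencies of AA, Aa, aa for allele frequency p and inbreeding coefficient F."""
    q = 1.0 - p
    return (p * p + F * p * q, 2.0 * p * q * (1.0 - F), q * q + F * p * q)


def mean_fitness(p, F, w_AA, w_Aa, w_aa):
    """Mean fitness as the frequency-weighted average of the genotype fitnesses."""
    f_AA, f_Aa, f_aa = genotype_frequencies(p, F)
    return f_AA * w_AA + f_Aa * w_Aa + f_aa * w_aa


p = 0.5
levels = (0.0, 0.5, 1.0)  # inbreeding coefficients to compare
# Additive fitnesses: the heterozygote lies exactly midway between the homozygotes.
additive = [mean_fitness(p, F, 1.0, 0.9, 0.8) for F in levels]
# Heterozygote fitter than the midpoint (here a fully recessive deleterious allele).
recessive = [mean_fitness(p, F, 1.0, 1.0, 0.8) for F in levels]
print(additive, recessive)
```

In this simple model, with the allele frequency held fixed, the derivative of mean fitness with respect to F is pq(w_AA + w_aa - 2 w_Aa). Mean fitness is therefore unaffected by inbreeding exactly when the heterozygote fitness is the midpoint of the two homozygote fitnesses, and it falls with F when the heterozygote is fitter than that midpoint.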

Shuhao Sun, Fima Klebaner, Tianhai Tian
dSpliceType: A Multivariate Model for Detecting Various Types of Differential Splicing Events Using RNA-Seq

Alternative splicing plays a key role in regulating gene expression. Dysregulated alternative splicing events have been linked to a number of human diseases. Recently, high-throughput RNA-Seq technology has provided unprecedented opportunities and holds strong promise for better characterizing and dissecting alternative splicing events on a whole-transcriptome scale. Therefore, efficient and effective computational methods and tools for detecting differentially spliced genes and events in human disease are urgently needed. We present a novel and efficient computational method, dSpliceType, to detect the five most common types of differential splicing events between two conditions using RNA-Seq. dSpliceType is among the first to utilize the sequential dependency of normalized base-wise read coverage signals and to capture biological variability among replicates using a multivariate statistical model. dSpliceType substantially reduces sequencing biases by taking the ratio of normalized RNA-Seq splicing indexes at each nucleotide between disease and control conditions. Our method employs a change-point analysis followed by a parametric statistical test using the Schwarz Information Criterion (SIC) on each candidate splicing event for differential splicing event detection. We evaluated and compared the performance of dSpliceType with two existing methods, MATS and Cuffdiff. The results demonstrate that dSpliceType is a fast, effective and accurate approach, which can detect various types of differential splicing events from a wide range of expressed genes, including genes with lower abundances. dSpliceType is freely available at http://orleans.cs.wayne.edu/dSpliceType/.
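
The idea of SIC-based change-point detection mentioned in this abstract can be sketched in a heavily simplified form. The sketch is not dSpliceType itself: it assumes a single univariate signal with Gaussian noise and at most one shift in the mean, whereas the abstract describes a multivariate model over replicates. The signal values and function names are invented.

```python
import math


def sic(residual_ss, n, n_params):
    """Schwarz Information Criterion for a Gaussian model, additive constants dropped."""
    # Guard against log(0) when a segment fits perfectly.
    return n * math.log(max(residual_ss, 1e-12) / n) + n_params * math.log(n)


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def detect_mean_change(signal):
    """Return the index where the mean shifts, or None if no change is supported."""
    n = len(signal)
    # Null model: one mean and one variance (2 parameters).
    best_k, best_sic = None, sic(sum_sq(signal), n, 2)
    # Alternative: two means and a shared variance (3 parameters), split before index k.
    for k in range(2, n - 1):
        rss = sum_sq(signal[:k]) + sum_sq(signal[k:])
        candidate = sic(rss, n, 3)
        if candidate < best_sic:
            best_k, best_sic = k, candidate
    return best_k


# Hypothetical per-nucleotide log-ratios of coverage between two conditions.
shifted = [0.1, -0.1, 0.0, 0.2, -0.2, 0.1, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
print(detect_mean_change(shifted))
```

The extra parameter of the two-segment model is penalized by log(n), so a split is reported only when it reduces the residual variance by more than that penalty.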

Nan Deng, Dongxiao Zhu
Conformational Transitions and Principal Geodesic Analysis on the Positive Semidefinite Matrix Manifold

Given an initial and final protein conformation, generating the intermediate conformations provides important insight into the protein’s dynamics. We represent a protein conformation by its Gram matrix, which is a point on the rank 3 positive semidefinite matrix manifold, and show that matrices along the geodesic linking an initial and final Gram matrix can be used to generate a feasible pathway for the protein’s structural change. This geodesic is based on a particular quotient geometry. If a protein is known to contain domains or groups of atoms that act as rigid clusters, facial reduction can be used to decrease the size of the Gram matrices before calculating the geodesic. The geodesic between two conformations is only one path a protein’s Gram matrix can follow; principal geodesic analysis (PGA) is one possible strategy to find other geodesics.

Xiao-Bo Li, Forbes J. Burkowski
Joint Analysis of Functional and Phylogenetic Composition for Human Microbiome Data

With the advance of high-throughput sequencing technology, it is possible to investigate many complex biological and ecological systems. The objective of the Human Microbiome Project (HMP) is to explore the microbial diversity in the human body and to provide experimental and computational standards for subsequent similar studies. The first-stage HMP generated a large amount of data for computational analysis and posed a challenge for the integration and interpretation of various microbiome data. In this paper, we introduce a data integration method, Laplacian-regularized Joint Non-negative Matrix Factorization (LJ-NMF), for jointly analyzing functional and phylogenetic profiles from HMP. The experimental results indicate that the proposed method offers an efficient framework for microbiome data analysis.

Xingpeng Jiang, Xiaohua Hu, Weiwei Xu
schematikon: Detailed Sequence-Structure Relationships from Mining a Non-redundant Protein Structure Database
(Extended Abstract)

If a "protein folding code" exists, it ought to give rise to detectable sequence propensities that are associated with low energy conformations, i.e. native structure. To the degree that the frequency of structure patterns in folded proteins has a Boltzmann-like behaviour, such conformations should be detectable by their excess occurrence over random. We have mined a database of non-homologous, well resolved protein structure domains – Nh3D – and have discovered an abundance of such sets of overrepresented structurally similar patterns. We designate the best representatives of a set a motif. Our motif dictionary schematikon shows significant and interesting sequence propensities and is predictive regarding the experimentally determined consequences of sequence change on stability.

Boris Steipe, Bhooma Thiruv
Backmatter
Metadata
Title
Bioinformatics Research and Applications
Edited by
Mitra Basu
Yi Pan
Jianxin Wang
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-08171-7
Print ISBN
978-3-319-08170-0
DOI
https://doi.org/10.1007/978-3-319-08171-7