
2018 | Book

Bioinformatics Research and Applications

14th International Symposium, ISBRA 2018, Beijing, China, June 8-11, 2018, Proceedings


About this book

This book constitutes the proceedings of the 14th International Symposium on Bioinformatics Research and Applications, ISBRA 2018, held in Beijing, China, in June 2018.

The 24 full and 10 short papers presented in this volume were carefully reviewed and selected from a total of 138 submissions. They were organized in topical sections named: network analysis and modelling; genomic data analysis; cancer data analysis; structure and interaction; HPC and CryoEM; machine and deep learning; data analysis and methodology; analysis and visualization tools; and RNA-Seq data analysis.

Table of Contents

Frontmatter
Correction to: A Pattern Recognition Tool for Medium-Resolution Cryo-EM Density Maps and Low-Resolution Cryo-ET Density Maps

The original version of this chapter contained an error in the fourth author’s name. The spelling of Julio Kovacs’s name was incorrect in the header of the paper. The author name was corrected.

Devin Haslam, Salim Sazzed, Willy Wriggers, Julio Kovacs, Junha Song, Manfred Auer, Jing He

Network Analysis and Modelling

Frontmatter
Prediction of Drug Response with a Topology Based Dual-Layer Network Model

Identifying the response of a cancer patient to a particular therapeutic agent is critical in drug discovery and will significantly facilitate the development of personalized medicine. Publicly available drug response profiles across cell lines provide an alternative way to predict the response to cancer drugs. In this work, we propose a topology based dual-layer network (TDLN) model to predict drug response based on large-scale cell line experiments. With the Cancer Cell Line Encyclopedia (CCLE), Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Therapeutic Response Portal (CTRP) datasets as benchmarks, our proposed model outperforms existing popular approaches and identifies novel indications of known drugs for cancer.

Suyun Huang, Xing-Ming Zhao
GRTR: Drug-Disease Association Prediction Based on Graph Regularized Transductive Regression on Heterogeneous Network

Computational drug repositioning helps to decipher the complex relations among drugs, targets, and diseases at a system level. However, most existing computational methods are biased towards drug-disease associations already verified by biological experiments, and it is difficult to achieve good performance when the known drug-disease associations are sparse. In this article, we present a graph regularized transductive regression method (GRTR) to predict novel drug-disease associations. The proposed method first constructs a heterogeneous graph consisting of three interlinked sub-graphs (drugs, diseases and targets) built from multiple sources, and uses a preliminary estimation of drug-related diseases to initialize the unknown associations of unlabeled drugs. Since the known drug-disease associations are sparse, graph regularized transductive regression is used to score and rank drug-disease associations iteratively. In computational experiments, the proposed method achieves better performance than competing methods in terms of AUC and AUPR, and a parameter study verifies the importance of the preliminary estimation in GRTR. Case studies on several selected drugs further confirm the practicality of our method in discovering potential indications for drugs.

Qiao Zhu, Jiawei Luo, Pingjian Ding, Qiu Xiao
An Improved Particle Swarm Optimization with Dynamic Scale-Free Network for Detecting Multi-omics Features

With the rapid development of high-throughput sequencing technology, large amounts of multi-omics data are being generated, providing new opportunities to understand the mechanisms of complex diseases. In this study, an improved particle swarm optimization with a dynamic scale-free network, named DSFPSO, is proposed for detecting multi-omics features. The highlights of DSFPSO are its scale-free network and its velocity updating strategies. The scale-free network serves as the population structure of DSFPSO and is adjusted dynamically during iteration. Three types of velocity updating strategies are used to fully account for the heterogeneity of particles and their neighbors. Both gene function analysis and pathway analysis on colorectal cancer (CRC) data show that DSFPSO can detect CRC-associated features effectively.

Huiyu Li, Sheng-Jun Li, Junliang Shang, Jin-Xing Liu, Chun-Hou Zheng
PBMarsNet: A Multivariate Adaptive Regression Splines Based Method to Reconstruct Gene Regulatory Networks

A Gene Regulatory Network (GRN) is a directed graph that describes the regulations between genes. The problem of reconstructing GRNs has been studied for decades; most existing methods infer GRNs from gene expression data, using random forests, partial least squares or other feature selection techniques. In this paper, we propose a Multivariate Adaptive Regression Splines (MARS) based method to estimate feature importance and reconstruct GRNs. MARS can capture the nonlinear relationships between genes. To avoid overfitting and make the estimation robust, we apply an ensemble of MARS models based on bootstrapping, with features weighted by part mutual information (PMI), called PBMarsNet. The results show that PBMarsNet outperforms state-of-the-art methods.

Siyu Zhao, Ruiqing Zheng, Xiang Chen, Yaohang Li, Fang-Xiang Wu, Min Li

Genomic Data Analysis

Frontmatter
Bounds on Identification of Genome Evolution Pacemakers

Several works have pointed out that the tight correlation between genes' evolutionary rates is better explained by a model denoted the Universal Pacemaker (UPM) than by simple rate constancy as manifested in the classical Molecular Clock (MC) hypothesis. Under the UPM, the relative evolutionary rates of all genes remain nearly constant, whereas the absolute rates can change arbitrarily according to the pacemaker ticks. This evolutionary framework was recently adapted to model epigenetic aging, where methylated sites are the analogs of evolving genes. A consequent question is determining the number of such pacemakers and which gene adheres to which pacemaker. This, however, turns out to be a nontrivial task, affected by the number of variables, their random noise, and the amount of available information. To this end, a clustering heuristic was devised that exploits the correlation between corresponding edge lengths across thousands of gene trees. Nevertheless, no theoretical study linking the relevant parameters has been done. We study this question here by providing theoretical bounds, expressed in the system parameters, on the probabilities of positive and negative results. We corroborate these results with a simulation study that reveals the critical role of the variances.

Sagi Snir
REXTAL: Regional Extension of Assemblies Using Linked-Reads

It is currently impossible to get complete de novo assembly of segmentally duplicated genome regions using genome-wide short-read datasets. Here, we devise a new computational method called Regional Extension of Assemblies Using Linked-Reads (REXTAL) for improved region-specific assembly of segmental duplication-containing DNA, leveraging genomic short-read datasets generated from large DNA molecules partitioned and barcoded using the Gel Bead in Emulsion (GEM) microfluidic method [1]. We show that using REXTAL, it is possible to extend assembly of single-copy diploid DNA into adjacent, otherwise inaccessible subtelomere segmental duplication regions and other subtelomeric gap regions. Moreover, REXTAL is computationally more efficient for the directed assembly of such regions from multiple genomes (e.g., for the comparison of structural variation) than genome-wide assembly approaches.

Tunazzina Islam, Desh Ranjan, Eleanor Young, Ming Xiao, Mohammad Zubair, Harold Riethman
A Scalable Reference-Free Metagenomic Binning Pipeline

Metagenomics studies the microbial genomes in an ecosystem, such as the gastrointestinal tract of a human, by sequencing thousands of organisms in parallel. The sheer number of genomic fragments is challenging for current metagenomic binning software to process. Here we present a scalable reference-free metagenomic binning pipeline designed to handle large-scale metagenomic data. It allows users to input several terabases of reads and produces highly accurate binning results, even at the species level. The pipeline outputs all binned species in multiple metagenomic samples together with their estimated relative abundances. We integrate the pipeline into an open-source software package, MetaMat, which is freely available at: https://github.com/BioAlgs/MetaMat.

Terry Ma, Xin Xing

Cancer Data Analysis

Frontmatter
The Review of the Major Entropy Methods and Applications in Biomedical Signal Research

Since biomedical signals are high-dimensional data sets with substantial noise, results produced by classical signal processing methods are affected by noise and interference. Entropy, as a measure of disorder or uncertainty in data, has been widely applied in signal processing research. This review introduces the application of entropy to the analysis of biomedical signals and discusses the advantages and shortcomings of various entropies. In particular, the use of the entropy concept in cancer research is highlighted.

Guangdi Liu, Yuan Xia, Chuanwei Yang, Le Zhang
Inferring Dysregulated Pathways of Driving Cancer Subtypes Through Multi-omics Integration

The rapid accumulation of multi-omics cancer data has created opportunities for biological discovery and biomedical applications. In this study, we propose an approach that integrates multi-omics data to identify dysregulated pathways driving cancer subtypes, simultaneously considering DNA methylation, DNA copy number, somatic mutation and gene expression profiles. Applying it to Breast Invasive Carcinoma (BRCA) in TCGA, we identify the distinct top 30 dysregulated pathways for each breast cancer subtype. The results suggest that the dysregulated pathways of different subtypes display both common and subtype-specific patterns. Furthermore, 44 differentially expressed genes with corresponding genetic and epigenetic dysregulation are retrieved from the subtype-specific pathways. Literature validation and functional enrichment analysis indicate that these genes are functionally associated with BRCA. Our method provides new insight into identifying the drivers of cancer subtypes through multi-omics data integration.

Kai Shi, Lin Gao, Bingbo Wang
An Extension of Deep Pathway Analysis: A Pathway Route Analysis Framework Incorporating Multi-dimensional Cancer Genomics Data

Recent breakthroughs in cancer research have come from the up-and-coming field of pathway analysis. By applying statistical methods to previously known gene and protein regulatory information, pathway analysis provides a meaningful way to interpret genomic data. In this paper we propose a systematic methodological framework for studying biological pathways, one that cross-analyzes mutation information, transcriptome and proteomics data. Each pathway route is encoded as a Bayesian network, initialized with a sequence of conditional probabilities specifically designed to encode the directionality of the regulatory relationships defined by the pathways. Proteomic regulation, such as phosphorylation, is modeled by a dynamically generated Bayesian network that combines the relevant type of proteomics data with the regulated target. The entire pipeline is automated in R. The effectiveness of our model is demonstrated by its ability to distinguish real pathways from decoy pathways on TCGA mRNA-seq, mutation, copy number variation and phosphorylation data in both breast cancer and ovarian cancer studies.

Yue Zhao
Hierarchical Similarity Network Fusion for Discovering Cancer Subtypes

Recent breakthroughs in biological sequencing technologies have cost-effectively yielded diverse types of observations. Integrative analysis of multi-platform cancer data, which can reveal the intrinsic characteristics of a biological process, has become an attractive route to cancer subtype discovery. Most machine learning based methods need to represent each input data type in a unified space, losing important features or introducing noise for some data types. Furthermore, many network based data integration methods treat each data type independently, leading to inconsistent conclusions. Similarity network fusion (SNF) was developed to address these issues; however, the Euclidean distance metric employed in SNF suffers from the curse of dimensionality and can thus give poor results. To this end, we propose a new integrative method, dubbed hierarchical similarity network fusion (HSNF), to learn a fused, discriminating patient similarity network. HSNF randomly samples sub-features from the different input data to construct multiple input similarity matrices as the basis of fusion, so that diverse similarity matrices are generated by repeated random sampling. We then design a hierarchical fusion framework to make full use of the complementarity of the diverse similarity networks from different feature modalities. Finally, spectral clustering is applied to the fused similarity matrix to discover cancer subtypes. Experimental results on five public cancer datasets show that HSNF can discover significantly different subtypes and consistently outperforms the state of the art in terms of silhouette score and the p-value of survival analysis.

Shuhui Liu, Xuequn Shang

Structure and Interaction

Frontmatter
Sprites2: Detection of Deletions Based on an Accurate Alignment Strategy

Since humans are diploid organisms, homozygous and heterozygous deletions are ubiquitous in the human genome, and distinguishing them is an important issue for current structural variation detection tools. Additionally, due to sequencing errors, micro-homologies and micro-insertions, breakpoint locations identified with common alignment tools, which use a greedy strategy, may not be the true deletion locations, and this often leads to false structural variation calls. In this paper, we propose a deletion detection method called Sprites2. Compared with Sprites, Sprites2 adds the following novel function modules: (1) Sprites2 takes advantage of the variance of the insert size distribution to determine the type of deletion, which enhances the accuracy of deletion calls; (2) Sprites2 uses a novel alignment strategy based on AGE (an algorithm that aligns the 5' and 3' ends of two sequences simultaneously) to locate breakpoints, which addresses the problems introduced by sequencing errors, micro-homologies and micro-insertions. To test the performance of Sprites2, simulated and real datasets are used in our experiments, and some popular structural variation detection tools are compared with Sprites2. The experimental results show that Sprites2 improves deletion detection performance. Sprites2 is publicly available at https://github.com/zhangzhen/sprites2.

Zhen Zhang, Jianxin Wang, Junwei Luo, Juan Shang, Min Li, Fang-Xiang Wu, Yi Pan
KSIBW: Predicting Kinase-Substrate Interactions Based on Bi-random Walk

Protein phosphorylation is an important chemical modification that regulates many cellular processes. In recent years, many algorithms for predicting kinase-substrate interactions have been proposed; however, most of them focus mainly on protein sequence information. In this paper, we propose a computational framework, KSIBW, to predict kinase-substrate interactions based on a bi-random walk. Unlike traditional methods, protein-protein interaction (PPI) information is used to measure kinase-kinase and substrate-substrate similarities, respectively. The bi-random walk is then employed to identify potential kinase-substrate interactions. The experimental results show that our method outperforms other state-of-the-art algorithms.

Canshang Deng, Qingfeng Chen, Zhixian Liu, Ruiqing Zheng, Jin Liu, Jianxin Wang, Wei Lan
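For intuition, the general bi-random-walk idea the abstract describes can be sketched in a few lines: prior kinase-substrate scores are diffused alternately over a kinase similarity network and a substrate similarity network. This is a minimal illustrative sketch, not the authors' KSIBW code; the toy matrices, the restart weight `alpha`, and the averaging of the two walk directions are all assumptions.

```python
def normalize(S):
    """Row-normalize a similarity matrix into transition probabilities."""
    out = []
    for row in S:
        s = sum(row) or 1.0
        out.append([v / s for v in row])
    return out

def matmul(A, B):
    """Multiply two matrices given as lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def bi_random_walk(K, S, F0, alpha=0.8, iters=30):
    """Diffuse prior kinase-substrate scores F0 alternately over the
    kinase network K (left walk) and the substrate network S (right
    walk), blending each step with the prior associations (restart)."""
    K, S = normalize(K), normalize(S)
    F = [row[:] for row in F0]
    for _ in range(iters):
        left = matmul(K, F)    # step over kinase-kinase similarities
        right = matmul(F, S)   # step over substrate-substrate similarities
        F = [[alpha * (left[i][j] + right[i][j]) / 2
              + (1 - alpha) * F0[i][j]
              for j in range(len(F0[0]))]
             for i in range(len(F0))]
    return F
```

On a toy network of two kinases and three substrates, an unobserved pair whose kinase is similar to a known binder receives a positive score; such scores are the ranking signal a method of this family exploits.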
XPredRBR: Accurate and Fast Prediction of RNA-Binding Residues in Proteins Using eXtreme Gradient Boosting

A variety of studies have shown that protein-RNA interactions play a vital role in many fundamental cellular processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. Identification of RNA-binding residues (RBRs) in proteins is a key step towards understanding the mutual recognition mechanism underlying protein-RNA interactions. In this paper, we propose a novel method, XPredRBR, to predict RNA-binding residues in proteins using the eXtreme Gradient Boosting (XGBoost) algorithm. Two types of new predictive features, derived from residue interaction networks and solvent exposure, are combined with conventional sequence features and structural neighborhood features to predict RBRs. We carried out empirical experiments on two datasets to evaluate the proposed method. In 10-fold cross-validation, our method achieved an accuracy of 0.861, sensitivity of 0.872, MCC of 0.584 and AUC of 0.941 on the RBP170 dataset. On an independent test set, RBP101, XPredRBR outperformed three traditional classifiers and seven existing RNA-binding residue methods. A case study on chain E of the 3PLA protein illustrates that XPredRBR effectively identifies most RNA-binding and non-RNA-binding sites. Furthermore, XPredRBR is much faster than our previous method PredRBR. These results show that our proposed method achieves state-of-the-art performance in predicting RNA-binding residues in proteins.

Lei Deng, Zuojin Dong, Hui Liu
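The metrics quoted in abstracts like the one above (accuracy, sensitivity, MCC) are simple functions of the binary confusion matrix. As a reference, here are the standard definitions in pure Python; this is generic illustration with hypothetical label vectors, not code from the paper.

```python
import math

def confusion(y_true, y_pred):
    """Confusion-matrix counts for binary labels (1 = RNA-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity(tp, tn, fp, fn):
    """Fraction of true binding residues recovered (recall)."""
    return tp / (tp + fn) if tp + fn else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient: stays informative even when
    binding and non-binding residues are heavily imbalanced."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC is reported alongside accuracy precisely because RNA-binding residues are a minority class, where accuracy alone can look deceptively high.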
A Biologically Meaningful Extension of the Efficient Method for Deleterious Mutations Prediction in RNAs: Insertions and Deletions in Addition to Substitution Mutations

The RNAmute and MultiRNAmute interactive Java programs were developed to predict deleterious point mutations in RNA sequences: mutations that cause a global conformational rearrangement of the secondary structure of a functional RNA molecule and thereby disrupt its function. RNAmute was designed to handle only single point mutations in a brute-force manner, while the MultiRNAmute tool uses an efficient approach to deal with multiple point mutations. The approach used in MultiRNAmute is based on stabilization of suboptimal RNA folding solutions and/or destabilization of the optimal MFE structure of the wild-type RNA molecule. Both programs utilize the Vienna RNA package to find the optimal and (in the case of MultiRNAmute) suboptimal RNA secondary structures. The main limitation of both programs is that they can predict only substitution mutations; they were not designed to handle deletion or insertion mutations. Here we develop an efficient approach, based on suboptimal folding solutions, to predict multiple point mutations consisting of deletions, insertions and substitutions. All RNAmute algorithms were validated on the TPP riboswitch and other functional RNAs.

Alexander Churkin, Danny Barash
Screening of Sonic Hedgehog (Shh) Inhibitors in the Hedgehog Signaling Pathway from Traditional Chinese Medicine (TCM) Database Through Structure-Based Pharmacophore Design

Colorectal cancer (CRC) is the second leading cause of cancer death in the world. The development of CRC is closely related to the hedgehog signaling pathway: abnormal activation of the pathway initiates the binding of Sonic Hedgehog (Shh) to Ptch1, which can trigger the growth of the abnormal cells that cause CRC. Traditional Chinese Medicine (TCM) compounds were used to inhibit Shh through structure-based pharmacophore design. This research began by deriving the pharmacophore features of Shh and continued with docking simulations of Shh against TCM compounds. The best three TCM compounds, TCM-8941, TCM-28794 and TCM-32808, show the best ligand interactions and the lowest Gibbs free energies of binding. They also have good pharmacological properties, as analyzed using Toxtree v2.6.13, SwissADME and admetSAR. In further research, these TCM compounds may be used as drug candidates for colorectal cancer.

Ilmi Fadhilah Rizki, Mochammad Arfin Fardiansyah Nasution, Syafrida Siregar, Mega Maulina Ekawati, Usman Sumo Friend Tambunan
Novel Inhibitors of T315I Mutant BCR-ABL1 Tyrosine Kinase for Chronic Myeloid Leukemia Disease Through Fragment-Based Drug Design

The hallmark genetic abnormality of CML is the Philadelphia chromosome, which arises from a recombination of two genes: the cellular ABL gene on chromosome 9 and the BCR gene on chromosome 22. The Philadelphia chromosomal translocation is responsible for the BCR-ABL fusion, and the ABL and BCR proteins play a central role in the pathogenesis of CML. The malignant transformation by BCR-ABL is critically dependent on its protein tyrosine kinase activity, which makes the ABL kinase an attractive target for therapeutic intervention. In this research, about 653,214 lead-like compounds were obtained from the MOE database. The compounds were screened using DataWarrior v4.6.1 and docked to predict their binding affinity to the BCR-ABL1 tyrosine kinase protein using MOE 2014.09 software. Fragment-based drug design was applied to find new drug candidates, and five new compounds were generated by this method. The compound LUT-1 has the highest potential due to its low ΔG binding score, acceptable RMSD, and ADME-Tox results.

Satya Anindita, Atika Marnolia, Hersal Hermana Putra, Muhammad Chandra Haikal, Usman Sumo Friend Tambunan

HPC and CryoEM

Frontmatter
On k-Mismatch Shortest Unique Substring Queries Using GPU

k-mismatch shortest unique substring (SUS) queries have been proposed and studied very recently due to their useful applications in computational biology. The k-mismatch SUS query at a given position of a string asks for a shortest substring that covers the position and has no duplicate (within a Hamming distance of k) elsewhere in the string. The challenge is to collectively find the SUS for every position of a massively long string in a time- and space-efficient manner. All known efforts have focused on improving and optimizing the time and space efficiency of SUS computation in the sequential CPU model. In this work, we propose the first parallel approach for k-mismatch SUS queries, leveraging the massively multi-threaded architecture of graphics processing units (GPUs). An experimental study performed on a mid-range GPU using real-world biological data shows that our proposal is consistently faster than the fastest CPU solution by a factor of at least 6 for exact SUS queries ($$k=0$$) and at least 23 for approximate SUS queries over DNA sequences ($$k>0$$), while maintaining nearly the same peak memory usage as the most memory-efficient sequential CPU proposal. Our work provides practitioners with a faster tool for SUS finding on massively long strings, and indeed provides the first practical tool for approximate SUS computation, because the quadratic time cost of the state-of-the-art sequential CPU method for approximate SUS queries does not scale even to modestly long strings.

Daniel W. Schultz, Bojian Xu
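To make the problem definition concrete, a single k-mismatch SUS query can be answered by brute force directly from the definition. This is an illustration of the query semantics only (cubic-ish time); the paper's contribution is answering the query for *all* positions of a massive string efficiently on a GPU.

```python
def hamming(a, b):
    """Number of mismatching characters between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def k_mismatch_sus(s, p, k=0):
    """Shortest substring of s covering position p that has no
    occurrence (within Hamming distance k) starting elsewhere in s.
    Brute force over all covering substrings, shortest kept."""
    n = len(s)
    best = None
    for i in range(p + 1):
        for j in range(p, n):               # s[i:j+1] covers position p
            if best is not None and j - i + 1 >= len(best):
                continue                    # cannot beat current best
            sub = s[i:j + 1]
            m = len(sub)
            dup = any(hamming(sub, s[i2:i2 + m]) <= k
                      for i2 in range(n - m + 1) if i2 != i)
            if not dup:
                best = sub
    return best
```

For example, in "abcbc" the single character at position 1 ("b") repeats, so its exact SUS must grow to "ab"; allowing k = 1 mismatches makes short substrings much easier to duplicate, so the SUS grows further, which is why the approximate variant is the computationally harder one.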
Memory-Efficient and Stabilizing Management System and Parallel Methods for RELION Using CUDA and MPI

In cryo-electron microscopy, RELION has proven to be a powerful tool for high-resolution reconstruction and has quickly gained popularity. However, because the data processed in cryo-EM are large and the RELION algorithm is computation-intensive, the refinement procedure of RELION is quite time-consuming and memory-demanding; these two problems have become major bottlenecks for its usage. Even with existing efforts to parallelize RELION, global GPU memory may still not meet its requirements. Moreover, since there is currently no automatic memory management system on the GPU (Graphics Processing Unit), memory fragmentation increases with each iteration and can eventually crash the program. In our work, we designed a memory-efficient and stabilizing management system to guarantee the robustness of the program and the efficiency of GPU memory usage. To reduce memory usage, we developed a novel data structure for RELION 2.0, and we proposed a parallel weight calculation algorithm to speed up the computation. Experiments show that the memory system avoids memory fragmentation and that we achieve a better speedup ratio than RELION 2.0.

Jingrong Zhang, Zihao Wang, Yu Chen, Zhiyong Liu, Fa Zhang
GPU Accelerated Ray Tracing for the Beta-Barrel Detection from Three-Dimensional Cryo-EM Maps

Cryo-electron microscopy is a technique capable of producing high-quality three-dimensional density maps of proteins. The identification of secondary structures within these proteins is important for understanding a protein's overall structure and function. One of the more commonly found secondary structures is the $$\beta $$ barrel. In previous papers, we presented a novel approach utilizing a genetic algorithm and ray tracing to identify and isolate $$\beta $$ barrels from density maps. However, one key limitation of that approach was the computational cost of the ray tracing portion. In this paper, we applied parallel processing on graphics processing units (GPUs) to increase the performance of the ray tracing. We tested this method on both experimental and simulated cryo-EM density maps. The results suggest that we succeeded in significantly speeding up our method using parallelization and graphics processing units.

Albert Ng, Adedayo Odesile, Dong Si
A Fast Genome Sequence Aligner Based on Minimal Perfect Hash Algorithm Realized with FPGA Based Heterogeneous Computing Platform

A fast genome sequence aligner is proposed in this paper. The alignment algorithm is based on a minimal perfect hash, reducing the memory footprint and improving memory access efficiency. Several strategies and techniques are adopted to improve the speed and accuracy of the aligner. Realized on a field programmable gate array (FPGA) based heterogeneous computing platform, the aligner achieves accuracy similar to BWA-MEM while running around 10 times faster.

Ke Huang, Shubo Yang, Zhaojian Luo, Ke Yang, Menghan Chen, Guopeng Wei, Jian Huang
A Pattern Recognition Tool for Medium-Resolution Cryo-EM Density Maps and Low-Resolution Cryo-ET Density Maps

Cryo-electron microscopy (cryo-EM) and cryo-electron tomography (cryo-ET) produce 3-D density maps of biological molecules at a range of resolution levels. Pattern recognition tools are important for distinguishing biological components in volumetric maps at the available resolutions. One of the most distinctive characteristics of density maps at medium (5–10 Å) resolution is the visibility of protein secondary structures. Although computational methods have been developed, the accurate detection of helices and β-strands from cryo-EM density maps is still an active research area. We have developed a tool for protein secondary structure detection and evaluation in medium-resolution 3-D cryo-EM density maps that combines three computational methods (SSETracer, StrandTwister, and AxisComparison). The program was integrated into UCSF Chimera, a popular visualization package in the cryo-EM community. In related work, we have developed BundleTrac, a computational method to trace the filaments of a bundle in lower-resolution cryo-ET density maps. It has been applied to actin filament tracing in stereocilia with good accuracy and can potentially be added as a tool in Chimera.

Devin Haslam, Salim Sazzed, Willy Wriggers, Julio Kovacs, Junha Song, Manfred Auer, Jing He

Machine and Deep Learning

Frontmatter
Combining Sequence and Epigenomic Data to Predict Transcription Factor Binding Sites Using Deep Learning

Knowing the transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and follow-up cellular functions. Convolutional neural networks (CNNs) have outperformed previous methods in predicting TFBSs from the primary DNA sequence. In addition to DNA sequence, histone modifications and chromatin accessibility are important factors influencing transcription factor binding, and they have recently been explored for TFBS prediction. However, current methods rarely take histone modifications and chromatin accessibility into account within an integrative CNN framework. To this end, we developed a general CNN model that integrates these data to predict TFBSs. We systematically benchmarked a series of architecture variants by changing the network structure in width and depth, and explored the effect of the length of the flanking regions included in each sample. We evaluated the performance of the three types of data and their combinations on 256 ChIP-seq experiments and also compared our model with competing machine learning methods. We find that the contributions of the three data types are complementary to each other, and that the integrative CNN framework yields significant improvements over traditional machine learning methods.

Fang Jing, Shao-Wu Zhang, Zhen Cao, Shihua Zhang
A Deep Learning Method for Prediction of Benign Epilepsy with Centrotemporal Spikes

Benign epilepsy with centrotemporal spikes (BECT) is the most common epilepsy in children. Research on BECT has mainly focused on comparative analyses of BECT patients and healthy controls. Unlike existing methods, our proposed 3D convolutional neural network (3DCNN) predicts BECT directly from raw magnetic resonance imaging (MRI). Experiments show that our 3DCNN model achieves an accuracy of $$89.80\%$$ in five-fold cross-validation, outperforming the benchmark method by a large margin.

Ming Yan, Ling Liu, Sihan Chen, Yi Pan
LSTM Recurrent Neural Networks for Influenza Trends Prediction

Influenza-like illness (ILI) is an acute respiratory infection that causes substantial mortality and morbidity. Predicting influenza trends and responding to outbreaks rapidly is crucial to reducing the loss of life. In this paper, we employ long short-term memory (LSTM) recurrent neural networks to forecast influenza trends. We are the first to use multiple novel data sources, including virologic surveillance, influenza geographic spread, Google Trends, climate and air pollution, to predict influenza trends. Moreover, we find that several environmental and climatic factors are significantly correlated with the ILI rate.

Liyuan Liu, Meng Han, Yiyun Zhou, Yan Wang
Predicting Gene-Disease Associations with Manifold Learning

In this study, we propose a manifold learning-based method for predicting disease genes by assuming that a disease and its associated genes should be consistent in some lower-dimensional manifold. Ten-fold cross-validation experiments show that the area under the receiver operating characteristic (ROC) curve (AUC) achieved by our approach is 0.7452 on the high-quality gene-disease associations in the OMIM dataset, which is greater than that of the competing method PBCF (0.5700). Nine of the top 10 predicted gene-disease associations can be supported by existing literature, which is better than the result of PBCF (6 of the top 10). All these results illustrate that our method outperforms the competing method.

Ping Luo, Li-Ping Tian, Bolin Chen, Qianghua Xiao, Fang-Xiang Wu

Data Analysis and Methodology

Frontmatter
On Approaching the One-Sided Exemplar Adjacency Number Problem

Given one generic linear genome $$\mathcal{G}$$ with gene duplications (over n gene families), an exemplar genome G is a permutation obtained from $$\mathcal{G}$$ by deleting duplicated genes such that G contains exactly one gene from each gene family (i.e., G is a permutation of length n). If we relax the constraint such that $$G^{+}$$ is obtained in the same way but has length at least k, then we call $$G^{+}$$ a pseudo-exemplar genome. Given $$\mathcal{G}$$ and one exemplar genome H over the same set of n gene families, the One-sided Exemplar Adjacency Number problem (One-sided EAN) is defined as follows: delete duplicated genes from the genome $$\mathcal{G}$$ to obtain an exemplar genome G of length n, such that the number of adjacencies between G and H is maximized. It is known that the problem is NP-hard; in fact, almost as hard to approximate as Independent Set, even when each gene (from the same gene family) appears at most twice in the generic genome $$\mathcal{G}$$. To overcome the constraint on the length of G, we define a slightly more general problem (One-sided EAN+) where we only need to obtain a pseudo-exemplar genome $$G^{+}$$ from $$\mathcal{G}$$ (by deleting duplicated genes) such that the number of adjacencies in H and $$G^{+}$$ is maximized. While One-sided EAN+ contains One-sided EAN as a special case, it does give us some flexibility in designing an algorithm. Firstly, we reformulate and relax the One-sided EAN+ problem as the maximum independent set (MIS) on a colored interval graph and hence reduce the appearance of each gene to at most two times. We show that this new relaxation is still NP-complete, though a simple factor-2 approximation algorithm can be designed; moreover, we also prove that the problem cannot be approximated within $$2-\varepsilon $$ by a local search technique. Secondly, we use integer linear programming (ILP) to solve this relaxed problem exactly. 
Finally, we compare our results with the state-of-the-art software GREDU on various simulated datasets. It turns out that our algorithm is more stable and can process genomes of length up to 12,000 (while GREDU sometimes falters on such large datasets).
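The objective being maximized, the adjacency number between two genomes, can be sketched for the unsigned, duplicate-free case (the problem above also handles duplicated genes, which this sketch omits):

```python
def adjacency_number(G, H):
    """Count adjacencies between two permutations: pairs of genes
    consecutive in G that are also consecutive (in either order) in H."""
    h_pairs = {frozenset(p) for p in zip(H, H[1:])}
    return sum(frozenset(p) in h_pairs for p in zip(G, G[1:]))

G = [1, 2, 3, 4, 5]
H = [3, 2, 1, 5, 4]
# shared adjacencies: {1,2}, {2,3}, {4,5}
```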

Letu Qingge, Killian Smith, Sean Jungst, Binhai Zhu
Prediction of Type III Secreted Effectors Based on Word Embeddings for Protein Sequences

The type III secreted effectors (T3SEs) are virulence proteins that play an important role in the pathogenesis of Gram-negative bacteria. They are injected into the host cells by the pathogens, interfere with the immune system of the host cells, and help the growth and reproduction of the pathogens. Identifying T3SEs is a very challenging task because of the high diversity of their sequences and the lack of defined secretion signals. Moreover, their working mechanisms have not yet been fully understood. To speed up the recognition of T3SEs and the study of type III secretion systems, computational tools for predicting T3SEs are in great demand. In this study, we regard protein sequences as a special language. Inspired by the word2vec model in natural language processing, we convert the sequences into word embedding vectors in a similar manner, using a specific segmentation strategy for protein sequences. We then construct a T3SE predictor based on this new sequence feature representation. We conduct experiments on both mono-species and multi-species data. The experimental results show that the new feature representation model has competitive performance and can work together with traditional features to enhance the identification of T3SEs.
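Before a word2vec-style model can be trained on protein sequences, each sequence must be segmented into "words". Overlapping k-mers are one common choice; the paper uses its own segmentation strategy, so this k-mer sketch is only a stand-in:

```python
def segment_protein(seq, k=3):
    """Split a protein sequence into overlapping k-mer 'words', a common
    segmentation for training word2vec-style embeddings on sequences."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

words = segment_protein("MKTAYIA")
# ['MKT', 'KTA', 'TAY', 'AYI', 'YIA']
```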

Xiaofeng Fu, Yiqun Xiao, Yang Yang
Extending the Evolvability Model to the Prokaryotic World: Simulations and Results on Real Data

In 2006, Valiant introduced a variation of his celebrated PAC model to biology, by which he wished to explain how such complex life mechanisms could have evolved in so short a time through two simple mechanisms: random variation and natural selection. Soon after, several works extended and specialized his model to more specific processes. To the best of our knowledge, there is no such extension to the prokaryotic world, in which gene sharing is the prevailing mode of evolution. Here we extend the evolvability framework to accommodate horizontal gene transfer (HGT), the transfer of genetic material between unrelated organisms. While in a separate work we focused on the theoretical aspects of this extension and its learnability power, here the focus is on the more practical and biological facets of the new model. Specifically, we focus on the evolutionary process of developing a trait and model it as the conjunction function. We demonstrate the speedup in learning time for a variant of conjunction for which learning algorithms are known. We also confront the new model with the recombination model on real data of E. coli strains under the task of developing pathogenicity, and obtain results that adhere to current knowledge. Apart from the sheer extension to the understudied prokaryotic world, our work offers comparisons of three different models of evolution under the same conditions, which we believe is unique and of separate interest. The code of the simulations is freely available at: https://github.com/byohay/LearningModels.git.
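The conjunction function the authors model has a classic learning algorithm, Valiant-style elimination, which the HGT extension builds on. A minimal sketch for monotone conjunctions (independent of the paper's own simulation code):

```python
def learn_conjunction(positive_examples):
    """Elimination algorithm: start with all variables and keep only those
    set to 1 in every positive example; the survivors form the hypothesis."""
    n = len(positive_examples[0])
    hypothesis = set(range(n))
    for x in positive_examples:
        hypothesis = {i for i in hypothesis if x[i] == 1}
    return hypothesis

# hypothetical target conjunction x0 AND x2: every positive has bits 0 and 2 set
pos = [(1, 1, 1, 0), (1, 0, 1, 1), (1, 1, 1, 1)]
learned = learn_conjunction(pos)
```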

Sagi Snir, Ben Yohay
Predicting Opioid Epidemic by Using Twitter Data

The opioid crisis was declared a public health emergency in 2017 by the President of the USA. According to the Centers for Disease Control and Prevention, more than 91 Americans die every day from an opioid overdose. Nearly $4B is provided in the 2018 spending bill to address the opioid epidemic and help fulfill the President’s Opioid Initiative. How can we monitor and predict the opioid epidemic accurately and in real time? Traditional methods mainly use hospital data and usually have a lag of several years. Even though they are accurate, the long lag prevents us from monitoring and predicting the epidemic in real time. We observe that people discuss topics related to the epidemic extensively on social media platforms, and user behavior data collected from these platforms can potentially help us monitor and predict the epidemic in real time. In this paper, we study how to use Twitter to monitor the epidemic. We collect historic tweets containing a set of keywords related to the epidemic and count the frequency of tweets posted in each month and each state. We compare these frequency values with the real-world death rates for each month and state, and identify a high correlation between tweet frequencies and death rates. The statistical significance demonstrates that Twitter data can be used to predict the death rate and the future course of the epidemic.
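The correlation test at the core of this approach can be sketched with a plain Pearson coefficient over hypothetical monthly values (the paper's analysis is per-month and per-state over real data):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

tweets = [120, 150, 180, 210]      # hypothetical monthly tweet counts
deaths = [10.0, 12.5, 15.0, 17.5]  # hypothetical monthly death rates
r = pearson(tweets, deaths)
```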

Yubao Wu, Pavel Skums, Alex Zelikovsky, David Campo Rendon, Xueting Liao

Analysis and Visualization Tools

Frontmatter
Cluster Matching Distance for Rooted Phylogenetic Trees

Phylogenetic trees are fundamental to biology and benefit several other research areas. Various methods have been developed for inferring such trees, and comparing them is an important problem in computational phylogenetics. Addressing this problem requires tree measures, but all of them suffer from problems that can severely limit their applicability in practice. This also holds true for one of the oldest and most widely used tree measures, the Robinson-Foulds distance. While this measure satisfies the properties of a metric and is efficiently computable, it has a negatively skewed distribution, a poor range of discrimination and diameter, and may not be robust when comparing erroneous trees. The cluster distance is a measure for comparing rooted trees that can be interpreted as a weighted version of the Robinson-Foulds distance. We show that, compared with the Robinson-Foulds distance, the cluster distance is much more robust towards small errors in the compared trees and has a significantly improved distribution and range.
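For rooted trees, the Robinson-Foulds distance that the cluster distance refines is the size of the symmetric difference of the two trees' cluster sets (the cluster distance weights these differences instead of counting them). A minimal sketch for trees given as nested tuples:

```python
def clusters(tree):
    """Collect the leaf set ('cluster') of every internal node of a rooted
    tree encoded as nested tuples, e.g. (('a', 'b'), ('c', 'd'))."""
    out = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        out.add(leaves)
        return leaves
    walk(tree)
    return out

def rf_distance(t1, t2):
    """Robinson-Foulds distance for rooted trees: size of the symmetric
    difference of their cluster sets."""
    return len(clusters(t1) ^ clusters(t2))

t1 = (('a', 'b'), ('c', 'd'))
t2 = (('a', 'c'), ('b', 'd'))
```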

Jucheol Moon, Oliver Eulenstein

RNA-Seq Data Analysis

Frontmatter
Truncated Robust Principal Component Analysis and Noise Reduction for Single Cell RNA-seq Data

The development of single cell RNA sequencing (scRNA-seq) has enabled innovative approaches to investigating mRNA abundances. In our study, we are interested in extracting the systematic patterns of scRNA-seq data in an unsupervised manner; to this end, we have developed two extensions of robust principal component analysis (RPCA). First, we present a truncated version of RPCA (tRPCA) that is much faster and more memory efficient. Second, we introduce noise reduction into tRPCA with $$L_2$$ regularization (tRPCAL2). Unlike RPCA, which considers only a low-rank matrix L and a sparse matrix S, the proposed method can also extract a noise matrix E inherent in modern genomic data. We demonstrate its usefulness by applying our methods to peripheral blood mononuclear cell (PBMC) scRNA-seq data. In particular, clustering the low-rank matrix L yields better classification of unlabeled single cells. Overall, the proposed variants are well suited for the high-dimensional and noisy data routinely generated in genomics.
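RPCA-style decompositions M = L + S (+ E) are typically computed by alternating a singular-value thresholding step for L with an elementwise shrinkage step for S. The shrinkage primitive, which produces the sparse component, can be sketched on its own (this is the generic operator, not the authors' truncated variant):

```python
def soft_threshold(x, tau):
    """Elementwise shrinkage used in RPCA to build the sparse matrix S:
    values within tau of zero are zeroed, others are pulled toward zero."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

row = [3.0, -0.5, 1.2, -4.0]
sparse_row = [soft_threshold(v, 1.0) for v in row]
# roughly [2.0, 0.0, 0.2, -3.0]
```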

Krzysztof Gogolewski, Maciej Sykulski, Neo Christopher Chung, Anna Gambin
Locality Sensitive Imputation for Single-Cell RNA-Seq Data

One of the most notable challenges in single cell RNA-Seq data analysis is the so-called drop-out effect, in which only a fraction of the transcriptome of each cell is captured. The random nature of drop-outs, however, makes it possible to use imputation methods to correct for them. In this paper we study several existing scRNA-Seq imputation methods and propose a novel iterative imputation approach based on efficiently computing highly similar cells. We then present the results of a comprehensive assessment of the existing and proposed methods on real scRNA-Seq datasets with varying per-cell sequencing depth.
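The core averaging step of similarity-based imputation can be sketched as follows; the paper's method is iterative and relies on an efficient similar-cell search, which this hypothetical single-pass sketch omits:

```python
def impute_dropouts(cell, neighbors):
    """Replace zero (dropped-out) entries of an expression vector with the
    mean of the corresponding entries in highly similar cells; nonzero
    entries are kept as observed."""
    imputed = list(cell)
    for gene, value in enumerate(cell):
        if value == 0:
            vals = [nb[gene] for nb in neighbors]
            imputed[gene] = sum(vals) / len(vals)
    return imputed

cell = [5.0, 0.0, 2.0]                            # one cell, gene 1 dropped out
neighbors = [[4.0, 3.0, 2.0], [6.0, 1.0, 2.0]]    # its most similar cells
result = impute_dropouts(cell, neighbors)
```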

Marmar Moussa, Ion I. Măndoiu
Backmatter
Metadata
Title
Bioinformatics Research and Applications
edited by
Fa Zhang
Zhipeng Cai
Pavel Skums
Dr. Shihua Zhang
Copyright Year
2018
Electronic ISBN
978-3-319-94968-0
Print ISBN
978-3-319-94967-3
DOI
https://doi.org/10.1007/978-3-319-94968-0