main-content

## Über dieses Buch

This book constitutes the proceedings of the 15th International Symposium on Bioinformatics Research and Applications, ISBRA 2019, held in Barcelona, Spain, in June 2019.

The 22 full papers presented in this book were carefully reviewed and selected from 95 submissions. They were organized in topical sections named: genome analysis; systems biology; computational proteomics; machine and deep learning; and data analysis and methodology.

## Inhaltsverzeichnis

### Computing a Consensus Phylogeny via Leaf Removal

Abstract
Given a set $$\mathcal{T} = \{T_1, T_2, \ldots , T_m\}$$ of phylogenetic trees with the same leaf-label set X, we wish to remove some leaves from the trees so that there is a tree T with leaf-label set X displaying all the resulting trees. One objective is to minimize the total number of leaves removed from the trees, while the other is to minimize the maximum number of leaves removed from an input tree. Chauve et al. [6] refer to the problem with the first (respectively, second) objective as AST-LR (respectively, AST-LR-d), and show that both problems are NP-hard. They further present algorithms for the parameterized versions of both problems, but it seems that their algorithm for the parameterized version of AST-LR is flawed [7]. In this paper, we present a new algorithm for the parameterized version of AST-LR and also show that Chauve et al.’s algorithm for the parameterized version of AST-LR-d can be sped up by an exponential factor. We further design heuristic integer-linear programming (ILP for short) models for AST-LR and AST-LR-d. Our experimental results show that the heuristic models can be used to significantly speed up solving the exact models proposed in [7].
Zhi-Zhong Chen, Shohei Ueta, Jingyu Li, Lusheng Wang

### The Review of Bioinformatics Tool for 3D Plant Genomics Research

Abstract
Since the genome of the nucleus is a complicated three-dimensional spatial structure but not a single linear structure, biologists consider that 3D structure of plant chromatin is highly correlated with the function of the genome, which can be used to study the regulation mechanisms of genes and their evolutionary process. Because plants are more prone to chromosome structural variation and the 3D structure of plant chromatin are highly correlated with the function of the genome, it is important to investigate the impact of chromosome structural variation on gene expression by analyzing 3D structure. Here, we will briefly review the current bioinformatics tools for 3D plant genome study, which covers Hi-C data processing tools, then are the tools for A and B compartments identification, topologically associated domains (TAD) identification, identification of significant interactions and visualization. And then, we could provide the useful information for the related 3D plant genomics research scientists to select the appropriate tools according to their study. Finally, we discuss how to develop the future 3D genomic plant bioinformatics tools to keep up with the pace of scientific research development.
Xiangyu Yang, Zhenghao Li, Jingtian Zhao, Tao Ma, Pengchao Li, Le Zhang

### Sorting by Reversals, Transpositions, and Indels on Both Gene Order and Intergenic Sizes

Abstract
During the evolutionary process, the genome is affected by various genome rearrangements, which are events that modify large stretches of the genetic material. In the literature, several models were designed to estimate the number of events that occurred during the evolution, but these models represent genomes as a sequence of genes, overlooking the genetic material between consecutive genes. However, recent studies show that taking into account the genetic material present between consecutive genes can be more realistic. Reversal and transposition are genome rearrangements widely studied in the literature. A reversal inverts a segment of the genome while a transposition swaps the positions of two consecutive segments. Genomes also undergo non-conservative events (events that alter the amount of genetic material) such as insertion and deletion, which insert and remove genetic material from intergenic regions of the genome, respectively. We study problems considering both gene order and intergenic regions size. We investigate the reversal distance between two genomes in two scenarios: with and without non-conservative events. For both problems, we show that they belong to NP-hard problems class and we present a 4-approximation algorithm. We also study the reversal and transposition distance between two genomes (and the variation with non-conservative events) and we present a 6-approximation algorithm.
Klairton Lima Brito, Géraldine Jean, Guillaume Fertin, Andre Rodrigues Oliveira, Ulisses Dias, Zanoni Dias

### Unifying Gene Duplication, Loss, and Coalescence on Phylogenetic Networks

Abstract
Statistical methods were recently introduced for inferring phylogenetic networks under the multispecies network coalescent, thus accounting for both reticulation and incomplete lineage sorting. Two evolutionary processes that are ubiquitous across all three domains of life, but are not accounted for by those methods, are gene duplication and loss (GDL).
In this work, we devise a three-piece model—phylogenetic network, locus network, and gene tree—that unifies all the aforementioned processes into a single model of how genes evolve in the presence of ILS, GDL, and introgression within the branches of a phylogenetic network. To illustrate the power of this model, we develop an algorithm for estimating the parameters of a phylogenetic network topology under this unified model.
We demonstrate the application of the model and the accuracy of the algorithm on simulated as well as biological data.
Our work adds to the biologist’s toolbox of methods for phylogenomic inference by accounting for more complex evolutionary processes.
Peng Du, Huw A. Ogilvie, Luay Nakhleh

### The Cluster Affinity Distance for Phylogenies

Abstract
Studying phylogenetic trees is fundamental to biology and benefitting a vast variety of other research areas. Comparing such trees is essential to such studies for which a growing and diverse collection of tree distances are available. In practice, tree distances suffer from problems that can severely limit their applicability. Notably, these distances include the cluster matching distance that is adapted from the Robinson-Foulds distance to overcome many of the drawbacks of this traditional measure. However, at the same time, the cluster matching distance is much more confined in its application than the Robinson-Foulds distance and makes sacrifices for satisfying the properties of a metric. Here, we propose the cluster affinity distance, a new tree distance that is adapted from the cluster matching distance but has not its drawbacks. Nevertheless, as we show, the cluster affinity distance preserves all of the properties that make the matching distance appealing.
Jucheol Moon, Oliver Eulenstein

### Modeling SNP-Trait Associations and Realizing Privacy-Utility Tradeoff in Genomic Data Publishing

Abstract
Genomic data privacy arises as one of the most important concerns facing the wide commoditization of DNA-genotyping. In this paper, we study the problem of privacy preserved kin-genomic data publishing. The major challenge in protecting kin-genomic data privacy is to protect against powerful attackers with abundant background knowledge. We propose a probabilistic model based on factor graph with the knowledge of publicly available GWAS statistics to reveal the dependency relationship between genotypes and phenotypes. Furthermore, a genomic data sanitization method is proposed to protect against optimal inference attacks launched by powerful attackers.
Zaobo He, Jianqiang Li

### Simultaneous Multi-Domain-Multi-Gene Reconciliation Under the Domain-Gene-Species Reconciliation Model

Abstract
The recently developed Domain-Gene-Species (DGS) reconciliation framework, which jointly models the evolution of a domain family inside one or more gene families and the evolution of those gene families inside a species tree, represents one of the most powerful computational techniques for reconstructing detailed histories of domain and gene family evolution in eukaryotic species. However, the DGS reconciliation framework allows for the reconciliation of only a single domain tree (representing a single domain family present in one or more gene families from the species under consideration) at a time, i.e., each domain tree is reconciled separately without consideration of any other domain families that might be present in the gene trees under consideration. However, this can lead to conflicting gene-species reconciliations for gene trees containing multiple domain families.
In this work, we address this problem by extending the DGS reconciliation model to simultaneously reconcile a set of domain trees, a set of gene trees, and a species tree. The new model, which we call the multi-DGS (mDGS) reconciliation model, produces a consistent joint reconciliation showing the evolution of each domain tree in its corresponding gene trees and the evolution of each gene tree inside the species tree. We formalize the mDGS reconciliation framework and define the associated computational problem, provide a heuristic algorithm for estimating optimal mDGS reconciliations (both the DGS and mDGS reconciliation problems are NP-hard), and apply our algorithm to a large dataset of over 3800 domain trees and over 7100 gene trees from 12 fly species. Our analysis of this dataset reveals interesting underlying patterns of co-occurrence of domains and genes, demonstrates the importance of mDGS reconciliation, and shows that the proposed heuristic is effective at estimating optimal mDGS reconciliations.
Lei Li, Mukul S. Bansal

### IDNDDI: An Integrated Drug Similarity Network Method for Predicting Drug-Drug Interactions

Abstract
A drug-drug interaction(DDI) was defined as the pharmacological effect(s) of a drug influenced by another drug. The positive DDIs can improve the therapeutic effect of patients. However, the negative DDIs can lead serious results, such as drug withdrawal from market and even patient death. Currently, multiple pharmaceutical drugs have widely been used to treat complex diseases, such as cancer. The traditional biomedical experiments are very time-consuming and very costly to validate new DDIs. Therefore, it is appealing to develop computational methods to discover potential DDIs. In this study, we propose a new computational method (called IDNDDI) to predict novel DDIs. Based on the binary vector of drug chemical, biological and phenotype data, IDNDDI computes the integrated drug feature similarity by the cosine similarity method. In addition, the node-based drug network diffusion method is used to calculate the relational initial scores for new drugs. To systematically evaluate the prediction performance of IDNDDI and compare it with other prediction methods, we conduct the 5-fold cross validation and de novo drug validation. In terms of the AUC (area under the ROC curve)value, IDNDDI achieves the better prediction performance in the 5-fold cross validation, specifically, the AUC value is 0.9691, which is larger than the state-of-the-art L1E (L1 Classifier ensemble method) results of 0.9570. In addition, IDNDDI also obtains the best prediction result in the de novo drug validation and the AUC reaches 0.9292. The prediction ability in application of our method is also illustrated by case studies. IDNDDI is an effective DDI prediction method which can help to reduce adverse drug reactions and improve the efficiency of drug development progress.
Cheng Yan, Guihua Duan, Yayan Zhang, Fang-Xiang Wu, Yi Pan, Jianxin Wang

### Model Revision of Boolean Regulatory Networks at Stable State

Abstract
Models of biological regulatory networks are essential to understand the cellular processes. However, the definition of such models is still mostly manually performed, and consequently prone to error. Moreover, as new experimental data is acquired, models need to be revised and updated. Here, we propose a model revision tool, capable of proposing the set of minimum repairs to render a model consistent with a set of experimental observations. We consider four possible repair operations, giving preference to function repairs over topological ones. Also, we consider observations at stable state, i.e., we do not consider the model dynamics. We evaluate our tool on five known logical models. We perform random changes considering several parameter configurations to assess the tool repairing capabilities. Whenever a model is repaired under the time limit, the tool successfully produces the optimal solutions to repair the model. Also, the number of repair operations required is less than or equal to the number of random changes applied to the original model.
Filipe Gouveia, Inês Lynce, Pedro T. Monteiro

### Gene- and Pathway-Based Deep Neural Network for Multi-omics Data Integration to Predict Cancer Survival Outcomes

Abstract
Data integration of multi-platform based omics data from biospecimen holds promise of improving survival prediction and personalized therapies in cancer. Multi-omics data provide comprehensive descriptions of human genomes regulated by complex interactions of multiple biological processes such as genetic, epigenetic, and transcriptional regulation. Therefore, the integration of multi-omics data is essential to decipher complex mechanisms of human diseases and to enhance treatments based on genetic understanding of each patient in precision medicine. In this paper, we propose a gene- and pathway-based deep neural network for multi-omics data integration (MiNet) to predict cancer survival outcomes. MiNet introduces a multi-omics layer that represents multi-layered biological processes of genetic, epigenetic, and transcriptional regulation, in the gene- and pathway-based neural network. MiNet captures nonlinear effects of multi-omics data to survival outcomes via a neural network framework, while allowing one to biologically interpret the model. In the extensive experiments with multi-omics data of Gliblastoma multiforme (GBM) patients, MiNet outperformed the current cutting-edge methods including SurvivalNet and Cox-nnet. Moreover, MiNet’s model showed the capability to interpret a multi-layered biological system. A number of biological literature in GBM supported the biological interpretation of MiNet. The open-source software of MiNet in PyTorch is publicly available at https://​github.​com/​DataX-JieHao/​MiNet.
Jie Hao, Mohammad Masum, Jung Hun Oh, Mingon Kang

### Identifying Human Essential Genes by Network Embedding Protein-Protein Interaction Network

Abstract
Essential genes play an indispensable role in cell viability and fertility. Identifying human essential genes helps us to study the functions of human genes, but also provides a way for finding potential targets for cancer and other diseases. Recently, with the publishing of human essential gene data and the availability of a large amount of biological data, some computational methods have been proposed to predict human essential genes based on genes’ DNA sequence or their topological properties in the protein-protein interaction (PPI) network. However, there is still some room to improve the prediction accuracy. In this work, we propose a novel supervised method to predict human essential genes by network embedding protein-protein interaction network. Our method extracts the features of the genes in network by mapping them to a latent space of features that maximally preserves the relationships between the genes and their network neighborhoods. After that, the features are input into a SVM classifier to predict human essential genes. Two human PPI networks are employed to evaluate the effectiveness of our method. The prediction results show that our method outperforms the method that only uses genes’ sequence information, but also is obviously superior to the method utilizing genes’ centrality properties in the network as input features.
Wei Dai, Qi Chang, Wei Peng, Jiancheng Zhong, Yongjiang Li

### Automated Hub-Protein Detection via a New Fused Similarity Measure-Based Multi-objective Clustering Framework

Abstract
In the field of computational biology and bioinformatics, there have been limited studies on the development of protein-protein proximity measures which blend multiple sources of biological properties of protein. In Protein-Protein Interaction Network (PPIN), hub-proteins play a central role. There are many literature with user-studied different degree cut-offs for defining hub-proteins. Therefore, there is a need for a standard method for identifying hub-proteins without manually determining the degree cut-off. In the current research article, an effort has been made towards addressing both problems. At first, we have proposed a new Fused protein-protein Similarity measure - FuSim, which involves biological properties of both Gene Ontology (GO) and PPIN. Later, utilizing the proposed similarity measure, a multi-objective clustering algorithm-based automated hub-protein detection framework is developed.
Sudipta Acharya, Laizhong Cui, Yi Pan

### Improving Identification of Essential Proteins by a Novel Ensemble Method

Abstract
Essential proteins are indispensable for cell survival, and the identification of essential proteins plays a critical role in biological and pharmaceutical design research. Recently, some machine learning methods have been proposed by introducing effective protein features or by employing powerful classifiers. Seldom of them focused on improving the prediction accuracy by designing efficient strategies to ensemble different classifiers. In this work, a novel ensemble learning framework called by Tri-ensemble was proposed to integrate different classifiers, which selected three weak classifiers and trained these classifiers by continually adding the samples that are predicted to have abnormally high or abnormally low properties by the other two classifiers. We applied Tri-ensemble on predicting the essential protein of Yeast and E.coli. The results show that our approach achieves better performance than both individual classifiers and the other ensemble learning methods.
Wei Dai, Xia Li, Wei Peng, Jurong Song, Jiancheng Zhong, Jianxin Wang

### Deep Learning and Random Forest-Based Augmentation of sRNA Expression Profiles

Abstract
The lack of well-structured annotations in a growing amount of RNA expression data complicates data interoperability and reusability. Commonly used text mining methods extract annotations from existing unstructured data descriptions and often provide inaccurate output that requires manual curation. Automatic data-based augmentation (generation of annotations on the base of expression data) can considerably improve the annotation quality and has not been well-studied. We formulate an automatic augmentation of small RNA-seq expression data as a classification problem and investigate deep learning (DL) and random forest (RF) approaches to solve it. We generate tissue and sex annotations from small RNA-seq expression data for tissues and cell lines of homo sapiens. We validate our approach on 4243 annotated small RNA-seq samples from the Small RNA Expression Atlas (SEA) database. The average prediction accuracy for tissue groups is 98% (DL), for tissues - 96.5% (DL), and for sex - 77% (DL). The “one dataset out” average accuracy for tissue group prediction is 83% (DL) and 59% (RF). On average, DL provides better results as compared to RF, and considerably improves classification performance for ‘unseen’ datasets.
Jelena Fiosina, Maksims Fiosins, Stefan Bonn

Abstract
Opioid abuse epidemics is a major public health emergency in the US. Social media platforms have facilitated illicit drug trading, with significant amount of drug advertisement and selling being carried out online. In order to understand dynamics of drug abuse epidemics and design efficient public health interventions, it is essential to extract and analyze data from online drug markets. In this paper, we present a computational framework for automatic detection of illicit drug ads in social media, with Google+ being used for a proof-of-concept. The proposed SVM- and CNN-based methods have been extensively validated on the large dataset containing millions of posts collected using Google+ API. Experimental results demonstrate that our methods can efficiently identify illicit drug ads with high accuracy. Both approaches have been extensively validated using the dataset containing millions of posts collected using Google+ API. Experimental results demonstrate that both methods allow for accurate identification of illicit drug ads.
Fengpan Zhao, Pavel Skums, Alex Zelikovsky, Eric L. Sevigny, Monica Haavisto Swahn, Sheryl M. Strasser, Yubao Wu

### A Robustness Analysis of Dynamic Boolean Models of Cellular Circuits

Abstract
With ever growing amounts of omics data, the next challenge in biological research is the interpretation of these data to gain mechanistic insights about cellular function. Dynamic models of cellular circuits that capture the activity levels of proteins and other molecules over time offer great expressive power by allowing the simulation of the effects of specific internal or external perturbations on the workings of the cell. However, the study of such models is at its infancy and no large scale analysis of the robustness of real models to changing conditions has been conducted to date. Here we provide a computational framework to study the robustness of such models using a combination of stochastic simulations and integer linear programming techniques. We apply our framework to a large collection of cellular circuits and benchmark the results against randomized models. We find that the steady states of real circuits tend to be more robust in multiple aspects compared to their randomized counterparts.
Ariel Bruner, Roded Sharan

### Graph Transformations, Semigroups, and Isotopic Labeling

Abstract
The Double Pushout (DPO) approach for graph transformation naturally allows an abstraction level of biochemical systems in which individual atoms of molecules can be traced automatically within chemical reaction networks. Aiming at a mathematical rigorous approach for isotopic labeling design we convert chemical reaction networks (represented as directed hypergraphs) into transformation semigroups. Symmetries within chemical compounds correspond to permutations whereas (not necessarily invertible) chemical reactions define the transformations of the semigroup. An approach for the automatic inference of informative labeling of atoms is presented, which allows to distinguish the activity of different pathway alternatives within reaction networks. To illustrate our approaches, we apply them to the reaction network of glycolysis, which is an important and well understood process that allows for different alternatives to convert glucose into pyruvate.
Jakob L. Andersen, Daniel Merkle, Peter S. Rasmussen

### Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing

Abstract
Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed-up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time.
In this paper we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values in order to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup of 6.2x, outperforming previous methods. Software and Datasets are available at ISSH
Enrico Petrucci, Laurent Noé, Cinzia Pizzi, Matteo Comin

### DM-SIRT: A Distributed Method for Multi-tilt Reconstruction in Electron Tomography

Abstract
The ‘missing wedge’ of single tilt in electron tomography introduces severely artifacts into the reconstructed results. To reduce the ‘missing wedge’ effect, a widely used method is ‘multi-tilt reconstruction’, which collects projections using multiple different axes. However, as the number of tilt series increases, its computing and memory costs also rises. While the demand to speed up its reconstruction procedure grows, the huge memory requirement from the 3D structure and strong data dependencies from projections heavily limit its parallelization. In our work, we present a new fully distributed multi-tilt reconstruction framework named DM-SIRT. To improve the parallelism of the reconstruction process and reduce the memory requirements of each process, we formulate the multi-tilt reconstruction as a consensus optimization problem and design a distributed multi-tilt SIRT algorithm. To improve the reconstruction resolution, we applied a multi-agent consensus equilibrium (MACE) with a new data division strategy. Experiments show that along with the visually and quantitatively improvement in resolution, DM-SIRT can acquire a 5.4x speedup ratio compared to the raw multi-tilt reconstruction version. It also has 87% decrease of memory overhead and 8 times more scalable than the raw reconstruction version.
Zihao Wang, Jingrong Zhang, Xintong Liu, Zhiyong Liu, Xiaohua Wan, Fa Zhang

### PeakPass: Automating ChIP-Seq Blacklist Creation

Abstract
ChIP-Seq blacklists contain genomic regions that frequently produce artifacts and noise in ChIP-Seq experiments. To improve signal-to-noise ratio, ChIP-Seq pipelines often remove data points that map to blacklist regions. Existing blacklists have been compiled in a manual or semi-automated way. In this paper we describe PeakPass, an efficient method to generate blacklists, and present evidence that blacklists can increase ChIP-Seq data quality. PeakPass leverages machine learning and attempts to automate blacklist generation. PeakPass uses a random forest classifier in combination with genomic features such as sequence, annotated repeats, complexity, assembly gaps, and the ratio of multi-mapping to uniquely mapping reads to identify artifact regions. We have validated PeakPass on a large dataset and tested it for the purpose of upgrading a blacklist to a new reference genome version. We trained PeakPass on the ENCODE blacklist for the hg19 human reference genome, and created an updated blacklist for hg38. To assess the performance of this blacklist we tested 42 ChIP-Seq replicates from 24 experiments using the Relative Strand Correlation (RSC) metric as a quality measure. Using the blacklist generated by PeakPass resulted in a statistically significant increase in RSC over the existing ENCODE blacklist for hg38 – average RSC was increased by 50% over the ENCODE blacklist, while only filtering an average of 0.1% of called peaks.
Charles E. Wimberley, Steffen Heber

### Maximum Stacking Base Pairs: Hardness and Approximation by Nonlinear LP-Rounding

Abstract
Maximum stacking base pairs is a fundamental combinatorial problem from RNA secondary structure prediction under the energy model. The basic maximum stacking base pairs problem can be described as: given an RNA sequence, find a maximum number of base pairs such that each chosen base pair has at least one parallel and adjacent partner (i.e., they form a stacking). This problem is NP-hard, no matter whether the candidate base pairs follow the biology principle or are given explicitly as input. This paper investigates a restricted version of this problem where the base pairs are given as input and each base is associated with at most k (a constant) base pairs. We show that this restricted version is still APX-hard, even if the base pairs are weighted. Moreover, by a nonlinear LP-rounding method, we present an approximation algorithm with a factor $$\frac{32(k-1)^{3}e^{3}}{8(k-1)e-1}$$. Applying our algorithms on the simulated data, the actual approximation factor is in fact much better than this theoretical bound.
Lixin Liu, Haitao Jiang, Peiqiang Liu, Binhai Zhu, Daming Zhu

### Greedy Partition Distance Under Stochastic Models - Analytic Results

Abstract
Gene partitioning is a very common task in genomics, based on several criteria such as gene function, homology, interactions, and more. Given two such partitions, a metric to compare them is called for. One such metric is based on multi symmetric difference and elements are removed from both partitions until identity is reached. While such a task can be done accurately by a maximum weight bipartite matching, in common settings in comparative genomics, the standard algorithm to solve this problem might become impractical. In previous works we have studied the universal pacemaker (UPM) where genes are clustered according to mutation rate correlation, and suggested a very fast and greedy procedure for comparing partitions. This procedure is guaranteed to provide a poor approximation ratio of 1/2 under arbitrary inputs.
In this work we give a probabilistic analysis of this procedure under a common and natural stochastic environment. We show that under mild size requirements, and a sound model assumption, this procedure returns the correct result with high probability. Furthermore, we show that in the context of the UPM, this natural requirement holds automatically, rendering statistical consistency of this fast greedy procedure. We also discuss the application of this procedure in the comparative genomics rudimentary task of gene orthology where such a solution is imperative.
Sagi Snir

### Backmatter

Weitere Informationen