
## About this book

This book constitutes the proceedings of the 16th International Symposium on Bioinformatics Research and Applications, ISBRA 2020, held in Moscow, Russia, in December 2020.
The 23 full papers and 18 short papers presented in this book were carefully reviewed and selected from 131 submissions. They were organized in topical sections named: genome analysis; systems biology; computational proteomics; machine and deep learning; and data analysis and methodology.

## Table of contents

### Mitochondrial Haplogroup Assignment for High-Throughput Sequencing Data from Single Individual and Mixed DNA Samples

The inference of mitochondrial haplogroups is an important step in forensic analysis of DNA samples collected at a crime scene. In this paper we introduced efficient inference algorithms based on Jaccard similarity between variants called from high-throughput sequencing data of such DNA samples and mutations collected in public databases such as PhyloTree. Experimental results on real and simulated datasets show that our mutation analysis methods have accuracy comparable to that of state-of-the-art methods based on haplogroup frequency estimation for both single-individual samples and two-individual mixtures, with a much lower running time.
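The core idea of scoring a sample against haplogroup definitions by Jaccard similarity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the haplogroup names and mutation sets are invented toy data, whereas real definitions would be taken from a database such as PhyloTree.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def assign_haplogroup(sample_variants, haplogroup_mutations):
    """Return (best_haplogroup, score) by maximum Jaccard similarity."""
    scores = {
        name: jaccard(sample_variants, muts)
        for name, muts in haplogroup_mutations.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical haplogroups and mutation labels, for illustration only.
TOY_HAPLOGROUPS = {
    "hapA": {"73G", "263G", "750G"},
    "hapB": {"73G", "263G", "1438G", "4769G"},
    "hapC": {"152C", "263G", "16519C"},
}

sample = {"73G", "263G", "1438G"}  # variants called from a single-individual sample
best, score = assign_haplogroup(sample, TOY_HAPLOGROUPS)
print(best, score)  # hapB 0.75
```

Handling two-individual mixtures, as the paper does, would require scoring pairs of haplogroups against the pooled variant set; that extension is not shown here.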

### Signet Ring Cell Detection with Classification Reinforcement Detection Network

Identifying signet ring cells on pathological images is an important clinical task that is highly relevant to cancer grading and prognosis. However, it is challenging because the cells exhibit diverse visual appearances in crowded cellular images. The task has also received little attention from computational methods so far. This paper proposes a Classification Reinforcement Detection Network (CRDet) to alleviate the detection difficulties. CRDet is composed of a Cascade RCNN architecture and a specially devised Classification Reinforcement Branch (CRB), which consists of a dedicated context pool module and a corresponding feature enhancement classifier, aiming to extract more comprehensive and discriminative features from the cell and its surrounding context. With the reinforced features, small-sized cells can be well characterized, and thus better classification is expected. Experiments on a public signet ring cell dataset demonstrate that the proposed CRDet achieves better performance than popular CNN-based object detection models.

Sai Wang, Caiyan Jia, Zhineng Chen, Xieping Gao

### SPOC: Identification of Drug Targets in Biological Networks via Set Preference Output Control

Biological networks describe the relationships among molecular elements and help deepen the understanding of biological mechanisms and functions. A common problem is to identify the set of biomolecules that could be targeted by drugs to drive cells from disease states to healthy states, called desired states, as a way of realizing therapy for complex diseases. Most previous studies based on output control determine the set of steering nodes without considering available biological information. In this study, we propose a strategy that uses additional available information, such as FDA-approved drug targets, to restrict the range from which steering nodes are chosen in output control; we call this the Set Preference Output Control (SPOC) problem. A graph-theoretic algorithm is proposed to tackle it approximately by using Maximum Weighted Complete Matching (MWCM). Computational experiments on two biological networks illustrate that our proposed SPOC strategy outperforms the full control and output control strategies in identifying drug targets. Finally, case studies further demonstrate the role of combination therapy in the two biological networks, which reveals that our proposed SPOC strategy is potentially applicable to more complicated cases.

Hao Gao, Min Li, Fang-Xiang Wu

### Identification of a Novel Compound Heterozygous Variant in NBAS Causing Bone Fragility by the Type of Osteogenesis Imperfecta

Biallelic mutations in the NBAS gene have been reported to cause three different clinical presentations: short stature with optic nerve atrophy and Pelger-Huët anomaly (SOPH) syndrome, infantile liver failure syndrome 2 (ILFS2), and a combined severe phenotype including both SOPH and ILFS2 features. Here, we describe the case of a 6-year-old Yakut girl who presented with clinical signs of SOPH syndrome, acute liver failure (ALF) and bone fragility of the osteogenesis imperfecta (OI) type. Targeted panel sequencing of 494 connective tissue disease genes revealed that the patient carried novel compound heterozygous missense mutations in NBAS, c.2535G>T (p.Trp845Cys) and c.5741G>A (p.Arg1914His). The mutations affect evolutionarily conserved amino acid residues and are predicted to be highly damaging. Timely health care for patients with this combination of SOPH syndrome, ALF and bone fragility of the OI type can contribute to establishing coordinated multispecialty management focused on the patient's health problems through childhood.

D. A. Petukhova, E. E. Gurinova, A. L. Sukhomyasova, N. R. Maksimova

### Isoform-Disease Association Prediction by Data Fusion

Alternative splicing enables a gene to be spliced into different isoforms, which are closely related to diverse developmental abnormalities. Identifying isoform-disease associations helps to uncover the underlying pathology of various complex diseases and to develop precise treatments and drugs for these diseases. Although many approaches have been proposed for predicting gene-disease associations and isoform functions, few efforts have been made toward predicting isoform-disease associations at large scale; the main bottleneck is the lack of ground-truth isoform-disease associations. To bridge this gap, we propose a multi-instance learning inspired computational approach called IDAPred, which fuses genomics and transcriptomics data for isoform-disease association prediction. Given the bag-instance relationship between a gene and its spliced isoforms, IDAPred introduces a dispatch and aggregation term to dispatch gene-disease associations to individual isoforms and, in reverse, to aggregate these dispatched associations to the affiliated genes. Next, it fuses different genomics and transcriptomics data to replenish gene-disease associations and to induce a linear classifier for predicting isoform-disease associations in a coherent way. In addition, to alleviate the bias toward observed gene-disease associations, it adds a regularization term to differentiate the currently observed associations from the unobserved (potential) ones. Experimental results show that IDAPred significantly outperforms the related state-of-the-art methods.

Qiuyue Huang, Jun Wang, Xiangliang Zhang, Guoxian Yu

### EpIntMC: Detecting Epistatic Interactions Using Multiple Clusterings

Detecting epistatic interactions between multiple single nucleotide polymorphisms (SNPs) is crucial for identifying susceptibility genes associated with complex human diseases. Stepwise search approaches have been extensively studied as a way to greatly reduce the search space for follow-up detection of SNP interactions. However, most of these stepwise methods are prone to filtering out significant polymorphism combinations and thus have low detection power. In this paper, we propose a two-stage approach called EpIntMC, which uses multiple clusterings to significantly shrink the search space and reduce the risk of filtering out significant combinations before the follow-up detection. EpIntMC first introduces a matrix factorization based approach to generate multiple diverse clusterings that group SNPs into different clusters from different aspects, which helps to explore the genotype data more comprehensively and reduces the chance of filtering out potential candidates overlooked by a single clustering. In the search stage, EpIntMC applies an entropy score to screen SNPs in each cluster and uses Jaccard similarity to merge the most similar clusters into candidate sets. After that, EpIntMC uses exhaustive search on these candidate sets to precisely detect epistatic interactions. Extensive simulation experiments show that EpIntMC has higher (or comparable) power than related competitive solutions, and results on the Wellcome Trust Case Control Consortium (WTCCC) dataset also demonstrate its effectiveness.

Huiling Zhang, Guoxian Yu, Wei Ren, Maozu Guo, Jun Wang

### Improving Metagenomic Classification Using Discriminative k-mers from Sequencing Data

The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads in order to identify the species it contains. Most of the methods currently available focus on classifying reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached levels of correctness close to perfection, in terms of recall (the actual number of classified reads) performance falls to around 50%. One of the reasons is that the sequences in a sample can be very different from the corresponding reference genome, e.g. viral genomes are highly mutated. To address this issue, in this paper we study the problem of metagenomic read classification by improving the reference k-mer library with novel discriminative k-mers from the input sequencing reads. We evaluated the performance in different conditions against several other tools, and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com/davide92/K2Mem.git

Davide Storato, Matteo Comin
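To make the notion of a discriminative k-mer concrete, the sketch below extracts k-mers from a few reference sequences, keeps those that occur in exactly one taxon, and classifies a read by counting hits. It is only an illustration of the general idea under simplifying assumptions (toy sequences, exact matching, no reverse complements); it is not the K2Mem algorithm, and all function names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def discriminative_kmers(references, k):
    """Map each taxon to the k-mers found in its sequence and in no other taxon's."""
    per_taxon = {taxon: kmers(seq, k) for taxon, seq in references.items()}
    result = {}
    for taxon, own in per_taxon.items():
        others = set()
        for other_taxon, other_kmers in per_taxon.items():
            if other_taxon != taxon:
                others |= other_kmers
        result[taxon] = own - others
    return result


def classify_read(read, disc, k):
    """Assign the read to the taxon sharing the most discriminative k-mers, or None."""
    read_kmers = kmers(read, k)
    hits = {taxon: len(read_kmers & ks) for taxon, ks in disc.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


# Toy reference sequences, invented for illustration only.
refs = {"taxon1": "ACGTACGGA", "taxon2": "ACGTTTGCA"}
disc = discriminative_kmers(refs, k=4)

print(classify_read("TACGGA", disc, k=4))  # taxon1
print(classify_read("ACGT", disc, k=4))    # None: ACGT is shared by both taxa
```

A read that matches only shared k-mers stays unclassified, which mirrors the recall problem described in the abstract; enlarging the library with k-mers learned from the reads themselves is the step the paper adds and is not reproduced here.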

### Dilated-DenseNet for Macromolecule Classification in Cryo-electron Tomography

Cryo-electron tomography (cryo-ET) combined with subtomogram averaging (STA) is a unique technique for revealing macromolecule structures in their near-native state. However, due to macromolecular structural heterogeneity, low signal-to-noise ratio (SNR) and anisotropic resolution in the tomogram, macromolecule classification, a critical step of STA, remains a great challenge. In this paper, we propose a novel convolutional neural network, named 3D-Dilated-DenseNet, to improve the performance of macromolecule classification in STA. The proposed 3D-Dilated-DenseNet is tested on the synthetic dataset from the SHREC contest and on an experimental dataset, and compared with SHREC-CNN (the state-of-the-art CNN model in the SHREC contest) and the baseline 3D-DenseNet. The results show that 3D-Dilated-DenseNet significantly outperforms 3D-DenseNet, which in turn performs well above SHREC-CNN. Moreover, to further demonstrate the validity of dilated convolution in the classification task, we visualized the feature maps of 3D-Dilated-DenseNet and 3D-DenseNet. Dilated convolution extracts a much more representative feature map.

Shan Gao, Renmin Han, Xiangrui Zeng, Xuefeng Cui, Zhiyong Liu, Min Xu, Fa Zhang
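For readers unfamiliar with dilated convolution, the following sketch shows the operation in one dimension: the kernel taps are spaced `dilation` samples apart, so the receptive field grows without adding weights. It is a plain-Python illustration of the general operation only; the network in the paper is three-dimensional and its layer configuration is not reproduced here.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1D cross-correlation with taps spaced `dilation` apart."""
    span = (len(kernel) - 1) * dilation + 1  # receptive field of one output value
    out = []
    for start in range(len(signal) - span + 1):
        out.append(
            sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        )
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # simple difference filter with three weights

print(dilated_conv1d(signal, kernel, dilation=1))  # [-2, -2, -2, -2, -2]
print(dilated_conv1d(signal, kernel, dilation=2))  # [-4, -4, -4]
```

With `dilation=2` the same three weights cover five input samples instead of three, which is why dilation is often used to capture wider context at a fixed parameter count.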

### Ess-NEXG: Predict Essential Proteins by Constructing a Weighted Protein Interaction Network Based on Node Embedding and XGBoost

Essential proteins are indispensable in the development of organisms and cells. Identification of essential proteins lays the foundation for the discovery of drug targets and understanding of protein functions. Traditional biological experiments are expensive and time-consuming. Considering the limitations of biological experiments, many computational methods have been proposed to identify essential proteins. However, lots of noises in the protein-protein interaction (PPI) networks hamper the task of essential protein prediction. To reduce the effects of these noises, constructing a reliable PPI network by introducing other useful biological information to improve the performance of the prediction task is necessary. In this paper, we propose a model called Ess-NEXG which integrates RNA-Seq data, subcellular localization information, and orthologous information, for the prediction of essential proteins. In Ess-NEXG, we construct a reliable weighted network by using these data. Then we use the node2vec technique to capture the topological features of proteins in the constructed weighted PPI network. Last, the extracted features of proteins are put into a machine learning classifier to perform the prediction task. The experimental results show that Ess-NEXG outperforms other computational methods.

Nian Wang, Min Zeng, Jiashuai Zhang, Yiming Li, Min Li

### mapAlign: An Efficient Approach for Mapping and Aligning Long Reads to Reference Genomes

Long reads play an important role in the identification of structural variants, sequencing of repetitive regions, phasing of alleles, etc. In this paper, we propose a new approach for mapping long reads to reference genomes. We also propose a new method to generate accurate alignments of the long reads and the corresponding segments of the reference genome. The new mapping algorithm is based on the longest common subsequence with distance constraints. The new (local) alignment algorithm is based on the idea of recursive alignment of variable-size k-mers. Experiments show that our new method can generate better alignments in terms of both identity and alignment scores for both Nanopore and SMRT data sets. In particular, our method can align 91.53% and 85.36% of letters on reads to identical letters on reference genomes for human individuals in the Nanopore and SMRT data sets, respectively. The state-of-the-art method can only align 88.44% and 79.08% of read letters for the Nanopore and SMRT data sets, respectively. Our method is also faster than the state-of-the-art method. Availability: https://github.com/yw575/mapAlign

Wen Yang, Lusheng Wang

### Functional Evolutionary Modeling Exposes Overlooked Protein-Coding Genes Involved in Cancer

Numerous computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Compiling a catalog of cancer genes has profound implications for the understanding and treatment of the disease. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the evolutionary selection of genes by assessing the functional effects of mutations on protein-coding genes using a pre-trained machine-learning model. The framework compares the estimated effects of observed genetic variations against all possible single-nucleotide mutations in the coding human genome. Compared to existing methods, FABRIC makes minimal assumptions about the distribution of random mutations. To demonstrate its wide applicability, we applied FABRIC to both naturally occurring human variants and somatic mutations in cancer. In the context of cancer, ~3 M somatic mutations were extracted from over 10,000 cancerous human samples. Of the entire human proteome, 593 protein-coding genes show a statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with contemporary cancer gene catalogs. Notably, the majority of these genes (426) are unlisted in these catalogs, but a substantial fraction of them is supported by the literature. In the context of normal human evolution, we analyzed ~5 M common and rare variants from ~60 K individuals, discovering 6,288 significant genes. Over 98% of them are dominated by negative selection, supporting the notion of strong purifying selection during the evolution of the healthy human population. We present the FABRIC framework as an open-source project with a simple command-line interface.

Nadav Brandes, Nathan Linial, Michal Linial

### Testing the Agreement of Trees with Internal Labels

The input to the agreement problem is a collection $$\mathcal {P}= \{\mathcal {T}_1, \mathcal {T}_2, \dots , \mathcal {T}_k\}$$ of phylogenetic trees, called input trees, over partially overlapping sets of taxa. The question is whether there exists a tree $$\mathcal {T}$$, called an agreement tree, whose taxon set is the union of the taxon sets of the input trees, such that for each $$i \in \{1, 2, \dots , k\}$$, the restriction of $$\mathcal {T}$$ to the taxon set of $$\mathcal {T}_i$$ is isomorphic to $$\mathcal {T}_i$$. We give an $$\mathcal {O}(n k (\sum _{i \in [k]} d_i + \log ^2(nk)))$$ algorithm for a generalization of the agreement problem in which the input trees may have internal labels, where $$n$$ is the total number of distinct taxa in $$\mathcal {P}$$, $$k$$ is the number of trees in $$\mathcal {P}$$, and $$d_i$$ is the maximum number of children of a node in $$\mathcal {T}_i$$.

David Fernández-Baca, Lei Liu

### SVLR: Genome Structure Variant Detection Using Long Read Sequencing Data

Genome structural variants have a great impact on human phenotype and diversity, and have been linked to numerous diseases. Long read sequencing technologies have made it possible to find structural variants as long as ten thousand nucleotides. Long read based structural variant detection has therefore drawn the attention of many recent research projects, and many tools have recently been developed to detect structural variants from long reads. In this article, we present a new method, called SVLR, to detect Structural Variants based on Long Read sequencing data. Compared to existing methods, SVLR can detect three new kinds of structural variants: block replacements, block interchanges and translocations. Although these new structural variants are structurally more complicated, SVLR achieves accuracies that are comparable to those of the classic structural variants. Moreover, for the classic structural variants that can be detected by state-of-the-art methods (e.g., SVIM and Sniffles), our experiments demonstrate recall improvements of up to 38% without harming precision (i.e., above 78%). We also point out three directions for further improving structural variant detection in the future. Source code: https://github.com/GWYSDU/SVLR

Wenyan Gu, Aizhong Zhou, Lusheng Wang, Shiwei Sun, Xuefeng Cui, Daming Zhu

### De novo Prediction of Drug-Target Interaction via Laplacian Regularized Schatten-p Norm Minimization

The identification of drug-target interactions plays a crucial role in drug discovery and design. However, capturing interactions between drugs and targets via traditional biochemical experiments is an extremely laborious, expensive and time-consuming procedure. Therefore, the use of computational methods for predicting potential interactions to guide experimental verification has attracted a lot of attention. In this paper, we propose a new algorithm, named Laplacian Regularized Schatten-p Norm Minimization (LRSpNM), to predict potential target proteins for novel drugs and potential drugs for new targets. First, we take advantage of the drug and target similarity information to dynamically prefill part of the unknown interactions. Then, based on the assumption that the interaction matrix is low-rank, we use a Schatten-p norm minimization model to improve prediction performance in the new drug/target cases by combining the loss function with a Laplacian regularization term. Finally, we numerically solve the LRSpNM model with an efficient alternating direction method of multipliers (ADMM) algorithm. Performance evaluations on benchmark datasets show that LRSpNM achieves better and more robust performance than five state-of-the-art drug-target interaction prediction algorithms. In addition, we conduct a case study in practical applications, which also illustrates the effectiveness of our proposed method.

Gaoyan Wu, Mengyun Yang, Yaohang Li, Jianxin Wang

### Diagnosis of ASD from rs-fMRI Images Based on Brain Dynamic Networks

Resting-state functional magnetic resonance imaging (rs-fMRI), a non-invasive technique with high spatial and temporal resolution, can help characterize the pathogenesis of autism spectrum disorder (ASD). Some results have been achieved with machine learning techniques for diagnosing ASD from rs-fMRI data. However, most machine learning methods have neglected the temporal dependency of the time-series fMRI data. In this study, we propose a method for diagnosing ASD based on brain dynamic networks (BDNs), which are constructed from time-series rs-fMRI brain image data to describe the dynamic relationships among multiple brain regions. The least squares method with forward model selection was used to establish the BDNs, and the Bayesian information criterion (BIC) was adopted as the model selection criterion to avoid overfitting. The resulting BDNs are weighted directed networks. Then a feature extraction method was proposed to extract representative and discriminative features from the BDNs. Lastly, machine learning classifiers were trained on the whole ABIDE I cohort to diagnose ASD. An accuracy of 88.8% was achieved, which is higher than that of any previously reported method.

Hongyu Guo, Wutao Yin, Sakib Mostafa, Fang-Xiang Wu

### MiRNA-Disease Associations Prediction Based on Negative Sample Selection and Multi-layer Perceptron

MicroRNAs (miRNAs) are a class of non-coding RNAs of approximately 22 nucleotides. Cumulative evidence from biological experiments has confirmed that miRNAs play a key role in many complex human diseases. Therefore, the accurate identification of potential associations between miRNAs and diseases is beneficial to understanding the mechanisms of diseases, developing drugs and treating complex diseases. We propose a new method to predict miRNA-disease associations based on a negative sample selection strategy and a multi-layer perceptron (called NMLPMDA). To obtain more similarity information, NMLPMDA integrates the miRNA functional similarity and the Gaussian interaction profile (GIP) kernel similarity of miRNAs as the final miRNA similarity, and integrates the disease semantic similarity and the GIP kernel similarity of diseases as the final disease similarity. In particular, we propose a negative sample selection strategy based on common gene information to select more reliable negative samples from unknown miRNA-disease associations. Five-fold cross validation is used to evaluate the performance of NMLPMDA and other competing methods. On four datasets (HMDD2.0-Yan, HMDD2.0-Lan, HMDD2.0-You, HMDD3.0), the AUC values of NMLPMDA are 0.9278, 0.9206, 0.9301 and 0.9350, respectively. In addition, we illustrate the prediction ability of NMLPMDA for lymphoma. As a result, 28 of the top 30 miRNAs associated with the disease have been validated experimentally in dbDEMC and previous studies. These experimental results indicate that NMLPMDA is a reliable model for predicting associations between miRNAs and diseases.

Na Li, Guihua Duan, Cheng Yan, Fang-Xiang Wu, Jianxin Wang
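The Gaussian interaction profile (GIP) kernel mentioned in the abstract is commonly defined as $K(i, j) = \exp(-\gamma \lVert IP(i) - IP(j) \rVert^2)$, where $IP(i)$ is row $i$ of the binary association matrix and $\gamma$ is a bandwidth normalized by the mean squared norm of the profiles. The sketch below follows that common definition; the association matrix is toy data, and the normalization constant `gamma_prime=1.0` is an assumption, since the abstract does not state the setting used in NMLPMDA.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of a 0/1 association matrix."""
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are zero")
    gamma = gamma_prime / mean_sq_norm  # bandwidth normalized by the average profile norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Toy miRNA-by-disease association matrix (rows: miRNAs, columns: diseases).
assoc = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
K = gip_kernel(assoc)
for row in K:
    print([round(v, 4) for v in row])
```

Applying the same function to the transposed matrix gives the disease-side GIP similarity; how it is then combined with functional or semantic similarity is specific to the method and is not shown here.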

### Checking Phylogenetic Decisiveness in Theory and in Practice

Suppose we have a set X consisting of n taxa and we are given information from k loci from which to construct a phylogeny for X. Each locus offers information for only a fraction of the taxa. The question is whether this data suffices to construct a reliable phylogeny. The decisiveness problem expresses this question combinatorially. Although a precise characterization of decisiveness is known, the complexity of the problem is open. Here we relate decisiveness to a hypergraph coloring problem. We use this idea to (1) obtain lower bounds on the amount of coverage needed to achieve decisiveness, (2) devise an exact algorithm for decisiveness, (3) develop problem reduction rules, and use them to obtain efficient algorithms for inputs with few loci, and (4) devise an integer linear programming formulation of the decisiveness problem, which allows us to analyze data sets that arise in practice.

Ghazaleh Parvini, Katherine Braught, David Fernández-Baca

### TNet: Phylogeny-Based Inference of Disease Transmission Networks Using Within-Host Strain Diversity

The inference of disease transmission networks from genetic sequence data is an important problem in epidemiology. One popular approach for building transmission networks is to reconstruct a phylogenetic tree using sequences from disease strains sampled from (a subset of) infected hosts and infer transmissions based on this tree. However, most existing phylogenetic approaches for transmission network inference cannot take within-host strain diversity into account, which affects their accuracy, and, moreover, are highly computationally intensive and unscalable. In this work, we introduce a new phylogenetic approach, TNet, for inferring transmission networks that addresses these limitations. TNet uses multiple strain sequences from each sampled host to infer transmissions and is simpler and more accurate than existing approaches. Furthermore, TNet is highly scalable and able to distinguish between ambiguous and unambiguous transmission inferences. We evaluated TNet on a large collection of 560 simulated transmission networks of various sizes and diverse host, sequence, and transmission characteristics, as well as on 10 real transmission datasets with known transmission histories. Our results show that TNet outperforms two other recently developed methods, phyloscanner and SharpTNI, that also consider within-host strain diversity using a similar computational framework. TNet is freely available as open source from https://compbio.engr.uconn.edu/software/TNet/

Saurav Dhar, Chengchen Zhang, Ion Mandoiu, Mukul S. Bansal

### Cancer Breakpoint Hotspots Versus Individual Breakpoints Prediction by Machine Learning Models

Genome rearrangement is a hallmark of all cancers. Cancer breakpoint prediction has proved to be a difficult task, and various machine learning models have not achieved high prediction power. We investigated the power of machine learning models to predict breakpoint hotspots selected with different density thresholds and also compared the prediction of hotspots with that of individual breakpoints. We found that hotspots are predicted considerably better than individual breakpoints. When choosing a hotspot selection criterion, the test ROC AUC alone is not enough to choose the best model; the lift of recall and the lift of precision should also be taken into consideration. Investigation of the lift of recall and lift of precision showed that it is impossible to select a single hotspot selection criterion for all cancer types, but that there are three to four distinct groups of cancers with similar properties. Overall, the presented results point to the need to choose different hotspot selection criteria for different types of cancer.

Kseniia Cheloshkina, Islam Bzhikhatlov, Maria Poptsova

### Integer Linear Programming Formulation for the Unified Duplication-Loss-Coalescence Model

The classical Duplication-Loss-Coalescence parsimony model (DLC-model) is a powerful tool when studying the complex evolutionary scenarios of simultaneous duplication-loss and deep coalescence events in evolutionary histories of gene families. However, inferring such scenarios is an intrinsically difficult problem and, therefore, prohibitive for larger gene families typically occurring in practice. To overcome this stringent limitation, we make the first step by describing a non-trivial and flexible Integer Linear Programming (ILP) formulation for inferring DLC evolutionary scenarios. To make the DLC-model more practical, we then introduce two sensibly constrained versions of the model and describe two respectively modified versions of our ILP formulation reflecting these constraints. Using a simulation study, we showcase that our constrained ILP formulation computes evolutionary scenarios that are substantially larger than the scenarios computable under our original ILP formulation and DLCPar. Further, scenarios computed under our constrained DLC-model are overall remarkably accurate when compared to corresponding scenarios under the original DLC-model.

Javad Ansarifar, Alexey Markin, Paweł Górecki, Oliver Eulenstein

### In Silico-Guided Discovery of Potential HIV-1 Entry Inhibitors Mimicking bNAb N6: Virtual Screening, Docking, Molecular Dynamics, and Post-Molecular Modeling Analysis

An integrated computational approach to in silico drug design was used to identify novel HIV-1 entry inhibitor scaffolds mimicking the broadly neutralizing antibody (bNAb) N6, which targets the CD4-binding site of the viral gp120 protein. This computer-based approach included (i) generation of pharmacophore models representing 3D arrangements of the chemical functionalities that make bNAb N6 active towards the CD4-binding site of gp120, (ii) shape- and pharmacophore-based identification of N6-mimetic candidates with the web-oriented virtual screening platform Pharmit, (iii) molecular docking of the identified compounds with gp120, (iv) optimization of the docked ligand/gp120 complexes using the semiempirical quantum chemical method PM7, and (v) molecular dynamics simulations of the docked structures followed by binding free energy calculations. As a result, six hits able to mimic the key interactions of N6 with the Phe-43 cavity of gp120 were selected as the most probable N6-mimetic candidates. The pivotal role in the interaction of these compounds with gp120 is shown to be played by multiple van der Waals contacts with conserved residues of the hydrophobic Phe-43 cavity, which is critical for HIV-1 binding to the cellular receptor CD4, as well as by a hydrogen bond with Asp-368 of gp120 that increases the chemical affinity without activating an unwanted allosteric effect. According to the molecular dynamics data, the complexes of the identified molecules with gp120 are energetically stable and show lower values of binding free energy than the HIV-1 entry inhibitors NBD-11021 and DMJ-II-121, which were used in the calculations as a positive control. Taken together, the findings suggest that these compounds may serve as promising scaffolds for the development of novel, highly potent and broad anti-HIV-1 therapeutics.

Alexander M. Andrianov, Grigory I. Nikolaev, Yuri V. Kornoushenko, Anna D. Karpenko, Ivan P. Bosko, Alexander V. Tuzikov

### Learning Structural Genetic Information via Graph Neural Embedding

Learning continuous vector representations of genes has proved beneficial for many bioinformatics tasks, as it can incorporate information from various sources, including gene interactions and gene-disease interactions. However, most existing approaches, following a paradigm stemming from the natural language processing community, treat the embedding context in a flat fashion, such as a sequence, and tend to overlook the fact that proteins are more likely to function together. In this study, we propose an unsupervised gene embedding algorithm that uses a graph convolutional network to learn structural information about genes from their neighborhoods in genetic interaction networks. We also propose a neighborhood sampling strategy to generate training samples. Our approach does not assume conditional independence of the node neighborhood and focuses on learning structural information. We compare our method against state-of-the-art baselines, and experimental results demonstrate the effectiveness of our approach.

Yuan Xie, Yulong Pei, Yun Lu, Haixu Tang, Yuan Zhou

### The Cross-Interpretation of QSAR Toxicological Models

The influence of the molecular structure of different organic compounds on acute toxicity, developmental toxicity and mutagenicity was investigated using the 2D simplex representation of molecular structure together with Support Vector Machine (SVM), Random Forest (RF), Gradient Boosting Machine (GBM) and Partial Least Squares (PLS) methods. Suitable QSAR (Quantitative Structure-Activity Relationship) models were obtained. The study focused on QSAR model interpretation. Its aim was to develop a set of structural fragments that consistently increase various types of toxicity. The interpretation made it possible to detail the molecular environment of known toxicophores and to propose new fragments.

Oleg Tinkov, Pavel Polishchuk, Veniamin Grigorev, Yuri Porozov

### A New Network-Based Tool to Analyse Competing Endogenous RNAs

MicroRNA targets that interact with one another are termed competing endogenous RNAs (ceRNAs). Since the discovery that microRNAs exert repressive activity through different mechanisms, various experimental and computational approaches have been developed to understand the relationships among their targets. We developed the package ceRNAnetsim, which provides a network-based computational method that considers the expression levels and interaction factors of microRNAs and their targets. Using ceRNA targets with similar expression values as triggers on a relatively small network of 4 microRNAs and 20 gene targets, we show that the perturbation efficiency of these ceRNAs on the network differs significantly. However, the change was observed in the time (or number of iterations) that nodes on the network needed to reach steady state. The package thus offers a user-friendly method for understanding complex ceRNA relationships, simulating the fluctuating behavior of ceRNAs, clarifying the mechanisms of regulation, and identifying potentially important ceRNA elements. The ceRNAnetsim package is available from Bioconductor.

Selcen Ari Yuka, Alper Yilmaz

### Deep Ensemble Models for 16S Ribosomal Gene Classification

In bioinformatics analysis, correctly identifying an unknown sequence by matching it against known sequences is a crucial initial step. One of the constantly evolving, open, and challenging areas of research is understanding the adaptation of microbiome communities from different environments as well as the human gut. The critical component of such studies is to analyze the 16S rRNA gene sequence and classify it to the corresponding taxonomy. The recent literature describes such sequence classification tasks being solved with many algorithms, from early methods such as k-mer frequency matching and assembly-based clustering to more advanced machine learning methods, for instance random forests, naïve Bayesian techniques, and, recently, deep learning architectures. Our previous work was a comprehensive study of 16S rRNA gene classification using simple single-network models: Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). That study demonstrated very promising classification results at the family, genus, and species taxonomic levels, prompting an immediate investigation into deep ensemble models for the problem at hand. In this study, we attempt to classify the 16S rRNA gene using deep ensemble models along with a hybrid model that emulates an ensemble in its early convolutional layers, followed by a recurrent layer.

Heta P. Desai, Anuja P. Parameshwaran, Rajshekhar Sunderraman, Michael Weeks

### Search for Tandem Repeats in the First Chromosome from the Rice Genome

Using the RPWM method, we searched the rice genome for tandem repeats with period lengths of 2 to 50 nucleotides. We compared the effectiveness of the RPWM method with that of Mreps, T-reks, Tandem Repeat Finder and ATR Hunter. About 70% of the tandem repeats we found could not be found by the other algorithms. We also studied the correlation of dispersed repeats and transposons with tandem repeats, and we hypothesize that some of the dispersed repeats and transposons originated from tandem repeats.

Eugene V. Korotkov, Anastasya M. Kamionskaya, Maria A. Korotkova

### Deep Learning Approach with Rotate-Shift Invariant Input to Predict Protein Homodimer Structure

The ability to predict protein complexes is important for applications in drug design and for generating high-accuracy models in the cell. Deep learning techniques have recently shown significant success in protein structure prediction, but the protein docking problem remains unsolved. We developed a two-stage approach that consists of a deep convolutional neural network, which predicts the protein contact map for homodimers, and an optimization procedure based on gradient descent, which builds the homodimer structure from the contact map. The neural network takes as input a distance map calculated as all pairwise Euclidean distances between the CB atoms of the protein 3D structure; this input is invariant to rotation and translation. The network has a large receptive field to capture patterns in contacts between residues. The suggested approach could be generalized to heterodimers because it does not depend on the symmetry features inherent in homodimers. The presented algorithm could also be used for scoring protein homodimer models in docking.

Anna Hadarovich, Alexander Kalinouski, Alexander V. Tuzikov
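
The invariance property mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the coordinates, the helper names `distance_map` and `rotate_z_and_shift`, and the choice of a rotation about the z-axis are assumptions made purely for illustration.

```python
import math

def distance_map(coords):
    """Return the matrix of pairwise Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def rotate_z_and_shift(coords, angle, shift):
    """Rotate points about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]

# Hypothetical CB-atom coordinates (in angstroms) for a four-residue toy chain.
cb_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.6, 0.0), (8.2, 4.0, 2.5)]
moved = rotate_z_and_shift(cb_atoms, angle=1.1, shift=(10.0, -4.0, 2.0))

original_map = distance_map(cb_atoms)
moved_map = distance_map(moved)

# Largest element-wise difference between the two maps; it should be
# zero up to floating-point rounding, since rigid motions preserve distances.
max_diff = max(
    abs(p - q)
    for row_p, row_q in zip(original_map, moved_map)
    for p, q in zip(row_p, row_q)
)
```

Because the map depends only on distances between atoms, a network fed with it does not need to learn the same structure in every possible orientation.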

### Development of a Neural Network-Based Approach for Prediction of Potential HIV-1 Entry Inhibitors Using Deep Learning and Molecular Modeling Methods

A generative adversarial autoencoder for the rational design of potential HIV-1 entry inhibitors able to block the region of the viral envelope protein gp120 that is critical for virus binding to the cellular receptor CD4 was developed using deep learning methods. To this end, the following studies were carried out: (i) the architecture of the neural network was constructed; (ii) a virtual compound library of potential anti-HIV-1 agents was assembled for training the neural network; (iii) all compounds from this library were docked with gp120 and their binding free energies were calculated; (iv) molecular fingerprints were generated for the chemical compounds in the training dataset; (v) the neural network was trained, and the learning outcomes and the performance of the autoencoder were then assessed. The neural network was validated on a wide range of compounds from the ZINC database. The use of the neural network in combination with virtual screening of chemical databases was shown to form a productive platform for identifying basic structures that are promising for the design of novel antiviral drugs inhibiting the early stages of HIV infection.

Grigory I. Nikolaev, Nikita A. Shuldov, Arseny I. Anischenko, Alexander V. Tuzikov, Alexander M. Andrianov

### In Silico Design and Evaluation of Novel Triazole-Based Compounds as Promising Drug Candidates Against Breast Cancer

Novel triazole-based aromatase inhibitors (AIs) were developed computationally, and molecular modeling tools were then used to investigate the possible interaction modes of these compounds with the enzyme and to predict their binding affinity. To this end, potential AI candidates fully satisfying Lipinski's "rule of five" were designed in silico using the concept of click chemistry. Complexes of these drug-like molecules with the enzyme were then simulated by molecular docking and optimized by the semiempirical quantum chemical method PM7. To identify the most promising compounds, the stability of the PM7-based ligand/aromatase structures was estimated in terms of binding free energies and dissociation constants. At the final stage, structures of the top-ranking compounds bound to aromatase were analyzed by molecular dynamics simulations and binding free energy calculations. As a result, eight hits that specifically interact with the aromatase catalytic site and exhibit high-affinity ligand binding were selected for the final analysis. The selected AI candidates show strong attachment to the enzyme active site, suggesting that these small drug-like molecules may be good scaffolds for the development of novel potent drugs against breast cancer.

Alexander M. Andrianov, Grigory I. Nikolaev, Yuri V. Kornoushenko, Sergei A. Usanov

### Identification of Essential Genes with NemoProfile and Various Machine Learning Models

Genes are sequences of nucleotides in DNA that encode proteins. Essential genes are genes that are critical and indispensable for an organism's survival. Many network-based algorithms have been developed to identify essential genes. We introduce a novel approach to predicting essential genes that is based on network motif profiles (NemoProfile) and various machine learning models. Experimental results show that NemoProfile is an effective data feature generated from biological networks, and that balanced data is a critical factor in improving overall performance.

Yangxiao Wang, Wooyoung Kim

### NemoLib: Network Motif Libraries for Network Motif Detection and Analysis

Network motifs are frequent and unique subgraph patterns located inside networks, and have been applied to solve various biological problems. Due to the high computational costs of performing network motif analysis, various tools have been created to make the process more efficient. However, existing tools lack extensible functionality and provide limited output formats. This restricts the ability to use network motif analysis for extensive and exhaustive experiments in real problems. We provide NemoLib (Network Motif Libraries) as a general purpose tool for detection and analysis of network motifs. It is an easily adoptable and highly accessible tool with a focus on efficiency and extensibility.

### Estimating Enzyme Participation in Metabolic Pathways for Microbial Communities from RNA-seq Data

Metatranscriptome sequence data analysis is necessary for understanding biochemical changes in the microbial community and their effects. In this paper, we propose a methodology for estimating the activities of individual metabolic pathways in order to better understand the activity of the entire metabolic network. Our novel pipeline includes an expectation-maximization-based estimation of enzyme expression and a simultaneous estimation of the pathway activity level and the enzyme participation level in each pathway. We applied the pipeline to metatranscriptome data generated from surface water planktonic communities sampled over a day-night cycle in the Northern Gulf of Mexico (Louisiana Shelf). Our results show that the estimated enzyme expression, pathway activity levels, and enzyme participation levels in each pathway are robust and stable across all data points. In contrast to the expression of enzymes, the estimated activity levels of a significant number of metabolic pathways correlate strongly with the environmental parameters.

F. Rondel, R. Hosseini, B. Sahoo, S. Knyazev, I. Mandric, Frank Stewart, I. I. Măndoiu, B. Pasaniuc, A. Zelikovsky

### Identification of Virus-Receptor Interactions Based on Network Enhancement and Similarity

As a major component of the human-associated microbiome, viruses are directly associated with human health and disease. Receptor binding is critical for viral infection, so identifying potential virus-receptor interactions will help to systematically understand the mechanisms of these interactions and to effectively treat infectious diseases caused by viruses. Several computational models have been developed to identify virus-receptor interactions based on the assumption that similar viruses show similar interaction patterns with receptors and vice versa, but their performance needs to be improved. Furthermore, the virus network and the receptor network are noisy. Therefore, we present a new prediction model (NERLS) that identifies potential virus-receptor interactions based on Network Enhancement, virus sequence information, and receptor sequence information, using Regularized Least Squares. First, the virus network is constructed by averaging the virus sequence similarity and the Gaussian interaction profile (GIP) kernel similarity of viruses. These are calculated from the viral RefSeq genomes downloaded from NCBI and from known virus-receptor interactions, respectively. Similarly, we use the same averaging method to construct the receptor network from the amino acid sequence similarity and known virus-receptor interactions. Then Network Enhancement is applied to denoise the virus network and the receptor network. Finally, we employ the regularized least squares algorithm to identify potential virus-receptor interactions. The 10-fold cross validation (10CV) experimental results indicate that the average Area Under Curve (AUC) value of NERLS is 0.8930, which is superior to those of other computational models: 0.8675 (IILLS), 0.7959 (BRWH), 0.7577 (LapRLS), and 0.7128 (CMF). Furthermore, the Leave One Out Cross Validation (LOOCV) experimental results show that NERLS achieves an AUC value of 0.9210, which is better than other models (IILLS: 0.9061, BRWH: 0.8105, LapRLS: 0.7713, CMF: 0.7491). In addition, a case study confirms the effectiveness of NERLS in predicting potential virus-receptor interactions.

Lingzhi Zhu, Cheng Yan, Guihua Duan
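
The network construction step in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the NERLS implementation: the bandwidth normalisation is the one commonly used for GIP kernels and may differ from the authors' choice, and the toy interaction profiles, the sequence similarity values, and the function names `gip_kernel` and `mean_similarity` are all hypothetical.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of a 0/1 interaction matrix.

    Assumes at least one non-zero profile, so the mean squared norm is positive.
    """
    norms = [sum(v * v for v in row) for row in profiles]
    gamma = gamma_prime / (sum(norms) / len(norms))  # normalise by mean squared norm

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [[math.exp(-gamma * sq_dist(a, b)) for b in profiles] for a in profiles]

def mean_similarity(first, second):
    """Element-wise mean of two similarity matrices of the same shape."""
    return [
        [(x + y) / 2.0 for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]

# Hypothetical data: rows are three viruses, columns are three receptors.
interactions = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
# Hypothetical sequence similarity between the same three viruses.
sequence_sim = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.3], [0.2, 0.3, 1.0]]

gip = gip_kernel(interactions)
virus_network = mean_similarity(sequence_sim, gip)
```

Viruses with identical interaction profiles receive a GIP similarity of 1, so the averaged network rewards pairs that are close in both sequence and known receptor usage. The denoising and regularized least squares steps are not shown.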

### Enhanced Functional Pathway Annotations for Differentially Expressed Gene Clusters

Biological pathway enrichment analysis is mainly applied to interpret the correlated behavior of activated gene clusters. In traditional approaches, significant pathways are highlighted based on hypergeometric distribution statistics and the calculated P-values. However, such enrichment analysis ignores two important factors: the fold-change levels of gene expression and the locations of genes on biological pathways. In addition, several reports have shown that noncoding RNAs can inhibit or activate target genes and affect the results of over-representation analysis. Hence, in this study, we provide an alternative approach that enhances functional gene annotations by simultaneously considering different fold-change levels, gene locations in a pathway, and genes associated with noncoding RNAs. When these additional factors are considered, the ranking of significant P-values is rearranged and several important and associated biological pathways can be retrieved. To demonstrate the superior performance of the approach, we used two experimental RNA-seq datasets, from Birc5a and HIF2α knock-downs in zebrafish during embryogenesis. For the Birc5a knock-down experiments, two biological pathways, sphingolipid metabolism and Herpes simplex infection, were additionally identified; for the HIF2α knock-down experiments, four missed biological pathways could be re-identified: ribosome biogenesis in eukaryotes, proteasome, purine metabolism, and complement and coagulation cascades. Thus, a comprehensive enrichment analysis can recover significant biological pathways that would otherwise be missed, and it provides integrated and suitable annotations for further biological experiments.

Chun-Cheng Liu, Tao-Chuan Shih, Tun-Wen Pai, Chin-Hwa Hu, Lee-Jyi Wang
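
For readers unfamiliar with the traditional baseline that the abstract refers to, the hypergeometric over-representation test can be sketched in a few lines. This shows only the classical test, not the enhanced method of the paper; the gene counts and the function name `enrichment_p_value` are hypothetical.

```python
from math import comb

def enrichment_p_value(population, pathway_genes, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population    -- number of annotated genes in the background
    pathway_genes -- number of background genes that belong to the pathway
    selected      -- number of differentially expressed genes
    overlap       -- number of selected genes that belong to the pathway
    """
    upper = min(pathway_genes, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_genes, k) * comb(population - pathway_genes, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical toy numbers: 20 background genes, 5 in the pathway,
# 4 differentially expressed genes, 2 of which fall in the pathway.
p_value = enrichment_p_value(population=20, pathway_genes=5, selected=4, overlap=2)
```

The test treats every selected gene equally, which is why fold-change levels and gene positions within a pathway play no role in the resulting P-value.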

### Automated Detection of Sleep Apnea from Abdominal Respiratory Signal Using Hilbert-Huang Transform

Sleep Apnea (SA) seriously affects human life and health. In recent years, many studies have used polysomnography (PSG) to detect sleep apnea, but it is expensive and inconvenient. To solve this problem, this paper proposes a method that detects sleep apnea automatically from a single Abdominal Respiratory Signal. In this method, the Hilbert-Huang Transform (HHT) is used to extract frequency domain features, which are combined with time domain features. Sleep apnea is then detected by machine learning methods such as Support Vector Machine, AdaBoost and Random Forest (RF). The experimental results show that HHT can extract significant frequency domain features, and that the accuracy of sleep apnea detection can reach 95% using the Random Forest method. This method is better than existing methods in both the convenience and the accuracy of detection. It is more suitable for use at home and has a wide range of application prospects.

Xingfeng Lv, Jinbao Li, Qin Yan

### Na/K-ATPase Glutathionylation: in silico Modeling of Reaction Mechanisms

Na,K-ATPase is a redox-sensitive transmembrane protein. Understanding the mechanisms of Na,K-ATPase redox regulation can help to prevent impairment of Na,K-ATPase functioning under pathological conditions and to reduce cell damage and death. One of the basic mechanisms that protect Na,K-ATPase against oxidative stress is the glutathionylation reaction, which is aimed at reducing several principal oxidized cysteines (244, 458, and 459) that are involved in the regulation of Na,K-ATPase action. In this study, we carried out in silico modeling to evaluate glutathione affinity at various stages of the Na,K-ATPase action cycle, as well as to discover a reaction mechanism for disulfide bond formation between reduced glutathione and oxidized cysteine. To achieve this goal, conformer sampling was applied to both glutathione and Na,K-ATPase, the reliability of the protein-ligand complexes was examined by MD simulation, and the reaction mechanism was studied using the semi-empirical PM6-D3H4 approach, which can handle the optimization of large organic systems.

Yaroslav V. Solovev, Daria S. Ostroverkhova, Gaik Tamazian, Anton V. Domnin, Anastasya A. Anashkina, Irina Yu. Petrushanko, Eugene O. Stepanov, Yu. B. Porozov

### HiChew: a Tool for TAD Clustering in Embryogenesis

The three-dimensional structure of Drosophila chromatin has been shown to change at the early stages of embryogenesis from a state with no local structures to compartmentalized chromatin segregated into topologically associated domains (TADs). However, the dynamics of TAD formation and its association with expression and epigenetic dynamics are not fully understood. As TAD calling and the analysis of TAD dynamics have no standard, universally accepted solution, we have developed HiChew, a specialized tool for segmenting Hi-C maps into TADs of a given expected size and subsequently clustering TADs based on their dynamics during embryogenesis. To validate the approach, we demonstrate that HiChew clusters correlate with genomic and epigenetic characteristics. In particular, in accordance with previous findings, the maturation rate of TADs is positively correlated with the number of housekeeping genes per TAD and negatively correlated with the length of housekeeping genes. We also report a high positive correlation of the maturation rate of TADs with the growth rate of the associated ATAC-Seq signal.

Nikolai S. Bykov, Olga M. Sigalova, Mikhail S. Gelfand, Aleksandra A. Galitsyna

### SC1: A Tool for Interactive Web-Based Single Cell RNA-Seq Data Analysis

Single cell RNA-seq (scRNA-Seq) is critical for studying cellular function and phenotypic heterogeneity as well as the development of tissues and tumors. Here, we present a web-based interactive scRNA-Seq data analysis tool publicly accessible at https://sc1.engr.uconn.edu. The tool implements a novel method of selecting informative genes based on Term-Frequency Inverse-Document-Frequency (TF-IDF) scores and provides a broad range of methods for cell clustering, differential expression, gene enrichment, interactive visualization, and cell cycle analysis. In just a few steps, researchers can generate a comprehensive initial analysis and gain powerful insights from their single cell RNA-seq data.

Marmar Moussa, Ion I. Măndoiu

### Quantitative Analysis of the Dynamics of Maternal Gradients in the Early Drosophila Embryo

Predetermination, formation and maintenance of the primary morphogenetic gradient (bicoid, bcd, gradient) of the early Drosophila embryo involves many interrelated processes. Here we focus on a systems biological analysis of the bcd mRNA redistribution in an early embryo. The results of the quantitative analysis of experimental data, together with the results of their dynamic modeling, substantiate the role of active transport in the redistribution of the bcd mRNA.

Ekaterina M. Myasnikova, Victoria Yu. Samuta, Alexander V. Spirov

### Atom Tracking Using Cayley Graphs

While atom tracking with isotope-labeled compounds is an essential and sophisticated wet-lab tool for, e.g., illuminating reaction mechanisms, only a limited number of formal methods exist to approach the problem. Rigorous techniques are indispensable, particularly when large (bio-)chemical networks with stereo-specific reactions are considered. We present an approach that uses the right Cayley graph of a monoid to track atoms concurrently through sequences of reactions and to predict their potential location in product molecules. This can be used not only to systematically build hypotheses about reaction mechanisms or reject them (we will use the mechanism "Addition of the Nucleophile, Ring Opening, and Ring Closure" as an example), but also to infer naturally occurring subsystems of (bio-)chemical systems. We will exemplify the latter by analysing the carbon traces within the TCA cycle and inferring subsystems based on projections of the right Cayley graph onto a set of relevant atoms.

Marc Hellmuth, Daniel Merkle, Nikolai Nøjgaard

### Backmatter
