
## About this book

The two-volume set LNBI 11465 and LNBI 11466 constitutes the proceedings of the 7th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2019, held in Granada, Spain, in May 2019.

The 97 papers presented in the proceedings were carefully reviewed and selected from 301 submissions. The papers are organized in topical sections as follows:

Part I: High-throughput genomics: bioinformatics tools and medical applications; omics data acquisition, processing, and analysis; bioinformatics approaches for analyzing cancer sequencing data; next generation sequencing and sequence analysis; structural bioinformatics and function; telemedicine for smart homes and remote monitoring; clustering and analysis of biological sequences with optimization algorithms; and computational approaches for drug repurposing and personalized medicine.

Part II: Bioinformatics for healthcare and diseases; computational genomics/proteomics; computational systems for modelling biological processes; biomedical engineering; biomedical image analysis; and biomedicine and e-health.

## Table of Contents

### A Coarse-Grained Representation for Discretizable Distance Geometry with Interval Data

We propose a coarse-grained representation for the solutions of discretizable instances of the Distance Geometry Problem (DGP). In several real-life applications, the distance information is not provided with high precision, but an approximation is rather given. We focus our attention on protein instances where inter-atomic distances can be either obtained from the chemical structure of the molecule (which are exact), or through experiments of Nuclear Magnetic Resonance (which are generally represented by real-valued intervals). The coarse-grained representation allows us to extend a previously proposed algorithm for the Discretizable DGP (DDGP), the branch-and-prune (BP) algorithm. In the standard BP, atomic positions are fixed to unique positions at every node of the search tree: we rather represent atomic positions by a pair consisting of a feasible region, together with a most-likely position for the atom in this region. While the feasible region is a constant during the search, the associated position can be refined by considering the new distance constraints that appear at further layers of the search tree. To perform the refinement task, we integrate the BP algorithm with a spectral projected gradient algorithm. Some preliminary computational experiments on artificially generated instances show that this new approach is quite promising to tackle real-life DGPs.

Antonio Mucherino, Jung-Hsin Lin, Douglas S. Gonçalves

### Fragment-Based Drug Design to Discover Novel Inhibitor of Dipeptidyl Peptidase-4 (DPP-4) as a Potential Drug for Type 2 Diabetes Therapy

Diabetes mellitus is among the leading causes of death in the world. Medicinal treatment of diabetes mellitus can be achieved by inhibiting Dipeptidyl Peptidase-4 (DPP-4). This enzyme rapidly inactivates incretin, which acts as a glucoregulatory hormone in the human body. Fragment-based drug design through computational studies was conducted to discover novel DPP-4 inhibitors. About 7,470 fragments out of 343,798 natural product compounds were obtained by applying the Astex Rule of Three. Molecular docking simulation was performed on the filtered fragments against the binding site of DPP-4. Fragment-based drug design was carried out by growing new structures from the potential fragments using DataWarrior software. The generated ligand libraries were evaluated for toxicity properties before undergoing virtual screening and rigid and induced-fit molecular docking simulation. Selected ligands were subjected to pharmacological and toxicological property analysis using DataWarrior, Toxtree, and SwissADME software. Based on ligand affinity, assessed from the ∆G binding value and molecular interactions, along with the pharmacological properties of the ligands, the two best ligands, namely FGR-2 and FGR-3, were chosen as novel inhibitors of DPP-4. Further in vitro, in vivo, and clinical trial analyses must be carried out to validate the selected ligands' therapeutic activity as drug candidates for type 2 diabetes.

Eka Gunarti Ningsih, Muhammad Fauzi Hidayat, Usman Sumo Friend Tambunan
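
The Rule of Three filtering step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the descriptors have already been computed by some cheminformatics tool, uses the commonly quoted thresholds (molecular weight, cLogP, hydrogen-bond donors and acceptors, each at most 300 Da or 3), and the `Fragment` class, its field names, and the two example fragments are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Precomputed descriptors for one candidate fragment (hypothetical fields)."""
    name: str
    mol_weight: float   # molecular weight in Da
    clogp: float        # calculated logP
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def passes_rule_of_three(f: Fragment) -> bool:
    """Check the commonly quoted Rule of Three thresholds (inclusive bounds assumed)."""
    return (
        f.mol_weight <= 300.0
        and f.clogp <= 3.0
        and f.h_donors <= 3
        and f.h_acceptors <= 3
    )


def filter_fragments(fragments):
    """Keep only the fragments that satisfy every threshold."""
    return [f for f in fragments if passes_rule_of_three(f)]


# Made-up example values, for illustration only.
candidates = [
    Fragment("frag_a", 182.2, 1.4, 2, 3),
    Fragment("frag_b", 412.5, 4.1, 5, 7),
]
kept = filter_fragments(candidates)
print([f.name for f in kept])
```

Tools differ on whether the bounds are strict or inclusive and on whether rotatable bonds and polar surface area are also checked, so the thresholds above should be treated as adjustable.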

### Discovery of Novel Alpha-Amylase Inhibitors for Type II Diabetes Mellitus Through the Fragment-Based Drug Design

Diabetes mellitus is a metabolic disorder leading to hyperglycemia and organ damage. In 2017, the International Diabetes Federation (IDF) reported that about 425 million people were living with diabetes, most of whom suffer from type 2 diabetes mellitus. Developing drugs that control the glucose level is crucial for treating people with type 2 diabetes mellitus. Alpha-amylase plays an essential role in carbohydrate hydrolysis. Hence, the inhibition of alpha-amylase, which halts glucose absorption, is a promising pathway for developing type 2 diabetes mellitus drugs. Natural products are known as sources of lead drugs for various diseases. In this research, fragment-merging drug design was performed by employing both an existing drug, voglibose, as the template and natural product compounds to generate newly constructed ligands. The fragments were acquired from the ZINC15 natural product database and then screened according to Astex's Rule of Three, pharmacophore properties, and molecular docking simulation. The 482 selected fragments were evaluated against Lipinski's Rule of Five and for toxicity effects using DataWarrior software. The ligands underwent flexible molecular docking simulation followed by ADME-Tox prediction using Toxtree, AdmetSAR, and SwissADME software. In the end, two lead compounds showed the best properties as alpha-amylase inhibitors based on their low ΔG binding, acceptable RMSD score, favorable pharmacological properties, and molecular interactions.

Yulianti, Agustinus Corona Boraelis Kantale, Usman Sumo Friend Tambunan

### Compression of Nanopore FASTQ Files

The research and development of tools for genomic data compression has focused so far on data generated by second-generation sequencing technologies, while third-generation technologies, such as nanopore technologies, have received little attention in the data compression research community. In this paper, we investigate compression schemes for nanopore FASTQ files. We propose a nanopore quality scores compressor, called DualCtx, which yields significant improvements in compression performance with respect to the state of the art. We also extend DualCtx to a full FASTQ compressor, termed DualFqz, by substituting DualCtx for the quality score compression module in a variant of Fqzcomp. We tested DualFqz and various existing compressors on a large nanopore data set. The results show that DualFqz achieves the best compression performance. The experiments also show that most current implementations of compressors fail to execute correctly on files with long variable-length reads. DualCtx and DualFqz are freely available for download at https://github.com/guidufort/DualFqz.

Guillermo Dufort y Álvarez, Gadiel Seroussi, Pablo Smircich, José Sotelo, Idoia Ochoa, Álvaro Martín

### De novo Transcriptome Assembly of Solea senegalensis v5.0 Using TransFlow

Senegalese sole is an economically important flatfish species in aquaculture. The development of new bioinformatics resources allows the optimization of its breeding in fisheries. Sequencing data from larvae at different developmental stages, obtained from different sequencing platforms (more than 270 M Illumina paired-end reads and more than 3 M Roche/454 reads), were used. Due to the high complexity of the samples, an optimized version of TransFlow, an automated, reproducible and flexible framework for de novo transcriptome assembly, was used to obtain the most complete de novo transcriptome assembly. Selection of the best transcriptome was based on the principal component analysis provided by TransFlow. Two transcriptomes, one all-Illumina and the other reconciling Illumina and Roche/454, were selected and annotated using Full-LengtherNext, and the tentative transcripts were filtered by alignment to partial genomic sequences to avoid artifacts. The reconciled non-redundant assembly composed of Illumina and Roche/454 reads seems to be the best strategy. It consists of 55,440 transcripts, of which 22,683 code for 17,570 different proteins described in databases. The resulting v5.0 reduces the number of tentative transcripts by 79.33% compared with v4.0, which will increase the precision of future transcriptomic studies.

José Córdoba-Caballero, Pedro Seoane-Zonjic, Manuel Manchado, M. Gonzalo Claros

### Deciphering the Role of PKC in Calpain-CAST System Through Formal Modeling Approach

Calcium-activated calpain has a critical role in a variety of calcium-regulated processes. Calcium activates two other proteins, Calpastatin (CAST) and Protein Kinase C (PKC), to form a regulatory network that is pivotal in cell physiology. CAST binds with calpain to form a complex that hampers its hyperactivation. PKC phosphorylates CAST, while calpain proteolyzes active PKC and increases calcium influx. Based on biological knowledge, a qualitative (discrete) model is constructed that provides new insights into the dynamics of the relationship between calpain-CAST and PKC. The model predicts that PKC maintains the calpain-CAST complex by interacting with both active calpain and CAST. It is also observed that, under physiological conditions, there is homeostatic behavior between calcium, CAST and PKC. Some significant discrete cycles are also identified by analyzing the betweenness centralities of the discrete states. There is one stable state in the model in which calpain and calcium are hyperactivated while CAST and PKC are inactivated. The model is validated through a stochastic Petri net model that further reveals its quantitative dynamical behavior. Physiology is perturbed by hyperactivation of calpain, which results in the deregulation of homeostasis. Both models suggest that inhibition of calpain by CAST is a better therapeutic strategy, which requires healthy assistance from PKC. In conclusion, homeostasis of calcium, CAST and PKC is pivotal for a healthy state.

Javaria Ashraf, Jamil Ahmad, Zaheer Ul-Haq

### The Application of Machine Learning Algorithms to Diagnose CKD Stages and Identify Critical Metabolites Features

Background: Chronic kidney disease (CKD) is a progressive and heterogeneous disorder that affects kidney structure and function. It has become one of the major challenges of public health. Early-stage detection and stage-specific treatment can significantly defer or prevent the progression of CKD. Currently, clinical CKD stage diagnoses are mainly based on the glomerular filtration rate (GFR). However, there are many different equations and approaches to estimate GFR, which can cause inaccurate and contradictory results. Methods: In this study, we provided a novel method and used machine learning techniques to construct high-performance models that diagnose CKD stages without estimating GFR. Results: We analyzed a dataset of positive metabolite levels in blood samples, which were measured by mass spectrometry. We also developed a feature selection algorithm to identify the most critical and correlated metabolite features related to CKD development. Then, we used the selected metabolite features to construct improved and simplified CKD stage diagnosis models, which significantly reduced the diagnosis cost and time compared with previous prediction models. Our improved model achieved over 98% accuracy in CKD prediction. Furthermore, we applied unsupervised learning algorithms to further validate our models and results. Finally, we studied the correlations between the selected metabolite features and CKD development. The selected metabolite features provided insights into early-stage CKD diagnosis, pathophysiological mechanisms, CKD treatments, and drug development.

Bing Feng, Ying-Yong Zhao, Jiexi Wang, Hui Yu, Shiva Potu, Jiandong Wang, Jijun Tang, Yan Guo

### Expression Change Correlations Between Transposons and Their Adjacent Genes in Lung Cancers Reveal a Genomic Location Dependence and Highlights Cancer-Significant Genes

Recent studies using high-throughput sequencing technologies have demonstrated that transposable elements seem to be involved not only in the onset of some cancers but also in cancer development. However, their activity is not easy to assess due to the large number of copies present throughout the genome. In this study, the NearTrans bioinformatic workflow has been used with RNA-seq data from 16 local patients with lung cancer, 8 with adenocarcinoma and 8 with small cell lung cancer. We have found 16 TE-gene pairs significantly expressed in the first disease, and 32 TE-gene pairs in the second. Interestingly, some of the genes have been previously described as oncogenes, indicating that a normal lung cell undergoing an oncogenic change displays some transposon expression reprogramming that seems to be genome-location dependent. Supporting this is the finding that most differentially expressed transposons change their expression in the same direction as their adjacent genes, and with a similar level of change. The analysis of adjacent genes may reveal or confirm important lung cancer biomarkers as well as new insights into its molecular basis.

Macarena Arroyo, Rafael Larrosa, M. Gonzalo Claros, Rocío Bautista

### Signal Processing Based CNV Detection in Bacterial Genomes

Copy number variation (CNV) plays an important role in drug resistance in bacterial genomes. It is one of the prevalent forms of structural variation, which leads to duplications or deletions of regions of varying size across the genome. So far, most studies have been concerned with CNV in eukaryotic, mainly human, genomes. Traditional laboratory methods such as microarray genome hybridization or genotyping are losing their effectiveness with the growing number of fully sequenced genomes. Methods for CNV detection are predominantly targeted at eukaryotic sequencing data, and only a few tools are available for CNV detection in prokaryotic genomes. In this paper, we propose a CNV detection algorithm derived from state-of-the-art methods for peak detection in the signal processing domain. A modified method of GC normalization with higher resolution is also presented for the needs of CNV detection. The performance of the algorithms is discussed and analyzed.

Robin Jugas, Martin Vitek, Denisa Maderankova, Helena Skutkova

### Dependency Model for Visible Aquaphotomics

The main idea of this research is the extension of the aquaphotomics method to the visible range of the spectrum. It is well known that each chemical element has a unique pattern in its absorption of electromagnetic radiation. This pattern consists of the spectral bands absorbed by an element and is called its 'fingerprint'. The fingerprint spans a wide range of the spectrum, including the visible part. Absorption in the visible spectrum provides unique information about the elements or compounds present in water. This makes it possible to analyze the concentration of microparticles and chemical elements in water through changes in the molecular water system, presented in the form of a spectral picture of water. The results presented in this paper demonstrate the existence of a correlation between some parameters of water and its spectral characteristics.

Vladyslav Bozhynov, Pavel Soucek, Antonin Barta, Pavla Urbanova, Dinara Bekkozhayeva

### Image Based Individual Identification of Sumatra Barb (Puntigrus Tetrazona)

The paper deals with the identification of individual fish of the same species based on digital images of the fish. A proof of concept of image-based individual identification is introduced on a small group of fish. The method is completely noninvasive and can overcome the disadvantages of standard invasive identification such as tagging. The experiments proved the hypothesis that the visible patterns on the Sumatra Barb (Puntigrus tetrazona) body can be used for individual identification. In the first step, a database of 43 fish was created by taking images of the fish in different poses. Images were taken in a water-filled aquarium. After data collection, the data were processed by image processing methods to determine the features. A simple nearest neighbor classification was used to test individual identification. The accuracy of classification was 100%. The method proved the hypothesis that the visible pattern on the Sumatra Barb can be used for fully automated individual fish identification. It can substitute for the current practice of fish identification based on tagging and marking. The long-term stability of the pattern and the classification power for large fish groups should be studied in the future.

Dinara Bekkozhayeva, Mohammademehdi Saberioon, Petr Cisar
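
The nearest neighbor classification step named in this abstract can be illustrated with a short sketch. It assumes each image has already been reduced to a fixed-length numeric feature vector and that Euclidean distance is the similarity measure; neither the feature extraction nor the distance used by the authors is specified in the abstract, and the identifiers and feature values below are hypothetical.

```python
import math


def nearest_neighbor(query, gallery):
    """Return the identity whose stored feature vector is closest to `query`.

    `gallery` is a list of (identity, feature_vector) pairs; distance is
    Euclidean (an assumption made for this sketch).
    """
    best_id, best_dist = None, math.inf
    for fish_id, features in gallery:
        dist = math.dist(query, features)
        if dist < best_dist:
            best_id, best_dist = fish_id, dist
    return best_id


# Made-up feature vectors for three hypothetical individuals.
gallery = [
    ("fish_01", (0.12, 0.80, 0.33)),
    ("fish_02", (0.55, 0.20, 0.71)),
    ("fish_03", (0.90, 0.45, 0.10)),
]
print(nearest_neighbor((0.50, 0.25, 0.70), gallery))
```

With a single stored vector per individual this reduces to a lookup of the closest template; storing several images per fish simply adds more pairs with the same identity to the gallery.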

### Alignment of Sequences Allowing for Non-overlapping Unbalanced Translocations of Adjacent Factors

Unbalanced translocations take place when two unequal chromosome sub-sequences swap, resulting in an altered genetic sequence. Such large-scale gene modifications are among the most frequent chromosomal alterations, accounting for 30% of all losses of heterozygosity. However, despite their central role in genomic sequence analysis, little attention has been devoted to the problem of aligning sequences allowing for this kind of modification. In this paper we investigate the sequence alignment problem when the edit operations are non-overlapping unbalanced translocations of adjacent factors. Specifically, we present an alignment algorithm for the problem working in $\mathcal{O}(m^3)$ time and $\mathcal{O}(m^3)$ space, where $m$ is the length of the involved sequences. To the best of our knowledge, this is the first solution in the literature for the alignment problem allowing for unbalanced translocations of factors.

Simone Faro, Arianna Pavone

### Probability in HPLC-MS Metabolomics

This article highlights the importance of probabilistic methods for the analysis of HPLC-MS measurement datasets in metabolomics research. The approach is able to deal with different noise sources, and the process of probability assignment is demonstrated in its general form. Illustrative examples of the probability functions and their propagation into subsequent processing and analysis steps include precision correction, noise probability, segmentation, spectra comparison, and the effects of biomatrices on calibration curve estimation. The possible advantages of probability propagation in further data handling are also discussed.

Jan Urban

### Pipeline for Electron Microscopy Images Processing

This article summarizes the general pipeline of subtasks in the processing and analysis of electron microscopy images. The overview proceeds from data acquisition, through noise description, filtration, and segmentation, to detection. The differences from the conditions expected in macro-world imaging are emphasized. Illustrative parameterization and statistical classification are explained using an immunolabeling example.

Pavla Urbanova, Vladyslav Bozhynov, Dinara Bekkozhayeva, Petr Císař, Miloš Železný

### A Greedy Algorithm for Detecting Mutually Exclusive Patterns in Cancer Mutation Data

Some somatic mutations are reported to present mutually exclusive patterns. Efficiently extracting mutually exclusive patterns from cancer mutation data is a basic computational problem. In this article, we focus on the inter-set mutual exclusion problem, which is to group the genes into at least two sets such that the mutations in different sets are mutually exclusive. The proposed algorithm improves the calculation of the mutual exclusion score. The improved measurement considers the percentage of supporting cases, the approximate exclusivity degree and the pairwise similarities of two genes. Moreover, the proposed algorithm adopts a greedy strategy to generate the sets of genes. Unlike existing approaches, the greedy strategy considers the mutual exclusion scores between both genes and virtual genes, which benefits the selection under size restrictions. We conducted a series of experiments to verify the performance on simulation datasets and on a TCGA dataset consisting of 477 real cases with more than 10 million mutations within 28,507 genes. According to the results, our algorithm demonstrated good performance under different simulation configurations. In addition, it outperformed CoMEt, a widely accepted algorithm, in recall rate and accuracy on simulation datasets. Moreover, some of the exclusive patterns detected from the TCGA dataset were supported by published literature.

Chunyan Yang, Tian Zheng, Zhongmeng Zhao, Xinnuo He, Xuanping Zhang, Xiao Xiao, Jiayin Wang

### Qualitative Comparison of Selected Indel Detection Methods for RNA-Seq Data

RNA sequencing (RNA-Seq) provides both gene expression and sequence information, which can be exploited for a joint approach to explore cell processes in general and diseases caused by genomic variants in particular. However, the identification of insertions and deletions (indels) from RNA-Seq data, which for instance play a significant role in the development, detection, and treatment of cancer, still poses a challenge. In this paper, we present a qualitative comparison of selected methods for indel detection from RNA-Seq data. More specifically, we benchmarked two promising aligners and two filter methods on simulated as well as on real RNA-Seq data. We conclude that in cases where reliable detection of indels is crucial, e.g. in a clinical setting, the usage of our pipeline setup is superior to other state-of-the-art approaches.

Tamara Slosarek, Milena Kraus, Matthieu-P. Schapranow, Erwin Boettinger

### Structural and Functional Features of Glutathione Reductase Transcripts from Olive (Olea europaea L.) Seeds

The olive seed is a promising by-product generated in the olive oil related industries, with increasing interest because of its nutritional value and potential nutraceutical properties. Knowledge concerning the antioxidant capacity of this new alimentary material is scarce. Moreover, oxidative homeostasis and signalling involved in physiological processes such as development, dormancy and germination in the olive seed are also unknown. Glutathione, one of the most abundant antioxidants in plant cells, is crucial for seed physiology and for defense and detoxification mechanisms. The availability of glutathione in its reduced (GSH) and oxidized (GSSG) forms, the ratio of both forms (GSH/GSSG), and their concurrence in numerous other metabolic pathways are tightly regulated by numerous enzymes. Prominent among these enzymes is glutathione reductase (GR), which has been considered essential for seedling growth and development. The present work aims to increase the knowledge about the functional insights of GR in olive seeds. By searching the olive transcriptome, at least 19 GR homologues (10 from seed and 9 from vegetative tissue) were identified and retrieved. An in silico analysis was carried out, which included phylogeny, 3-D modelling of the N-terminus, and the prediction of cellular localization and post-translational modifications (PTMs) for these gene products. The high variability of forms detected for this enzyme in olive seeds and their susceptibility to numerous PTMs suggest a relevant role for this enzyme in redox metabolism and signalling events.

Elena Lima-Cabello, Isabel Martínez-Beas, Estefanía García-Quirós, Rosario Carmona, M. Gonzalo Claros, Jose Carlos Jimenez-Lopez, Juan de Dios Alché

### Prediction of Thermophilic Proteins Using Voting Algorithm

Thermophilic proteins have been widely used in food, medicine, tanning, and oil drilling. By analyzing the protein sequence, the superior structure and properties of the protein sequence are obtained, which are used to efficiently predict the protein species. In this paper, a voting algorithm was designed independently. Protein features were extracted and their dimensions were reduced. Data was predicted by WEKA. Next, the voting algorithm was applied to the data obtained by the above processing. In this experiment, the highest accuracy rate of 93.03% was achieved. This experiment has at least two advantages: First, the voting algorithm was developed independently. Second, no optimization method was used for this experiment, which prevents over-fitting. Therefore, voting is a very effective strategy for predicting the thermal stability of proteins. The prediction data set used in this paper can be freely downloaded from http://lab.malab.cn/~lijing/thermo_data.html.

Jing Li, Pengfei Zhu, Quan Zou

### Classifying Breast Cancer Histopathological Images Using a Robust Artificial Neural Network Architecture

Pathological diagnosis is the standard for the diagnosis and identification of breast malignancies. Computer-aided diagnosis (CAD) is widely applied in pathological image analysis to help pathologists improve the accuracy, efficiency, and consistency of diagnosis. Traditional CAD methods rely on expert domain knowledge and time-consuming feature engineering, which are insufficient for real-world systems. In recent studies, deep learning methods have been explored to improve the performance of pathological CAD. However, typical deep methods mainly suffer from the following limitations in pathological image classification. (i) The model cannot extract rich and informative features due to the shallow network structure. (ii) The commonly adopted patch-wise classification strategy makes it impossible to obtain the global features at the image level. To address these two issues, in this paper we propose to use a deep ResNet structure with a Convolutional Block Attention Module (CBAM) in order to extract richer and finer features from pathological images. Moreover, we abandon the patch-wise classification strategy and perform end-to-end training instead. The public BreakHis dataset is used to evaluate our proposed method. The results show that our model achieves a significant improvement over the baseline methods.

Xianli Zhang, Yinbin Zhang, Buyue Qian, Xiaotong Liu, Xiaoyu Li, Xudong Wang, Changchang Yin, Xin Lv, Lingyun Song, Liang Wang

### Spatial Attention Lesion Detection on Automated Breast Ultrasound

Automated Breast Ultrasound (ABUS) is widely applied in breast screening, mainly because of its non-invasive and radiation-free nature and its high interoperator reproducibility. Due to the complexity and high volume of data, reading ABUS images is a routine but time-consuming task for sonographers. Accordingly, computer-aided diagnosis (CAD) has been introduced to help detect breast lesions efficiently. Traditional techniques such as watershed and fuzzy c-means did not perform satisfactorily, due to their strong underlying assumptions and complex image processing. Lately, deep learning has been explored in medical image analysis. However, it often leads to high false positive rates, mainly caused by its requirement for abundant training data and the lack of domain knowledge. To address these issues, we propose a novel lesion detection framework based on the U-net segmentation architecture, and explore a novel method using a spatial feature map and attention skip connections. We retrospectively evaluate our model on the data of 142 patients with 305 lesions and 70 no-lesion volumes, and it significantly outperforms the comparison methods, with a sensitivity of 92.1% at 1.92 false positives per volume. The promising results suggest that our proposed framework is a solid tool to assist ABUS in breast screening.

Feiqian Wang, Xiaotong Liu, Buyue Qian, Litao Ruan, Rongjian Zhao, Changchang Yin, Na Yuan, Rong Wei, Xin Ma, Jishang Wei

### Essential Protein Detection from Protein-Protein Interaction Networks Using Immune Algorithm

The prediction of essential proteins in protein-protein interaction (PPI) networks plays a pivotal part in improving the understanding of biological organisms. This study presents a novel computational technique, called EPIA, to discover essential proteins by employing an immune algorithm. In EPIA, each antibody denotes a candidate essential protein set, which is initialized at random among all proteins in a PPI network. Then the vaccine is extracted based on the prediction results of existing essential protein identification methods. Next, EPIA utilizes four operators (crossover, mutation, vaccination and immune selection) to update the antibody population and search for the optimal candidate essential protein set. The experimental results on two species (Saccharomyces cerevisiae and Drosophila melanogaster) demonstrate that EPIA performs better at identifying essential proteins than other existing methods.

Xiaoqin Yang, Xiujuan Lei, Jiayin Wang

### Integrating Multiple Datasets to Discover Stage-Specific Cancer Related Genes and Stage-Specific Pathways

Investigating the evolution of complex diseases through different disease stages is critical for understanding the root cause of these diseases, which is fundamental for their accurate prognosis and effective treatment. Numerous studies have identified many single genes, static modules and individual pathways related to cancer progression, but few attempts have been made to identify specific gene and pathway interactions related to individual disease stages via data integration. To address these issues, we have proposed a general workflow to reveal disease stage dynamics by joint analysis of multi-level datasets. Our contribution is two-fold. Firstly, we present a classical regression method to identify stage-specific cancer genes, in which gene expression and DNA methylation datasets are integrated. Secondly, we construct a pathway evolution network, which considers interactions among specific mapped pathways and their overlapping genes. Interestingly, the potential biological functions discovered from this network, together with the common bridges and genes, not only help us to understand the functional evolution and dynamics of complex diseases more deeply, but are also useful for clinical management in designing customized drugs with more effective therapy.

Bolin Chen, Chaima Aouiche, Xuequn Shang

### Integrated Detection of Copy Number Variation Based on the Assembly of NGS and 3GS Data

Copy number variations (CNVs) cover 5% to 10% of the genome and are among the essential pathogenic factors of human diseases. The detection of large CNVs is still deficient. The read length of third-generation sequencing (3GS) data is longer than that of next-generation sequencing (NGS) data, which can theoretically overcome the inability to detect long variations. However, due to the low accuracy of 3GS data, it is difficult to apply in practice. To a large extent, it is a supplement to NGS data research. To solve these problems, we developed a new mutation detection tool named AssCNV23 in this paper. Firstly, this tool corrects the 3GS data to address the high error rate, and then combines the results of a variety of mutation detection tools to improve the accuracy of the initial mutation set and to reduce the detection bias of a single detection tool. At the same time, AssCNV23 uses the high-quality 3GS data to guide the assembly of the NGS data, and then detects CNVs once data of sufficient length is obtained. Finally, to improve the detection efficiency, the tool generates images containing the sequence depth information based on the read depth strategy and uses a convolutional neural network to detect the existing CNVs. The experimental results show that AssCNV23 guarantees a high level of breakpoint accuracy and performs well in identifying large variations. Compared with other tools, the deep learning model has advantages in accuracy and sensitivity, and its Matthews correlation coefficient (MCC) is good across various experiments. This algorithm is relatively reliable.

Feng Gao, Liwei Gao, JingYang Gao
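
For readers unfamiliar with the Matthews correlation coefficient (MCC) mentioned in the abstract above, the following is a minimal sketch of how it is computed from confusion-matrix counts. The function name `mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions made for illustration; the abstract does not say how AssCNV23 computes or reports the score.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (an illustrative convention,
    not taken from the paper).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it informative when true CNV regions are rare compared with the background.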

### Protein Remote Homology Detection Based on Profiles

As one of the most important tasks in protein sequence analysis, protein remote homology detection has been extensively studied for decades. Currently, profile-based methods show state-of-the-art performance. The Position-Specific Frequency Matrix (PSFM) is a widely used profile because it contains evolutionary information, which is critical for protein sequence analysis. However, the profiles contain noise introduced by amino acids with low frequencies, which are unlikely to occur at the corresponding sequence positions during the evolutionary process. In this study, we propose a method that removes this noise from the PSFM by removing the amino acids with low frequencies, yielding a new profile called the Top Frequency Profile (TFP). An auto-cross covariance (ACC) transformation is performed on the profiles to convert them into fixed-length feature vectors. The predictor is constructed by combining these vectors with Support Vector Machines (SVMs). Evaluated on a benchmark dataset, experimental results show that the proposed method outperforms other state-of-the-art predictors for protein remote homology detection, indicating that it is a useful tool for protein sequence analysis. Because profiles generated from multiple sequence alignments are important for protein structure and function prediction, the TFP will have many potential applications.

Qing Liao, Mingyue Guo, Bin Liu

### Reads in NGS Are Distributed over a Sequence Very Inhomogeneously

The distribution of read starts over a sequenced genetic entity is studied. The key question was whether the starts are distributed uniformly and homogeneously along a sequence, or whether there exist spots of increased local density of starts. To answer the question, 15 bacterial genomes were studied. It was found that some genomes exhibit a distribution pattern extremely far from homogeneity, while others show a lower level of inhomogeneity. The inhomogeneity level was determined through the Kullback-Leibler distance between the real string distribution and the one bearing the most probable continuations of the shorter strings.
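
As a rough illustration of measuring inhomogeneity with a Kullback-Leibler distance, the sketch below compares the windowed density of read starts with a uniform reference. This is a deliberate simplification and an assumption: the abstract compares a real string distribution with one reconstructed from shorter strings, which is not reproduced here. The function name `kl_from_uniform`, the window size and the 0-based positions are all hypothetical.

```python
import math
from collections import Counter


def kl_from_uniform(read_starts, genome_length, window):
    """KL divergence (in bits) of windowed read-start density from uniform.

    read_starts: 0-based start positions, each < genome_length.
    A value of 0 means starts are spread evenly across the windows.
    """
    total = len(read_starts)
    if total == 0:
        raise ValueError("no read starts given")
    counts = Counter(pos // window for pos in read_starts)
    n_windows = math.ceil(genome_length / window)
    kl = 0.0
    for i in range(n_windows):
        p = counts.get(i, 0) / total
        if p == 0:
            continue  # empty windows contribute nothing to the sum
        # The last window may be shorter, so weight the reference by width.
        width = min(window, genome_length - i * window)
        q = width / genome_length
        kl += p * math.log2(p / q)
    return kl
```

Larger values indicate that read starts concentrate in a few windows; the absolute scale depends on the chosen window size, so values are only comparable at a fixed window.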

### Differential Expression Analysis of ZIKV Infected Human RNA Sequence Reveals Potential Genetic Biomarkers

Zika virus (ZIKV) infection is considered an emerging viral outbreak of alarming concern due to its link to human diseases such as microcephaly and Guillain-Barré syndrome. In this study, we applied our reproducible RNA-seq analysis pipeline to quantify RNA-seq data in terms of transcripts, and obtained common expression results from the intersection of three differential expression identification tools. This uncovered significant differentially expressed genes (DEGs) of high, moderate and low consensus. Moreover, the highly significant DEGs included six transcription factors, which may be involved in the altered biological processes. The presented study provides researchers with a highly reproducible pipeline for viral studies as well as novel computational findings on the transcription factors (TFs) involved in ZIKV infection, which could enable researchers to develop new therapeutic strategies to tackle the infection.
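
The consensus idea described above can be sketched as a simple vote across tools. The mapping used here (three tools agree: high, two: moderate, one: low) is an assumption about what the consensus levels mean; the abstract does not define them, and the function name and gene identifiers are hypothetical placeholders.

```python
from collections import Counter


def consensus_levels(tool_a, tool_b, tool_c):
    """Group DEGs by how many of three tools reported them as significant."""
    votes = Counter()
    for degs in (tool_a, tool_b, tool_c):
        votes.update(set(degs))  # each tool votes at most once per gene
    labels = {3: "high", 2: "moderate", 1: "low"}
    levels = {"high": set(), "moderate": set(), "low": set()}
    for gene, n in votes.items():
        levels[labels[n]].add(gene)
    return levels
```

For example, with placeholder gene names, `consensus_levels({"G1", "G2"}, {"G1", "G3"}, {"G1", "G2"})` places `G1` in the high group, `G2` in the moderate group and `G3` in the low group.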

### Identification of Immunoglobulin Gene Usage in Immune Repertoires Sequenced by Nanopore Technology

The immunoglobulin receptor represents a central molecule in acquired immunity. The complete set of immunoglobulins present in an individual is known as the immunological repertoire. The identification of this repertoire is particularly relevant in immunology and in cancer research and diagnostics. In a seminal work we provided a proof of concept of the novel ARTISAN-PCR amplification method; we have adapted this technology for sequencing using Nanopore technology. This approach may represent a faster, more portable and cost-effective alternative to current methods. In this study we present the pipeline for the analysis of immunological repertoires obtained by this approach. This paper shows the performance of immune repertoires sequenced by Nanopore technology, using measures of error, coverage and gene usage identification. In the bioinformatic methodology used in this study, the Albacore base-calling software was first used to translate the electrical signal of Nanopore into DNA bases. Subsequently, the sequons introduced during amplification were aligned using bl2seq from Blast. Finally, selected reads were mapped using IMGT/HighV-QUEST and IgBlast. Our results demonstrate the feasibility of immune repertoire sequencing by Nanopore technology, obtaining higher depth than PacBio sequencing and better coverage than paired-end based technologies. However, the high rate of systematic errors indicates the need for improvements in the analysis pipeline, sequencing chemistry and/or molecular amplification.

Roberto Ahumada-García, Jorge González-Puelma, Diego Álvarez-Saravia, Ricardo J. Barrientos, Roberto Uribe-Paredes, Xaviera A. López-Cortés, Marcelo A. Navarrete

### Flexible and Efficient Algorithms for Abelian Matching in Genome Sequence

Approximate matching in strings is a fundamental and challenging problem in computer science and in computational biology, and increasingly fast algorithms are in high demand in many applications, including text processing and DNA sequence analysis. Recently, efficient solutions to specific approximate matching problems on genomic sequences have been designed using a filtering technique, based on the general abelian matching problem, which first locates the set of all candidate matching positions and then performs an additional verification test on the collected positions. The abelian pattern matching problem consists in finding all substrings of a text which are permutations of a given pattern. In this paper we present a new class of algorithms based on a new efficient fingerprint computation approach, called Heap-Counting, which turns out to be fast, flexible and easy to implement. We prove that, when applied to searching for short patterns in a DNA sequence, our solutions have linear worst-case time complexity. In addition, we present an experimental evaluation which shows that our newly presented algorithms are among the most efficient and flexible solutions in practice for the abelian matching problem in DNA sequences.

Simone Faro, Arianna Pavone
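
To make the abelian matching problem defined above concrete, here is a minimal sliding-window baseline that reports every position where a substring is a permutation of the pattern. It is a plain character-counting approach written for illustration, not the Heap-Counting fingerprint method of the paper; the function name `abelian_matches` is an assumption.

```python
from collections import Counter


def abelian_matches(text: str, pattern: str) -> list[int]:
    """Return 0-based start positions of substrings that are permutations of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    need = Counter(pattern)
    window = Counter(text[:m])
    hits = [0] if window == need else []
    for i in range(m, n):
        # Slide the window one character to the right.
        window[text[i]] += 1
        out = text[i - m]
        window[out] -= 1
        if window[out] == 0:
            del window[out]  # keep the counter free of zero entries
        if window == need:
            hits.append(i - m + 1)
    return hits
```

On the four-letter DNA alphabet each window comparison costs constant time, so this baseline already runs in time linear in the text length; fingerprint-based methods aim to make each step cheaper still.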

### Analysis of Gene Regulatory Networks Inferred from ChIP-seq Data

Computational network biology aims to understand cell behavior through complex network analysis. The Chromatin ImmunoPrecipitation sequencing (ChIP-seq) technique allows interrogating the physical binding interactions between proteins and DNA using Next-Generation Sequencing. Taking advantage of this technique, in this study we propose a computational framework to analyze gene regulatory networks built from ChIP-seq data. We focus on two different cell lines: GM12878, a normal lymphoblastoid cell line, and K562, an immortalised myelogenous leukemia cell line. In the proposed framework, we preprocessed the data, derived network relationships in the data, analyzed their network properties, and identified differences between the two cell lines through network comparison analysis. Throughout our analysis, we identified known cancer genes and other genes that may play important roles in chronic myelogenous leukemia.

Eirini Stamoulakatou, Carlo Piccardi, Marco Masseroli

### Function vs. Taxonomy: The Case of Fungi Mitochondria ATP Synthase Genes

We studied the relations between the triplet composition of the family of mitochondrial atp6, atp8 and atp9 genes, their function, and the taxonomy of their bearers. The points in 64-dimensional metric space corresponding to the genes were clustered. It was found that the points separate into three clusters corresponding to those genes. In total, 223 mitochondrial genomes were included in the database.

Michael Sadovsky, Victory Fedotovskaya, Anna Kolesnikova, Tatiana Shpagina, Yulia Putintseva

### Non-Coding Regions of Chloroplast Genomes Exhibit a Structuredness of Five Types

We studied the statistical properties of non-coding regions of chloroplast genomes of 391 plants. To do that, each non-coding region was tiled with a set of overlapping fragments of the same length, and those fragments were transformed into triplet frequency dictionaries. The dictionaries were clustered in 64-dimensional Euclidean space. Five types of distribution were identified: ball, ball with tail, ball with two tails, lens with tail, and lens with two tails. In addition, the multi-genome distribution was studied: ten species form an isolated and distant cluster; surprisingly, there is no immediate and simple relation to the taxonomic composition of these clusters.
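
The tiling and triplet-dictionary steps described above can be sketched as follows. Counting triplets at every position (a reading-frame shift of one nucleotide), skipping triplets with ambiguous bases, and the fragment length and step are all assumptions made for illustration; the abstract does not state the parameters used.

```python
from collections import Counter
from itertools import product

# The 64 triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def tile(sequence: str, length: int, step: int) -> list[str]:
    """Cut a sequence into fragments of equal length; they overlap when step < length."""
    return [sequence[i:i + length] for i in range(0, len(sequence) - length + 1, step)]


def triplet_frequencies(fragment: str) -> list[float]:
    """Map a fragment to a point in 64-dimensional triplet-frequency space."""
    counts = Counter(fragment[i:i + 3] for i in range(len(fragment) - 2))
    # Only the 64 unambiguous triplets are counted; others (e.g. with N) are skipped.
    total = sum(counts[t] for t in TRIPLETS)
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]
```

Each fragment becomes one 64-dimensional vector whose entries sum to 1, and these vectors are the points that a clustering method would then group.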

### Characteristics of Protein Fold Space Exhibits Close Dependence on Domain Usage

With the growth of the PDB and simultaneous slowing of the discovery of new protein folds, we may be able to answer the question of how discrete protein fold space is. Studies by Skolnick et al. (PNAS, 106, 15690, 2009) have concluded that it is in fact continuous. In the present work we extend our initial observation (PNAS, 106(51) E137, 2009) that this conclusion depends upon the resolution with which structures are considered, making the determination of what resolution is most useful of importance. We utilize graph theoretical approaches to investigate the connectedness of the protein structure universe, showing that the modularity of protein domain architecture is of fundamental importance for future improvements in structure matching, impacting our understanding of protein domain evolution and modification. We show that state-of-the-art structure superimposition algorithms are unable to distinguish between conformational and topological variation. This work is not only important for our understanding of the discreteness of protein fold space, but informs the more critical question of what precisely should be spatially aligned in structure superimposition. The metric-dependence is also investigated leading to the conclusion that fold usage in homology reduced datasets is very similar to usage across all of PDB and should not be ignored in large scale studies of protein structure similarity.

Michael T. Zimmermann, Fadi Towfic, Robert L. Jernigan, Andrzej Kloczkowski

### Triplet Frequencies Implementation in Total Transcriptome Analysis

We studied the structuredness of the total transcriptome of Siberian larch. To do that, the contigs from the total transcriptome were labeled with the reads comprising the tissue-specific transcriptomes, and the distribution of these contigs was obtained with respect to the mutual entropy of the frequencies of occurrence of reads from the tissue-specific transcriptomes. It was found that a number of contigs contain comparable amounts of reads from different tissues, so chimeric transcripts appear to be extremely abundant. In contrast, the transcripts with high tissue specificity do not yield a reliable clustering that reveals the tissue specificity. This fact makes the use of the total transcriptome for differential expression analysis questionable.

### A Hierarchical and Scalable Strategy for Protein Structural Classification

Protein function prediction is a relevant but challenging task, as protein structural data are large and complex. With the increase in available biological data, there is a demand for computational methods to annotate this data deluge and help us make sense of it. Here we propose a model and a data-mining-based strategy to perform protein structural classification. We are particularly interested in hierarchical classification schemes. To evaluate the proposed strategy, we conduct three experiments using as input protein structural data from biological databases (CATH, SCOPe and BRENDA). Each dataset is associated with a well-known hierarchical classification scheme (CATH, SCOP, EC number). We show that our model's accuracy ranges from 86% to 95% when predicting CATH, SCOP and EC number levels. To the best of our knowledge, ours is the first work to reach such high accuracy when dealing with very large data sets.

Vinício F. Mendes, Cleiton R. Monteiro, Giovanni V. Comarela, Sabrina A. Silveira

### Protein Structural Signatures Revisited: Geometric Linearity of Main Chains are More Relevant to Classification Performance than Packing of Residues

A structural signature is a set of characteristics that unequivocally identifies protein folding and the nature of interactions with other proteins or binding compounds. We investigate the use of the geometric linearity of the main chain as a key feature for structural classification. Using polypeptide main chain atoms as the structural signature, we showed that this signature classifies more precisely than one using C$$\alpha$$ atoms only. Our results are equivalent in precision to a structural signature built by including artificial points between C$$\alpha$$ atoms, and hence we believe this improvement in classification precision is due to the strengthening of geometric linearity.

João Arthur F. Gadelha Campelo, Cleiton Rodrigues Monteiro, Carlos Henrique da Silveira, Sabrina de Azevedo Silveira, Raquel Cardoso de Melo-Minardi

### Positioning Method for Arterial Blood Pressure Monitoring Wearable Sensor

Measuring blood pressure in real time using wearable sensors mounted directly on the patient’s body is a promising tool for assessing the state of the cardiovascular system and signalling symptoms of cardiovascular diseases. To this end, we developed a new type of wearable arterial blood pressure monitoring sensor. By design, this sensor can be embedded in a flexible bracelet for measuring the pressure in the underlying radial artery. Due to the very small measuring pads (less than 1 mm$$^{2}$$) and, consequently, the ability to accurately position the contact pad directly over the artery, it is possible to ensure high-quality blood pressure measurement. However, since the artery itself is generally not visible, the correct positioning of the sensor is a non-trivial problem. In this paper we propose a solution to this problem: positioning based on monitoring the pulse wave signals using three channels from closely spaced pads of a three-chamber pneumatic sensor.

### Study of the Detection of Falls Using the SVM Algorithm, Different Datasets of Movements and ANOVA

Falls are becoming a major public health problem, which is intensified by the aging of the population. Falls are one of the main causes of death among the elderly and in population groups that engage in risky activities. In this sense, technologies can provide solutions to improve this situation. In this work we have analyzed different repositories of movements and falls designed to test decision algorithms in automatic fall detection systems. The objectives of the study are, first, to clarify which characteristics of the accelerometry signals are most significant for identifying a fall and, second, to analyze the possibility of extrapolating the learning achieved with one database when it is tested with another. As a novelty with respect to other works in the literature, the statistical significance of the results has been systematically evaluated by analysis of variance (ANOVA).

José Antonio Santoyo-Ramón, Eduardo Casilari-Pérez, José Manuel Cano-García

### Influence of Illuminance on Sleep Onset Latency in IoT Based Lighting System Environment

Exposure to light has a great influence on human beings in their everyday life. Various lighting sources produce light that reaches the human eye and influences the rhythmic release of the hormone melatonin, which is a sleep-promoting factor. Since the development of new technologies provides more control over illuminance, this work uses an IoT-based lighting system to set up dim and bright scenarios. A small study has been performed on the influence of illuminance on sleep latency. The system consists of different light bulbs, sensors and a central bridge, which are interconnected as a mesh network. In addition, a mobile app has been developed that allows the lighting in various rooms to be adjusted. A subject’s sleep was monitored with the help of a ferro-electret sensor, as used in sleep monitoring systems. The sensor is placed below the mattress and collects data, which are stored and processed in a cloud or in other alternative locations. The research was conducted on healthy young subjects who had been exposed to the preconfigured illuminance for at least three hours before bedtime. The results indicate a correlation between sleep onset latency and exposure to different illuminance before bedtime. In a dimmed environment, the subjects fell asleep on average 28% faster than in the brighter environment.

Mislav Jurić, Maksym Gaiduk, Ralf Seepold

### Efficient Online Laplacian Eigenmap Computation for Dimensionality Reduction in Molecular Phylogeny via Optimisation on the Sphere

Reconstructing the phylogeny of large groups of large, divergent genomes remains a difficult problem, whatever the methods considered. Methods based on distance matrices are blocked because computing these matrices is infeasible in practice, while Bayesian inference and maximum likelihood methods presuppose a multiple alignment of the genomes, which is itself difficult to achieve if precision is required. In this paper, we propose to calculate new distances for randomly selected pairs of species over iterations, and then to map the biological sequences into a space of small dimension based on the partial knowledge of this genome similarity matrix. This mapping is then used to obtain a complete graph from which a minimum spanning tree representing the phylogenetic links between species is extracted. This new online Newton method for the computation of eigenvectors, which solves the problem of constructing the Laplacian eigenmap for molecular phylogeny, is finally applied to a set of more than two thousand complete chloroplasts.

Stéphane Chrétien, Christophe Guyeux
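
The final step mentioned in the abstract above, extracting a minimum spanning tree from a complete graph of pairwise distances, can be sketched with Prim's algorithm. This covers only the tree extraction; the online Newton eigenvector computation and the Laplacian eigenmap are not reproduced. The function name and the dense, symmetric matrix input are assumptions for illustration.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix.

    Returns a list of edges (parent, child, weight) forming the tree.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges
```

This dense version takes time quadratic in the number of species, which is a natural fit for a complete graph where every pair has a distance.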

### PROcket, an Efficient Algorithm to Predict Protein Ligand Binding Site

To carry out functional annotation of proteins, the most crucial step is to identify the ligand binding site (LBS) information. Although several algorithms have been reported to identify the LBS, most have limited accuracy and efficiency given the number and type of geometrical and physico-chemical features used for such predictions. In this work, a fast and accurate algorithm, “PROcket”, has been implemented and is discussed. The algorithm uses a grid-based approach to cluster the local residue neighbors that are present on the solvent-accessible surface of proteins. With the further inclusion of selected physico-chemical properties and phylogenetically conserved residues, the algorithm enables accurate detection of the LBS. A comparative study with the well-known tools LIGSITE, LIGSITECS, PASS and CASTp was performed to analyze the performance of our tool. A set of 48 ligand-bound protein structures from different families was used to compare the performance of the tools. The PROcket algorithm outperformed the existing methods in terms of quality and processing speed, with 91% accuracy when considering the top 3 ranked pockets and 98% accuracy when considering the top 5 ranked pockets.

Rahul Semwal, Imlimaong Aier, Pritish Kumar Varadwaj, Slava Antsiperov

### Gene Expression High-Dimensional Clustering Towards a Novel, Robust, Clinically Relevant and Highly Compact Cancer Signature

Precision medicine, a highly disruptive paradigm shift in healthcare targeting personalized treatment, relies heavily on genomic data. However, the complexity of the biological interactions, the large number of genes and the lack of substantial clinical data on patients constitute a tremendous bottleneck for the clinical implementation of precision medicine. In this work, we introduce a generic, low-dimensional gene signature that adequately represents the tumor type. Our gene signature is produced using the LP-stability algorithm, a high-dimensional center-based unsupervised clustering algorithm working in the dual domain, and is very versatile as it can consider any arbitrary distance metric between genes. The gene signature produced by LP-stability reports at least 10 times better statistical significance and $$35\%$$ better biological significance than the ones produced by two reference unsupervised clustering methods. Moreover, our experiments demonstrate that our low-dimensional biomarker (27 genes) significantly surpasses existing state-of-the-art methods in terms of both qualitative and quantitative assessment, while providing better associations to tumor types than methods widely used in the literature that rely on several omics data.

Enzo Battistella, Maria Vakalopoulou, Théo Estienne, Marvin Lerousseau, Roger Sun, Charlotte Robert, Nikos Paragios, Eric Deutsch

### When Mathematics Outsmarts Cancer

Mathematics has become essential in cancer biology. Recent developments in high-throughput molecular profiling techniques enable assessing molecular states of tumors in great detail. Cancer genome data are collected at a large scale in numerous clinical studies and in international consortia, such as The Cancer Genome Atlas and the International Cancer Genome Consortium. Developing mathematical models that are consistent with and predictive of the true underlying biological mechanisms is a central goal of cancer biology. In this work, we used percolations and power-law models to study protein-protein interactions in cancer fusions. We used site-directed knockouts to understand the modular components of fusion protein-protein interaction networks, thereby providing models for target-based drug predictions.

Somnath Tagore, Milana Frenkel-Morgenstern

### Influence of the Stochasticity in the Model on the Certain Drugs Pharmacodynamics

In this paper I analyze the impact of stochasticity at three different levels (genes, mRNA and protein) on the pharmacodynamics of a large class of drugs. I focus on the basic mechanisms underlying the dose-response curves, considering two elementary molecular circuits. Both consist of gene activation/deactivation, followed by gene transcription and translation into the corresponding protein. In the first circuit, gene activation and deactivation are spontaneous, whereas in the second circuit the gene deactivation rate depends on the protein level, introducing negative feedback. In both cases the drug is assumed to enhance the protein degradation rate, and the therapy is considered successful if it lowers the protein level below a given threshold for a given time. My numerical simulations show that the level at which stochasticity is introduced into the model (none, genes, mRNA, protein) influences not only the shape of the dose-response curves but also the value of the critical dose, i.e. the dose which causes a positive response to the therapy in at least half of the cells.

Krzysztof Puszynski

### Graph Model for the Identification of Multi-target Drug Information for Culinary Herbs

Drug discovery strategies based on natural products are re-emerging as a promising approach. Due to their multi-target therapeutic properties, natural compounds in herbs produce greater levels of efficacy with fewer adverse effects and less toxicity than monotherapies using synthetic compounds. However, the study of these medicinal herbs, which feature multiple components and multiple targets, requires an understanding of complex relationships, which is one of the fundamental goals in the discovery of drugs using natural products. Relational database systems such as MySQL and Oracle store data in multiple tables, which are less efficient when the data, such as those on natural compounds, contain many relationships requiring several joins of large tables. Recently, there has been a noticeable paradigm shift to NoSQL databases, especially graph databases, which were developed to natively represent complex, high-throughput, dynamic relations. In this paper, we demonstrate the feasibility of using a graph-based database to capture the dynamic biological relationships of natural plant products by comparing the performance of MySQL and Neo4j, one of the most widely used NoSQL graph databases. Using this approach, we have developed a graph database, HerbMicrobeDB (HbMDB), and integrated herbal drug information, herb targets, metabolic pathways, gut-microbial interactions and bacterial-genome information from several existing resources. This NoSQL database contains 1,975,863 nodes, 3,548,314 properties and 2,511,747 edges. While probing the database and testing the complex query execution performance of MySQL versus Neo4j, the latter outperformed MySQL and exhibited a very fast response for complex queries, whereas MySQL displayed delayed or unfinished responses for complex queries with multiple-join statements. We discuss the information convergence of pharmacochemistry, bioactivities, drug targets and interaction networks for 24 culinary herbs and the human gut microbiome. All the herbs studied contain compounds capable of targeting a minimum of 55 and a maximum of 250 enzymes involved in biochemical pathways important in disease pathology.

Suganya Chandrababu, Dhundy Bastola

### On Identifying Candidates for Drug Repurposing for the Treatment of Ulcerative Colitis using Gene Expression Data

The notion of repurposing existing drugs to treat both common and rare diseases has gained traction in both academia and pharmaceutical companies. Given the high attrition rates and the massive time, money and effort required for brand-new drug development, the advantages of drug repurposing in terms of lower costs and shorter development time have become more appealing. Computational drug repurposing is a promising approach and has shown great potential in tailoring genomic findings to the development of treatments for diseases. However, there are still challenges involved in building a standard computational drug repurposing solution for high-throughput analysis and in its implementation in clinical practice. In this study, we applied computational drug repurposing approaches to Ulcerative Colitis (UC) patients to provide better treatment for this disabling disease. Repositioning drug candidates were identified, and these findings provide potentially effective therapeutics for the treatment of UC patients. This preliminary computational drug repurposing pipeline will be extended in the near future to help realize the full potential of drug repurposing.

Suyeon Kim, Ishwor Thapa, Ling Zhang, Hesham Ali

### Backmatter
