
## About this book

This two-volume set, LNBI 10813 and LNBI 10814, constitutes the proceedings of the 6th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2018, held in Granada, Spain, in April 2018. The 88 regular papers presented were carefully reviewed and selected from 273 submissions. The scope of the conference spans the following areas: bioinformatics for healthcare and diseases; bioinformatics tools to integrate omics dataset and address biological question; challenges and advances in measurement and self-parametrization of complex biological systems; computational genomics; computational proteomics; computational systems for modelling biological processes; drug delivery system design aided by mathematical modelling and experiments; generation, management and biological insights from big data; high-throughput bioinformatic tools for medical genomics; next generation sequencing and sequence analysis; interpretable models in biomedicine and bioinformatics; little-big data: reducing the complexity and facing uncertainty of highly underdetermined phenotype prediction problems; biomedical engineering; biomedical image analysis; biomedical signal analysis; challenges in smart and wearable sensor design for mobile health; and healthcare and diseases.

## Table of contents

### Trends in Online Biomonitoring

We are living in a digital world, and the Internet of Things is our latest and still ongoing revolution [1]. In the last ten years, the number of devices connected to the internet has increased tenfold. This revolution is happening mainly in industry, driven by efficiency, time and cost. Just a step behind this industrial revolution, another branch of this new market is rising: biomonitoring [2]. Even people who are not involved in research or industry now face the changes brought by online bioindicators. Devices for monitoring cardiovascular activity during sports, work and illness are increasingly common. These personal devices usually serve as a first indicator of a situation that is dangerous to health. Even personal electroencephalography devices have become popular in recent years; with such a device, the user can control other devices or processes using only their thoughts [3]. These new methods and products are not usually used in fishery research. We introduce a novel approach to non-invasive online monitoring of aquatic organisms. This research combines cybernetics, biophysics and zoology. Methods developed primarily for aquatic organisms are widely used in early warning systems; from this point of view, the animals can be regarded as living sensors. This article reviews three non-invasive online biomonitoring methods and one system for online monitoring of crucial water parameters.

Antonín Bárta, Pavel Souček, Vladyslav Bozhynov, Pavla Urbanová, Dinara Bekkozhayeova

### SARAEasy: A Mobile App for Cerebellar Syndrome Quantification and Characterization

The assessment of latent variables in neurology is mostly achieved using clinical rating scales. Mobile applications can simplify the use of rating scales, providing a quicker quantitative evaluation of these latent variables. However, most health mobile apps do not validate user input, make mistakes in their recommendations, and are not sufficiently transparent in the way they are run. The goal of the paper was to develop a novel mobile app for cerebellar syndrome quantification and clinical phenotype characterization. SARAEasy is based on the Scale for Assessment and Rating of Ataxia (SARA), and it incorporates the clinical knowledge required to interpret the patient status through the identified phenotypic abnormalities. The quality of the clinical interpretation achieved by the app was evaluated using data records from anonymous patients suffering from SCA36, and the functionality and design were assessed through a usability survey. Our study shows that SARAEasy is able to automatically generate high-quality patient reports that summarize the set of phenotypic abnormalities explaining the achieved cerebellar syndrome quantification. SARAEasy offers low-cost cerebellar syndrome quantification and interpretation for research and clinical purposes, and may help to improve evaluation.

Haitham Maarouf, Vanessa López, Maria J. Sobrido, Diego Martínez, Maria Taboada

### Case-Based Reasoning Systems for Medical Applications with Improved Adaptation and Recovery Stages

Case-Based Reasoning (CBR) systems are in constant evolution; accordingly, this article proposes improving the retrieval and adaptation stages through a different approach. A series of experiments was carried out, divided into three sections: a proper pre-processing technique, a cascade classification, and a probability estimation procedure. Every stage offers an improvement: a better data representation, a more efficient classification, and a more precise probability estimation provided by a Support Vector Machine (SVM) estimator compared with more common approaches. In conclusion, more complex techniques for classification and probability estimation are possible and improve CBR system performance through lower classification error in general cases.

X. Blanco Valencia, D. Bastidas Torres, C. Piñeros Rodriguez, D. H. Peluffo-Ordóñez, M. A. Becerra, A. E. Castro-Ospina

### Constructing a Quantitative Fusion Layer over the Semantic Level for Scalable Inference

We present a methodology and a corresponding system to bridge the gap between prioritization tools with a fixed target and unrestricted semantic queries. We describe the advantages of an intermediate level of networks of similarities and relevances: (1) it is derived from raw, linked data; (2) it ensures efficient inference over partial, inconsistent and noisy cross-domain, cross-species linked open data; (3) the preserved transparency and decomposability of the inference allow semantic filters and preferences to control and focus the inference; (4) high-dimensional, weakly significant evidence, such as overall summary statistics, can also be used in the inference; (5) quantitative and rank-based inference primitives can be defined; (6) queries are unrestricted, e.g. in the prioritized variables; and (7) it allows wider access for non-technical experts. We provide a step-by-step guide for the methodology using a macular degeneration model, including drug, target and disease domains. The system and the model presented in the paper are available at bioinformatics.mit.bme.hu/QSF.

Andras Gezsi, Bence Bruncsics, Gabor Guta, Peter Antal

### Effects of External Voltage in the Dynamics of Pancreatic β-Cells: Implications for the Treatment of Diabetes

Studies of the influence of exposure to electric and magnetic fields on pancreatic islets are still scarce and controversial, and existing studies are difficult to compare because of the different research methods employed. Here, computational simulations were used to study the burst patterns of pancreatic beta cells exposed to constant voltage pulses. Results show that burst patterns in pancreatic beta cells depend on the applied voltage and that some voltages may even inhibit this phenomenon. There are critical voltages, such as 2.16 mV, at which the burst changes from a medium to a slow oscillation phase, or 3.5 mV, which induces a transition of the burst from the slow to the fast oscillation phase. Voltage pulses higher than 3.5 mV lead to the extinction of bursts and, therefore, inhibit the process of insulin secretion. These results are reinforced by phase plane analysis.

Ramón E. R. González, José Radamés Ferreira da Silva, Romildo Albuquerque Nogueira

### ISaaC: Identifying Structural Relations in Biological Data with Copula-Based Kernel Dependency Measures

The goal of this paper is to develop a novel statistical framework for inferring dependence between distributions of variables in omics data. We propose building a dependence network using copula-based kernel dependency measures to reconstruct the underlying association network between the distributions. ISaaC is used for reverse-engineering gene regulatory networks and is competitive with several state-of-the-art gene regulatory inference methods on the DREAM3 and DREAM4 Challenge datasets. An open-source implementation of ISaaC is available at https://bitbucket.org/HossamAlmeer/isaac/.

Hossam Al Meer, Raghvendra Mall, Ehsan Ullah, Nasreddine Megrez, Halima Bensmail

### Inspecting the Role of PI3K/AKT Signaling Pathway in Cancer Development Using an In Silico Modeling and Simulation Approach

The PI3K/AKT signaling pathway plays a crucial role in the control of functions related to cancer biology, including cellular proliferation, survival, migration, angiogenesis and apoptosis, which makes this signaling pathway one of the main processes involved in cancer development. The analysis and prediction of anticancer targets acting on the PI3K/AKT signaling pathway require a deep understanding of its signaling elements, the complex interactions that take place between them, and the global behaviors that arise as a result, that is, a systems biology approach. Following this methodology, in this work we propose an in silico modeling and simulation approach to the PI3K class I and III signaling pathways, for exploring their effect on AKT and SGK proteins, their relationship with deregulated growth control in cancer, and their role in metastasis, as well as for identifying possible control points. The in silico approach provides symbolic abstractions and accurate algorithms that allow dealing with crucial aspects of cellular signal transduction such as compartmentalization, topology and timing. Our results show that the activation or inhibition of target signaling elements in the overall signaling pathway can change the outcome of the cell, turning it towards apoptosis or proliferation.

Pedro Pablo González-Pérez, Maura Cárdenas-García

### Cardiac Pulse Modeling Using a Modified van der Pol Oscillator and Genetic Algorithms

This paper proposes an approach for modeling cardiac pulses from electrocardiographic (ECG) signals. A modified van der Pol oscillator model (mvP) is analyzed which, under a proper configuration, is capable of describing action potentials and can therefore be adapted for modeling a normal cardiac pulse. Adequate parameters of the mvP system response are estimated using non-linear dynamics methods such as dynamic time warping (DTW). In order to represent an adaptive response for each individual heartbeat, a parameter tuning optimization method is applied, based on a genetic algorithm that generates responses that morphologically resemble real ECG. This feature is particularly relevant since heartbeats have intrinsically strong variability in terms of both shape and length. Experiments are performed on real ECG recordings from the MIT-BIH Arrhythmia Database. The application of the optimization process shows that the mvP oscillator can be used properly to model the ideal cardiac rate pulse.

Fabián M. Lopez-Chamorro, Andrés F. Arciniegas-Mejia, David Esteban Imbajoa-Ruiz, Paul D. Rosero-Montalvo, Pedro García, Andrés Eduardo Castro-Ospina, Antonio Acosta, Diego Hernán Peluffo-Ordóñez
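
The abstract above names dynamic time warping (DTW) as a way to compare a model response with a real heartbeat. As a minimal sketch, assuming the textbook DTW recurrence with an absolute-difference local cost (the paper's exact cost function, mvP equations and genetic algorithm settings are not given here), a DTW distance that could serve as a fitness value might look like this. The function name `dtw_distance` is illustrative only.

```python
from math import inf


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference of samples.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because DTW tolerates local stretching, two beats with the same shape but different lengths can still score as close, which matches the abstract's remark that heartbeats vary in both shape and length.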

### Visible Aquaphotomics Spectrophotometry for Aquaculture Systems

Water quality is an important issue for the environment as well as for aquaculture, regardless of the system in use: water treatment systems, water supply systems, pond treatment systems or aquaponics systems. For purposes ranging from simple water monitoring in maintenance, regulation, control and optimization to behavior models in biometrics, biomonitoring, biophysics and bioinformatics, it is necessary to observe a wide range of variables. This article discusses and describes a method of biomonitoring called aquaphotomics. Aquaphotomics is a term introduced to define the application of spectrophotometry in the near-infrared (NIR) region in order to understand the influence of water on the structure and function of biological systems. Currently, aquaphotomics is focused on the NIR part of the light spectrum, while we want to broaden this investigation to also include the visible part.

Vladyslav Bozhynov, Pavel Soucek, Antonin Barta, Pavla Urbanova, Dinara Bekkozhayeva

### Resolution, Precision, and Entropy as Binning Problem in Mass Spectrometry

The analysis of mass spectra depends on the initial estimation of resolution and precision. The novel method of relative entropy combines the detection of false precision, the statistical binning problem, and the change of information content into one task. The methodological approach and the relevant objectives are discussed in the first two parts of the work, including the mathematical justification. The method of relative entropy gives results comparable to false precision detection, although it uses a different approach. The solution to the binning problem is estimated by maximizing the relative entropy as a criterion parameter for objective magnitude rounding. The approach is verified on real high-resolution measurements with a known presence of false precision. The method could be generalized to a wider spectrum of data binning and precision tasks.

Jan Urban

### Discrimination Between Normal Driving and Braking Intention from Driver’s Brain Signals

Advanced driver-assistance systems (ADAS) are in-car technologies that record and process vehicle and road information to take actions that reduce the risk of collision. These technologies, however, do not use information obtained directly from the driver, such as brain activity. This work proposes the recognition of brake intention using the driver’s electroencephalographic (EEG) signals recorded in real driving situations. Five volunteers participated in an experiment that consisted of driving a car and braking in response to a visual stimulus. The drivers’ EEG signals were collected and used to assess two classification scenarios: pre-stimulus vs post-stimulus and no-braking vs brake-intention. Classification results showed across-all-participants accuracies of 85.2 ± 5.7% and 79 ± 9.1%, respectively, which are above the chance level. Further analysis of the second scenario showed that the true positive rate (77.1%) and the true negative rate (79.3%) were very similar, which indicates no bias in the classification between no-braking and brake-intention. These results show that a driver’s EEG signals can be used to detect brake intention, which could be useful for taking actions to avoid potential collisions.

Efraín Martínez, Luis Guillermo Hernández, Javier Mauricio Antelis
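
The abstract above compares the true positive rate and the true negative rate to argue that the classifier is not biased towards either class. A minimal sketch of how these two rates are computed from predicted and true labels follows; the function name and the toy labels are hypothetical and are not data from the study.

```python
def true_positive_and_negative_rate(y_true, y_pred, positive=1):
    """Return (TPR, TNR) for a binary classification result."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive:
            # Positive cases: correctly detected (TP) or missed (FN).
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            # Negative cases: correctly rejected (TN) or false alarm (FP).
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Hypothetical toy labels: 1 = brake-intention, 0 = no-braking.
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 0, 0, 0, 0, 1]
tpr, tnr = true_positive_and_negative_rate(toy_truth, toy_predictions)
```

When the two rates are close, as reported in the abstract, errors are spread roughly evenly over both classes rather than concentrated in one.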

### Unsupervised Parametrization of Nano-Objects in Electron Microscopy

The observation of nano-sized objects in electron microscopy demands automated evaluation of the captured images. The analysis of the digital images should focus on object detection, classification, and parametrization. In this work, three different examples of bioinformatics tasks are presented and discussed. The sphericity of such objects is one of the key parameters in nano-object detection. The parametrization has to deal with specific properties of electron microscopy images, such as a high level of noise, low contrast, uneven background, and few pixels per object. The presented approach combines unsupervised filtration and automatic object detection. The result is a software application with a simple graphical user interface.

Pavla Urbanová, Norbert Cyran, Pavel Souček, Antonín Bárta, Vladyslav Bozhynov, Dinara Bekkhozhayeva, Petr Císař, Miloš Železný

### Models of Multiple Interactions from Collinear Patterns

Each collinear pattern should be made up of a large number of feature vectors located on a plane in a multidimensional feature space. A data subset located on a plane can represent linear interactions between multiple variables (features, genes). Collinear (flat) patterns can be efficiently extracted from large, multidimensional data sets through minimization of the collinearity criterion function, which is convex and piecewise linear (CPL). Flat patterns extracted from representative data sets could provide an opportunity to discover new, important interaction models. As an example, exploration of data sets representing clinical practice and genetic testing could result in multiple interaction models of phenotype and genotype features.

Leon Bobrowski, Paweł Zabielski

### Identification of the Treatment Survivability Gene Biomarkers of Breast Cancer Patients via a Tree-Based Approach

Studying breast cancer survivability among patients who received various treatments may help to understand the relationship between survivability and treatment therapy based on gene expression. In this work, we built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy (H), radiotherapy (R), or surgery (S)) will survive beyond five years after the treatment. Our classifier is a tree-based hierarchical approach that partitions breast cancer patients according to survivability classes; each node in the tree is associated with a treatment therapy and finds a subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset consisting of 347 treated breast cancer patients and identified potential biomarker subsets with accuracies ranging from 80.9% to 100%. We have investigated the roles of many biomarkers through the literature.

Ashraf Abou Tabl, Abedalrhman Alkhateeb, Luis Rueda, Waguih ElMaraghy, Alioune Ngom

### Workflows and Service Discovery: A Mobile Device Approach

Bioinformatics has moved from command-line standalone programs to web-service-based environments. This trend has resulted in an enormous amount of online resources that can be hard to find and identify, let alone execute and exploit. Furthermore, these resources are generally aimed at solving specific tasks, and these tasks usually need to be combined in order to achieve the desired results. Finding the appropriate set of tools to build a workflow that solves a problem with the services available in a repository is therefore itself a complex exercise, raising issues such as service discovery, composition and representation. On the technological side, mobile devices have experienced remarkable growth in the number of users and in technical capabilities. Starting from this reality, in the present paper we propose a solution for service discovery and workflow generation, and we review and discuss distinct approaches to representing workflows in a mobile environment. As a proof of concept, a specific use case has been developed: we have embedded an expanded version of our Magallanes search engine into mORCA, our mobile client for bioinformatics. This composition delivers a powerful and ubiquitous solution that provides the user with a handy tool not only for generating and representing workflows, but also for discovering services, data types, operations and service types.

Ricardo Holthausen, Sergio Díaz-Del-Pino, Esteban Pérez-Wohlfeil, Pablo Rodríguez-Brazzarola, Oswaldo Trelles

### Chloroplast Genomes Exhibit Eight-Cluster Structuredness and Mirror Symmetry

Chloroplast genomes have an eight-cluster structuredness in triplet frequency space. Small fragments of a genome, converted into triplet frequency dictionaries, are the elements to be clustered. The typical structure consists of eight clusters: six of them correspond to three different positions of a reading frame shifted by 0, 1 and 2 nucleotides (in two opposing strands), the seventh cluster corresponds to the junk regions of a genome, and the eighth cluster comprises the fragments with excessive GC-content bearing specific RNA genes. The structure exhibits a specific symmetry.

Michael Sadovsky, Maria Senashova, Andrew Malyshev
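
The abstract above describes converting small genome fragments into triplet frequency dictionaries before clustering. The following is a minimal sketch of that conversion, under the assumption that triplets are read without overlap (step 3) from a chosen frame offset and that fragments are cut with a sliding window; the fragment length, step and function names are illustrative and not taken from the paper.

```python
from collections import Counter


def fragments(genome, length, step):
    """Yield fixed-length fragments of a sequence taken with a sliding window."""
    for start in range(0, len(genome) - length + 1, step):
        yield genome[start:start + length]


def triplet_frequencies(fragment, frame=0):
    """Frequency dictionary of non-overlapping triplets read from offset `frame`.

    Triplets containing symbols other than A, C, G, T are skipped.
    """
    triplets = (fragment[i:i + 3] for i in range(frame, len(fragment) - 2, 3))
    counts = Counter(t for t in triplets if set(t) <= set("ACGT"))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}
```

Each fragment then becomes a point in a 64-dimensional frequency space, and shifting `frame` by 0, 1 or 2 changes which triplets are counted, which is the kind of reading-frame effect the clusters in the abstract reflect.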

### Are Radiosensitive and Regular Response Cells Homogeneous in Their Correlations Between Copy Number State and Surviving Fraction After Irradiation?

Biomarkers of radiosensitivity are currently of widespread research interest due to the demand for an adequate method of predicting cell response to ionizing radiation. Copy Number State (CNS) alterations may significantly influence individual radiosensitivity. However, their possible impact has not yet been fully investigated. The purpose of this research was to select markers for which a CNS change is significantly associated with the surviving fraction after irradiation with a 2 Gy dose (SF2), which is a commonly used measure of cellular radiosensitivity. Moreover, a new strategy of combining qualitative and quantitative approaches is proposed, as the identification of potential biomarkers is based not only on the overall correlation between SF2 and CNS, but also on the differences in this correlation between radiosensitive and regular response cell strains. Four patterns of association are considered, and functional analysis and Gene Ontology enrichment analysis of the obtained sets of genomic positions are performed. The proposed strategy provides a comprehensive insight into the strength and direction of the association between CNS and cellular radiosensitivity. The results suggest that the commonly used approach of group comparison, based on testing two samples against each other, is not sufficient in terms of radiosensitivity, since this is not a discrete variable and the division into sensitive, normal and resistant individuals is always stipulated.

Joanna Tobiasz, Najla Al-Harbi, Sara Bin Judia, Salma Majid, Ghazi Alsbeih, Joanna Polanska

### Protein Tertiary Structure Prediction via SVD and PSO Sampling

We discuss the use of Singular Value Decomposition (SVD) as a model reduction technique in protein tertiary structure prediction, alongside the uncertainty analysis associated with tertiary structure predictions via Particle Swarm Optimization (PSO). The algorithm presented in this paper belongs to the category of decoy-based modelling, since it first finds a good protein model located in the low-energy region of the protein energy landscape, which is then used to establish a three-dimensional space where the free-energy optimization and search are performed via an exploratory version of PSO. The ultimate goal of this algorithm is to obtain a representative sample of the protein backbone structure and the alternate states in an energy region equivalent to or lower than the one corresponding to the protein model used to establish the expansion (model reduction), yielding other protein structures that are closer to the native structure and a measure of the uncertainty in the protein tertiary structure reconstruction. The strength of this methodology is that it is simple and fast, and it alleviates the ill-posed character of the protein structure prediction problem, which is very high-dimensional, improving the results when it is applied to a good protein model in the low-energy region. To demonstrate this numerically, we present the results of applying the SVD-PSO algorithm to a set of proteins from the CASP competition whose native structures are known.

Óscar Álvarez, Juan Luis Fernández-Martínez, Ana Cernea, Zulima Fernández-Muñiz, Andrzej Kloczkowski

### Fighting Fire with Fire: Computational Prediction of Microbial Targets for Bacteriocins

Recently, we have witnessed the emergence of bacterial strains resistant to all known antibacterials. Given several limitations of existing experimental methods, these events justify the need for computer-aided methods to systematically and rationally identify new antibacterial agents. Here, we propose a methodology for the systematic prediction of interactions between bacteriocins and bacterial protein targets. The protein-bacteriocin interactions are predicted using a mesh of classifiers previously developed by the authors, allowing the identification of the best bacteriocin candidates for antibiotic use and of potential drug targets.

Edgar D. Coelho, Joel P. Arrais, José Luís Oliveira

### A Graph-Based Approach for Querying Protein-Ligand Structural Patterns

In the context of protein engineering and biotechnology, the discovery and characterization of structural patterns is very relevant as it can give fundamental insights about protein structures. In this paper we present GSP4PDB, a bioinformatics web tool that lets the users design, search and analyze protein-ligand structural patterns inside the Protein Data Bank (PDB). The novel feature of GSP4PDB is that a protein-ligand structural pattern is graphically designed as a graph such that the nodes represent protein’s components and the edges represent structural relationships. The resulting graph pattern is transformed into a SQL query, and executed in a PostgreSQL database system where the PDB data is stored. The results of the search are presented using a textual representation, and the corresponding binding-sites can be visualized using a JSmol interface.

Renzo Angles, Mauricio Arenas

### Predicting Disease Genes from Clinical Single Sample-Based PPI Networks

Experimentally identifying disease genes is time-consuming and expensive, and thus it is appealing to develop computational methods for predicting disease genes. Many existing methods predict new disease genes from protein-protein interaction (PPI) networks. However, PPIs change during a cell’s lifetime, and thus using only static PPI networks may degrade the performance of algorithms. In this study, we propose an algorithm for predicting disease genes based on centrality features extracted from clinical single sample-based PPI networks (dgCSN). Our dgCSN first constructs a single sample-based network from a universal static PPI network and the clinical gene expression of each case sample, and then fuses these networks into one network according to the frequency with which each edge appears in all single sample-based networks. Then, centrality-based features are extracted from the fused network to capture the properties of each gene. Finally, regression analysis is performed to predict the probability of each gene being disease-associated. The experiments show that our dgCSN achieves AUC values of 0.893 and 0.807 on Breast Cancer and Alzheimer’s disease, respectively, which are better than those of two competing methods. Further analysis of the top 10 prioritized genes also demonstrates that dgCSN is effective for predicting new disease genes.

Ping Luo, Li-Ping Tian, Bolin Chen, Qianghua Xiao, Fang-Xiang Wu
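
The abstract above describes fusing single sample-based networks according to how often each edge appears, then extracting centrality features. As a rough sketch of that idea, assuming undirected edges, a simple frequency threshold and weighted degree as the only centrality (the paper's actual fusion rule and feature set are not specified here), one could write the following. The names `fuse_networks`, `weighted_degree` and the threshold value are hypothetical.

```python
from collections import Counter


def fuse_networks(sample_networks, min_frequency=0.5):
    """Fuse per-sample edge lists into one weighted network.

    Each edge is weighted by the fraction of samples in which it appears.
    Assumption: edges below `min_frequency` are dropped.
    """
    n_samples = len(sample_networks)
    counts = Counter()
    for edges in sample_networks:
        # frozenset makes ("A", "B") and ("B", "A") the same undirected edge,
        # and the set comprehension counts each edge once per sample.
        counts.update({frozenset(edge) for edge in edges})
    return {
        edge: count / n_samples
        for edge, count in counts.items()
        if count / n_samples >= min_frequency
    }


def weighted_degree(fused):
    """Sum of incident edge weights per gene, a simple centrality feature."""
    degree = Counter()
    for edge, weight in fused.items():
        for gene in edge:
            degree[gene] += weight
    return dict(degree)
```

Features of this kind, computed per gene on the fused network, would then feed the regression step that the abstract mentions.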

### Red Blood Cell Model Validation in Dynamic Regime

Our work is set in the area of microfluidics and deals with the behavior of fluid and blood cells in microfluidic devices. The aim of this article is to validate our numerical model of a red blood cell. This is done by comparing a computer simulation with an existing laboratory experiment. The experiment explores the velocity and deformation of blood cells in a hyperbolic microchannel. Our research confirms that the deformation of the red blood cell in the simulation is comparable with the results from the experiment, as long as the fluid velocity profile in the simulation fits the fluid velocity profile of the experiment. This validates the elastic parameters of the red blood cell model.

Kristína Kovalčíková, Alžbeta Bohiniková, Martin Slavík, Isabelle Mazza Guimaraes, Ivan Cimrák

### Exploiting Ladder Networks for Gene Expression Classification

The application of deep learning to biology is of increasing relevance, but it is difficult; one of the main difficulties is the lack of massive amounts of training data. However, some recent applications of deep learning to the classification of labeled cancer datasets have been successful. Along this direction, in this paper, we apply Ladder networks, a recent and interesting network model, to the binary cancer classification problem; our results improve over the state of the art in deep learning and over the conventional state of the art in machine learning; achieving such results required a careful adaptation of the available datasets and tuning of the network.

Guray Golcuk, Mustafa Anil Tuncel, Arif Canakoglu

### Simulation of Blood Flow in Microfluidic Devices for Analysing of Video from Real Experiments

Simulation of microfluidic devices is a great tool for optimizing these devices. For the development of simulation models, it is necessary to ensure a sufficient degree of simulation accuracy. Accuracy is ensured by measuring appropriate values that tell us about the course of the simulation and can also be measured in a real experiment. Measured values will simplify the real situation so that we can develop the model for a specific purpose and measure the values that are relevant to the research. In this article we present the approach in which the data we have gained from simulation are used to improve the quality of data processing from video from a real experiment.

Hynek Bachratý, Katarína Bachratá, Michal Chovanec, František Kajánek, Monika Smiešková, Martin Slavík

### Alignment-Free Z-Curve Genomic Cepstral Coefficients and Machine Learning for Classification of Viruses

Accurate detection of pathogenic viruses has become highly imperative, because viral diseases constitute a huge threat to human health and wellbeing on a global scale. However, both traditional and recent techniques for viral detection suffer from various setbacks. In addition, some of the existing alignment-free methods are also limited with respect to viral detection accuracy. In this paper, we present the development of an alignment-free, digital signal processing based method for pathogenic viral detection named Z-Curve Genomic Cepstral Coefficients (ZCGCC). To evaluate the method, ZCGCC were computed from twenty-six pathogenic viral strains extracted from the ViPR corpus. A Naïve Bayesian classifier, which is a popular machine learning method, was experimentally trained and validated using the extracted ZCGCC and other alignment-free methods in the literature. Comparative results show that the proposed ZCGCC gives good accuracy (93.0385%) and improved performance over existing alignment-free methods.

Emmanuel Adetiba, Oludayo O. Olugbara, Tunmike B. Taiwo, Marion O. Adebiyi, Joke A. Badejo, Matthew B. Akanle, Victor O. Matthews
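
The method in the abstract above starts from the Z-curve representation of a genome. As a minimal sketch of that first step only, the code below computes the standard cumulative Z-curve components (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding) for a DNA string. The later cepstral analysis and the classifier are not shown, and the function name is illustrative.

```python
def z_curve(sequence):
    """Cumulative Z-curve coordinates (x_n, y_n, z_n) of a DNA sequence.

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs keto
    z_n = (A_n + T_n) - (C_n + G_n)   weak vs strong hydrogen bonding
    where A_n, C_n, G_n, T_n count each base among the first n valid bases.
    Symbols other than A, C, G, T are skipped.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in sequence.upper():
        if base not in counts:
            continue
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points
```

Each of the three coordinate series is an ordinary numeric signal, which is what makes signal processing tools such as cepstral analysis applicable without sequence alignment.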

### A Combined Approach of Multiscale Texture Analysis and Interest Point/Corner Detectors for Microcalcifications Diagnosis

Screening programs use mammography as the primary diagnostic tool for detecting breast cancer at an early stage. The diagnosis of some lesions, such as microcalcifications, is still difficult for radiologists today. In this paper, we propose an automatic model for characterizing and discriminating tissue as normal/abnormal and benign/malignant in digital mammograms, as a support tool for radiologists. We trained a Random Forest classifier on textural features extracted from a multiscale image decomposition based on the Haar wavelet transform, combined with the interest points and corners detected using Speeded-Up Robust Features (SURF) and the Minimum Eigenvalue Algorithm (MinEigenAlg), respectively. We tested the proposed model on 192 ROIs extracted from 176 digital mammograms of a public database. The proposed model performed well in the prediction of normal/abnormal and benign/malignant ROIs, with median AUC values of 98.46% and 94.19%, respectively. The experimental result was comparable with the performance of related work.

Liliana Losurdo, Annarita Fanizzi, Teresa M. A. Basile, Roberto Bellotti, Ubaldo Bottigli, Rosalba Dentamaro, Vittorio Didonna, Alfonso Fausto, Raffaella Massafra, Alfonso Monaco, Marco Moschetta, Ondina Popescu, Pasquale Tamborra, Sabina Tangaro, Daniele La Forgia
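
The abstract above relies on a multiscale decomposition based on the Haar wavelet transform. The paper works on 2-D mammogram regions; as a simplified sketch, the code below shows the orthonormal Haar step on a 1-D signal and how repeating it on the approximation gives several scales. The function names and the 1-D restriction are assumptions made for illustration.

```python
from math import sqrt


def haar_step(signal):
    """One level of the orthonormal Haar transform of an even-length signal.

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = sqrt(2.0)
    approximation = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approximation, detail


def haar_multiscale(signal, levels):
    """Apply haar_step repeatedly to the approximation coefficients.

    Returns the coarsest approximation and the detail coefficients per level,
    finest level first.
    """
    approximation = list(signal)
    details = []
    for _ in range(levels):
        approximation, detail = haar_step(approximation)
        details.append(detail)
    return approximation, details
```

For images, the same step is typically applied along rows and then along columns, and textural features are computed on the resulting sub-bands at each scale.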

### An Empirical Study of Word Sense Disambiguation for Biomedical Information Retrieval System

Document representation is an important stage in the indexing of biomedical documents. The ordinary way to represent a text is as a bag of words (BoW). This representation lacks sense information because it ignores all the semantics that reside in the original text; in contrast, conceptualization using background knowledge enriches document representation models. Three strategies can be used to carry out the conceptualization task: Adding Concept, Partial Conceptualization, and Complete Conceptualization. When the senses corresponding to a polysemic term are looked up in semantic resources, multiple matches are detected, which introduces ambiguity into the final document representation. Three strategies can be used for disambiguation: First Concept, All Concepts, and Context-Based. SenseRelate is a well-known Context-Based algorithm that uses a fixed window size and weights the terms in the context by their distance from the target word. Because this may negatively affect the resulting concepts or senses, we propose a simple modified version of the SenseRelate algorithm, namely NoDistanceSenseRelate, which ignores the distance, so that all terms in the context have the same distance weight. To evaluate the effect of the conceptualization and disambiguation strategies on the indexing process, several experiments were conducted using the OHSUMED corpus on a biomedical information retrieval system. The results show that the Context-Based methods (SenseRelate and NoDistanceSenseRelate) outperform the other methods when the Adding Concept conceptualization strategy is applied. The results provide evidence for the benefit of adding the sense of concepts to the term representation in the IR process.

Mohammed Rais, Abdelmonaime Lachkar

### Modelling the Release of Moxifloxacin from Plasma Grafted Intraocular Lenses with Rotational Symmetric Numerical Framework

A rotationally symmetric finite element model is constructed to simulate the release of moxifloxacin from different types of plasma-grafted intraocular lenses, utilizing general discontinuous boundary conditions to describe the interface between lens and outside medium. Such boundary conditions allow for the modelling of partitioning and interfacial mass transfer resistance. Due to its rotational symmetry, the shape of the optical part of the intraocular lens is fully taken into account. Two types of polyacrylates were plasma-grafted to the intraocular lens to act as barriers for the release of the loaded drug. Simulations are carried out and compared to release experiments to infer drug-material properties, which is crucial for optimising therapeutic effects.

Kristinn Gudnason, Sven Sigurdsson, Fjola Jonsdottir, A. J. Guiomar, A. P. Vieira, P. Alves, P. Coimbra, M. H. Gil

### Predicting Tumor Locations in Prostate Cancer Tissue Using Gene Expression

Prostate cancer can be missed due to the limited number of biopsies or the ineffectiveness of standard screening methods. Finding gene biomarkers for prostate cancer location and analyzing their transcriptomics can help clinically understand the development of the disease and improve treatment efficiency. In this work, a classification model is built based on gene expression measurements of samples from patients who have cancer on the left, right, and both lobes of the prostate as classes. A hybrid feature selection is used to select the best possible set of genes that can differentiate the three classes. Standard machine learning classifiers with the one-versus-all technique are used to select potential biomarkers for each laterality class. RNA-sequencing data from The Cancer Genome Atlas (TCGA) Prostate Adenocarcinoma (PRAD) was used. This dataset consists of 450 samples from different patients with different cancer locations. There are three primary locations within the prostate: left, right and bilateral. Each sample in the dataset contains expression levels for each of the 60,488 genes; the genes are expressed in Transcripts Per Kilobase Million (TPM) values. The results show promising prospects for predicting prostate cancer laterality. With 99% accuracy, a support vector machine (SVM) based on a radial basis function kernel (SVM-RBF) was able to identify each group from the others using the subset of genes. Three groups of genes (RTN1, HLA-DMB, MRI1 and others) were found to be differentially expressed among the three different tumor locations. The findings were validated against multiple findings in the literature, which confirm the relationship between those genes and prostate cancer.

Osama Hamzeh, Abedalrhman Alkhateeb, Luis Rueda

### Concept of a Module for Physical Security of Material Secured by LIMS

Automation and miniaturization are current trends, and they also affect the eHealth program for the digitization of all parts of the healthcare system. The main purpose is to improve patient care and the gathering and provision of information within the given structure. Data protection constitutes an essential and integral part of the whole system at all levels, including laboratories. The following text is devoted to protecting the storage and handling of biomedical samples. A hardware (HW) module connected to the lab information system (IS) is used for this purpose. The module is able not only to regulate access to the monitored material but also to send logs to the central database. The proposal is based on the requirements of minimal financial investment, ergonomic design and the range of provided functions. By design, the module can be inserted into the sensor grid of the ambient lab system. Its interfaces allow it to process information from the connected sensors, to evaluate it and to create commands for the subsequent elements. The solution is demonstrated on a fridge since, in general, it is one of the most critical places in the lab.

Pavel Blazek, Kamil Kuca, Ondrej Krejcar

### scFeatureFilter: Correlation-Based Feature Filtering for Single-Cell RNAseq

Single cell RNA sequencing is becoming increasingly popular due to rapidly evolving technology, decreasing costs and its wide applicability. However, the technology suffers from a high drop-out rate and high technical noise, mainly due to the low amount of starting material. This hinders the extraction of biological variability, or signal, from the data. One of the first steps in single cell analysis pipelines is, therefore, to filter the data to keep only the most informative features. This filtering step is often done by arbitrarily selecting a threshold. In order to establish a data-driven approach for the feature filtering step, we developed an R package, scFeatureFilter, which uses the lack of correlation between features as a proxy for the presence of high technical variability. As a result, the tool filters the input data, selecting the features where the biological variability is higher than the technical noise.

Angeles Arzalluz-Luque, Guillaume Devailly, Anagha Joshi

### NearTrans Can Identify Correlated Expression Changes Between Retrotransposons and Surrounding Genes in Human Cancer

Recent studies using high-throughput sequencing technologies have demonstrated that transposable elements (TEs) seem to be involved not only in the onset of some cancers but also in cancer development. New dedicated tools have recently been designed to quantify the global expression of the different families of TEs from RNA-seq data, but identifying the particular differentially expressed TEs would provide more useful results. To fill this gap, we present NearTrans, a bioinformatic workflow that takes advantage of gEVE (a database of endogenous viral elements) to determine differentially expressed TEs as well as the activity of genes surrounding them, in order to study whether changes in TE expression are correlated with nearby genes. A special requirement is that input RNA-seq reads must derive from normal and cancerous tissue from the same patient. NearTrans has been tested using RNA-seq data from 14 patients with prostate cancer, where two HERVs (HERVH-int and HERV17-int) and three LINE-1 (L1PA3, L1PA4 and L1PA7) were over-expressed in separate positions of the genome. Only one of the nearby genes (ACSM1) is over-expressed in prostate cancer, in agreement with the literature. Three (PLA2G5, UBE2MP1 and MIR4675) change their expression between normal and tumor cells, although the change is not statistically significant. The fifth (LOC101928437) is far from L1PA7, so their correlation is unlikely. These results suggest that, in some cases such as the HERVs, TE expression can be governed by the genome context related to cancer, while in others, such as the LINEs, expression is less related to the genome context, even though they are surrounded by genes potentially involved in cancer. Therefore, NearTrans seems to be a suitable and useful workflow to discover or corroborate genes involved in cancer that might be used as specific biomarkers for the diagnosis, prognosis or treatment of cancer.

Rafael Larrosa, Macarena Arroyo, Rocío Bautista, Carmen María López-Rodríguez, M. Gonzalo Claros

### An Interactive Strategy to Visualize Common Subgraphs in Protein-Ligand Interaction

Interactions between proteins and ligands play an important role in biological processes of living systems. For this reason, the development of computational methods to facilitate the understanding of the ligand-receptor recognition process is fundamental, since this comprehension is a major step towards ligand prediction, target identification and lead discovery, among others. This article presents a visual interactive interface to explore protein-ligand interactions and their conserved substructures for a set of similar proteins. The protein-ligand interface is modeled as bipartite graphs, where nodes represent protein and ligand atoms, and edges depict interactions between them. Such graphs are the input to a search for frequent subgraphs, which are the conserved interaction patterns over the datasets. To illustrate the potential of our strategy, we used two test datasets, Ricin and human CDK2. Availability: http://dcc.ufmg.br/~alexandrefassio/gremlin/.

Alexandre V. Fassio, Charles A. Santana, Fabio R. Cerqueira, Carlos H. da Silveira, João P. R. Romanelli, Raquel C. de Melo-Minardi, Sabrina de A. Silveira

### Meta-Alignment: Combining Sequence Aligners for Better Results

Analysing next generation sequencing data often involves the use of a sequence aligner to map the sequenced reads against a reference. The output of this process is the basis of many downstream analyses, and its quality is thus critical. Many different alignment tools exist, each with a multitude of options, creating a vast number of possible ways to align sequences. Choosing the correct aligner and options for a specific dataset is complex, and yet it can have a major impact on the quality of the data analysis. We propose a new approach in which we combine the output of multiple sequence aligners to create improved sequence alignment files. Our novel approach can be used to increase either the sensitivity or the specificity of the alignment process. The software is freely available for non-commercial usage at http://gnaty.phenosystems.com/.

Beat Wolf, Pierre Kuonen, Thomas Dandekar

### Exploiting In-memory Systems for Genomic Data Analysis

With the increasing adoption of next generation sequencing technology in medical practice, there is an increasing demand for faster data processing to gain immediate insights from the patient’s genome. Due to the extensive amount of genomic information and its big data nature, data processing takes a long time and delays are often experienced. In this paper, we show how to exploit in-memory platforms for big genomic data analysis, with a focus on the variant analysis workflow. We determine where different in-memory techniques are used in the workflow and explore different memory-based strategies to speed up the analysis. Our experiments show promising results and encourage further research in this area, especially with the rapid advancement in memory and SSD technologies.

Zeeshan Ali Shah, Mohamed El-Kalioby, Tariq Faquih, Moustafa Shokrof, Shazia Subhani, Yasser Alnakhli, Hussain Aljafar, Ashiq Anjum, Mohamed Abouelhoda

### Improving Metagenomic Assemblies Through Data Partitioning: A GC Content Approach

Assembling metagenomic data sequenced by NGS platforms poses significant computational challenges, especially due to large volumes of data, sequencing errors, and variations in size, complexity, diversity and abundance of organisms present in a given metagenome. To overcome these problems, this work proposes an open-source, bioinformatic tool called GCSplit, which partitions metagenomic sequences into subsets using a computationally inexpensive metric: the GC content. Experiments performed on real data show that preprocessing short reads with GCSplit prior to assembly reduces memory consumption and generates higher quality results, such as an increase in the size of the largest contig and N50 metric, while both the L50 value and the total number of contigs produced in the assembly were reduced. GCSplit is available at https://github.com/mirand863/gcsplit.

Fábio Miranda, Cassio Batista, Artur Silva, Jefferson Morais, Nelson Neto, Rommel Ramos
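
As a rough illustration of partitioning reads by GC content, the sketch below sorts reads by their GC fraction and cuts the sorted list into a chosen number of subsets. The abstract does not describe how GCSplit forms its partitions, so the function names and the equal-size splitting rule here are assumptions, not the tool's actual behaviour.

```python
from typing import List


def gc_content(read: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    read = read.upper()
    if not read:
        return 0.0
    return (read.count("G") + read.count("C")) / len(read)


def partition_by_gc(reads: List[str], n_parts: int) -> List[List[str]]:
    """Split reads into n_parts subsets of near-equal size, ordered by GC content."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    ordered = sorted(reads, key=gc_content)
    size, extra = divmod(len(ordered), n_parts)
    parts: List[List[str]] = []
    start = 0
    for i in range(n_parts):
        # The first `extra` subsets receive one additional read.
        end = start + size + (1 if i < extra else 0)
        parts.append(ordered[start:end])
        start = end
    return parts


if __name__ == "__main__":
    print(partition_by_gc(["AAAA", "GGCC", "ATGC", "GGGA"], 2))
```

Each subset could then be handed to an assembler separately, which is where a reduction in peak memory use would come from.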

### Quality Assessment of High-Throughput DNA Sequencing Data via Range Analysis

A number of studies on the quality assessment of sequencing data have appeared in the recent literature. These efforts, to a great extent, focused on reporting the statistical parameters regarding the distribution of the quality scores and/or the base-calls in a FASTQ file. We investigate another dimension of quality assessment, motivated by the fact that reads including long intervals with fewer errors improve the performance of the post-processing tools in the downstream analysis. Thus, the quality assessment procedures proposed in this study aim to analyze the segments of the reads that are above a certain quality. We define an interval of a read to be of desired quality when there are at most k quality scores less than or equal to a threshold value v, for some k and v provided by the user. We present the algorithm to detect those ranges and introduce new metrics computed from their lengths. These metrics include the mean values for the longest, shortest, average, cubic average, coefficient of variation, and segment numbers of the fragment lengths in each read that are appropriate according to the k and v input parameters. We also provide a new software tool, QASDRA, for quality assessment of sequencing data via range analysis, which is available at https://github.com/ali-cp/QASDRA.git. QASDRA creates the quality assessment report of an input FASTQ file according to the user-specified k and v parameters. It can also filter out reads according to the metrics introduced.

Ali Fotouhi, Mina Majidi, M. Oğuzhan Külekci
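
The desired-quality definition above (at most k scores less than or equal to v) can be illustrated with a sliding window. This is a minimal sketch that finds only the longest such interval in one read; it is not QASDRA's algorithm, and the function names and the Phred+33 offset are assumptions.

```python
from typing import List, Tuple


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding by default."""
    return [ord(ch) - offset for ch in quality_line]


def longest_quality_range(scores: List[int], k: int, v: int) -> Tuple[int, int]:
    """Return the half-open (start, end) of the longest interval containing
    at most k scores that are less than or equal to v."""
    best = (0, 0)
    left = 0
    low = 0  # number of scores <= v inside the current window
    for right, q in enumerate(scores):
        if q <= v:
            low += 1
        while low > k:  # shrink from the left until the window is valid again
            if scores[left] <= v:
                low -= 1
            left += 1
        if right + 1 - left > best[1] - best[0]:
            best = (left, right + 1)
    return best


if __name__ == "__main__":
    read_scores = [30, 30, 10, 30, 30, 30, 5, 30]
    print(longest_quality_range(read_scores, k=1, v=20))
```

The window only ever moves forward, so the scan is linear in the read length; ties are resolved in favour of the leftmost interval.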

### A BLAS-Based Algorithm for Finding Position Weight Matrix Occurrences in DNA Sequences on CPUs and GPUs

Finding all matches of a set of position weight matrices (PWMs) in large DNA sequences is a compute-intensive task. We propose a light-weight algorithm inspired by high performance computing techniques in which the problem of finding PWM occurrences is expressed in terms of matrix-matrix products which can be performed efficiently by highly optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs. It is competitive on CPUs with state-of-the-art software for matching PWMs in terms of runtime while requiring far less memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 41 min with a p-value of $$10^{-4}$$ using a 24-core machine. On a dual GPU system, the same task can be performed in under 5 min.

Jan Fostier
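
To make the matrix-product formulation concrete, the sketch below one-hot encodes every sequence window as a row vector and multiplies the resulting matrix by a matrix whose columns are flattened PWMs, so that a single product scores all windows against all PWMs. The data layout, the function names and the plain-Python `matmul` are assumptions for illustration; the abstract does not give the paper's layout, and in practice the product would be delegated to a BLAS routine.

```python
from typing import List

BASES = "ACGT"  # assumption: sequences contain only these four symbols


def one_hot_windows(seq: str, m: int) -> List[List[float]]:
    """One row per window of length m; each row is a flattened one-hot vector (4*m)."""
    rows = []
    for start in range(len(seq) - m + 1):
        row = [0.0] * (4 * m)
        for j, base in enumerate(seq[start:start + m]):
            row[4 * j + BASES.index(base)] = 1.0
        rows.append(row)
    return rows


def matmul(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Plain matrix product, standing in for an optimized BLAS call."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def scan(seq: str, pwms: List[List[float]], m: int) -> List[List[float]]:
    """Score every window against every PWM.

    pwms holds flattened matrices of length 4*m, position-major (A, C, G, T per position).
    The result has one row per window and one column per PWM.
    """
    windows = one_hot_windows(seq, m)            # W x 4m
    pwm_t = [list(col) for col in zip(*pwms)]    # 4m x P
    return matmul(windows, pwm_t)                # W x P


if __name__ == "__main__":
    motif_ac = [1.0, 0.0, 0.0, 0.0,  0.0, 1.0, 0.0, 0.0]  # rewards A then C
    print(scan("TACG", [motif_ac], 2))
```

Reporting matches would then amount to thresholding the score matrix, with one threshold per PWM.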

### Analyzing the Differences Between Reads and Contigs When Performing a Taxonomic Assignment Comparison in Metagenomics

Metagenomics is an inherently complex field in which one of the primary goals is to determine the organisms that compose an environmental sample. To this end, diverse tools have been developed that are based on the similarity search results obtained from comparing a set of sequences against a database. However, to achieve this goal there are still issues to resolve, such as dealing with genomic variants and detecting repeated sequences that could belong to different species in a mixture with uneven and unknown representation of organisms. Hence, the question arises of whether analyzing a sample with reads provides a better understanding of the metagenome than analyzing it with contigs. The assembly yields larger genomic fragments but bears the risk of producing chimeric contigs. On the other hand, reads are shorter and therefore their statistical significance is harder to assess, but there is a larger number of them. Consequently, we have developed a workflow to assess and compare the quality of each of these alternatives. Synthetic read datasets belonging to previously identified organisms are generated in order to validate the results. Afterwards, we assemble these into a set of contigs and perform a taxonomic analysis on both datasets. The tools we have developed demonstrate that analyzing with reads provides a more trustworthy representation of the species in a sample than contigs, especially in cases that present high genomic variability.

Pablo Rodríguez-Brazzarola, Esteban Pérez-Wohlfeil, Sergio Díaz-del-Pino, Ricardo Holthausen, Oswaldo Trelles

### Estimating the Length Distributions of Genomic Micro-satellites from Next Generation Sequencing Data

Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. In contrast to unique genomic regions, micro-satellites exhibit high intrinsic polymorphism, which mainly derives from variability in length. Length distributions are widely used to represent the polymorphisms. Recent studies report that some micro-satellites alter their length distributions significantly in tumor tissue samples compared with those observed in normal samples, which has become a hot topic in cancer genomics. Several state-of-the-art approaches have been proposed to identify the length distributions from sequencing data. However, the existing approaches can only handle micro-satellites shorter than one read length, which limits research on long micro-satellite events. In this article, we propose a probabilistic approach, implemented as ELMSI, that estimates the length distributions of micro-satellites longer than one read length. The core algorithm works on a set of mapped reads. It first clusters the reads, and a k-mer extension algorithm is adopted to detect the unit and the breakpoints. Then, it runs an expectation maximization algorithm to approximate the true length distributions. According to the experiments, ELMSI is able to handle micro-satellites with lengths ranging from shorter than one read length to the 10 kbp scale. A series of comparison experiments was conducted, varying the numbers of micro-satellite regions, read lengths and sequencing coverages, and ELMSI outperforms MSIsensor in most cases.

Xuan Feng, Huan Hu, Zhongmeng Zhao, Xuanping Zhang, Jiayin Wang

### CIGenotyper: A Machine Learning Approach for Genotyping Complex Indel Calls

Complex insertion and deletion (complex indel) is a rare category of genomic structural variation. A complex indel presents as one or multiple DNA fragments inserted into the genomic location where a deletion occurs. Several studies emphasize the importance of complex indels, and some state-of-the-art approaches have been proposed to detect them from sequencing data. However, genotyping complex indel calls is another challenging computational problem, because some features commonly used for genotyping indel calls from sequencing data can be invalid due to the components of complex indels. Thus, in this article, we propose a machine learning approach, CIGenotyper, to estimate genotypes of complex indel calls. CIGenotyper adopts a relevance vector machine (RVM) framework. For each candidate call, it first extracts a set of features from the candidate region, which usually includes the read depth, the variant allelic frequency for aligned contigs, the numbers of split and discordant paired-end reads, etc. For a complex indel call, given its features, a trained RVM outputs the genotype with the highest likelihood. An algorithm is also proposed to train the RVM. We compare our approach to two popular approaches, Gindel and Pindel, on multiple groups of artificial datasets. Our model outperforms them on average success rates in most cases when varying the coverage of the given data, the read lengths and the distributions of the lengths of the pre-set complex indels.

Tian Zheng, Yang Li, Yu Geng, Zhongmeng Zhao, Xuanping Zhang, Xiao Xiao, Jiayin Wang

### Genomic Solutions to Hospital-Acquired Bacterial Infection Identification

Hospital acquired infections (HAIs) are notorious for their likelihood of fatal outcomes in infected patients due to rapid bacterial mutation rates, consequent resistance to antibiotic treatments and stubbornness to treatment, let alone eradication, to the point they have become a challenge to medical science. A fast and accurate method to identify HAI will assist in the diagnosis and identification of appropriate patient treatment and in controlling future outbreaks. Based on recently developed new methods for genomic data extraction, representation and analysis in bioinformatics, we propose an entirely new method for species identification. The accuracy of the new methods is very competitive and in several cases outperforms the standard spectroscopic protein-based MALDI-TOF MS commonly used in clinical microbiology laboratories and public healthcare settings, at least prior to translation to a clinical setting. The proposed method relies on a model of hybridization that is robust to frameshifts and thus is likely to provide resilience to length variability in the sonication of the samples, probably one of the major challenges in a translation to clinical settings.

Max H. Garzon, Duy T. Pham

### Kernel Conditional Embeddings for Associating Omic Data Types

Computational methods are needed to combine diverse types of genome-wide data in a meaningful manner. Based on the kernel embedding of conditional probability distributions, a new measure for inferring the degree of association between two multivariate data sources is introduced. We analyze the performance of the proposed measure in integrating mRNA expression, DNA methylation and miRNA expression data.

Ferran Reverter, Esteban Vegas, Josep M. Oller

### Metastasis of Cutaneous Melanoma: Risk Factors, Detection and Forecasting

In this work, we present a quantitative analysis of cutaneous melanoma based on 615 patients seen at Cruces University Hospital between 1988 and 2012. First, we studied which characteristics are most associated with metastasis of this kind of cancer. We observed that people with light eyes, light hair, an ulcerated nevus, or exposure to the sun during working hours had a higher risk of metastasis. In addition, a large diameter or a thick nevus (measured by Breslow’s depth) was also associated with this condition. Next, we evaluated the metastasis detection capability of the tests performed in this hospital, which indicated that X-rays and CT scans were the best techniques for metastasis detection, since they identified this condition successfully in 80% and 93.5% of the cases, respectively. Moreover, we concluded that the blood test was very inaccurate, since it recognized the presence of metastasis in only 40% of the cases and failed in the rest. Consequently, we suggest replacing this test in order to save money and time and to avoid the misdiagnosis of cutaneous melanoma metastasis. Finally, we built a predictive model to forecast the time it takes for metastasis to occur, based on Breslow’s depth. This tool could be used not only for improving appointment scheduling in the dermatology section, but also for detecting metastasis sooner.

Iker Malaina, Leire Legarreta, Maria Dolores Boyano, Jesus Gardeazabal, Carlos Bringas, Luis Martinez, Ildefonso Martinez de la Fuente

### Graph Theory Based Classification of Brain Connectivity Network for Autism Spectrum Disorder

Connections in the human brain can be examined efficiently using brain imaging techniques such as Diffusion Tensor Imaging (DTI) and resting-state fMRI. Brain connectivity networks are constructed using image processing and statistical methods; these networks explain how brain regions interact with each other. Brain networks can be used to train machine learning models that can help in the diagnosis of neurological disorders. In this study, two types (DTI, fMRI) of brain connectivity networks are examined to retrieve graph theory based knowledge and feature vectors of samples. The classification model is developed by integrating three machine learning algorithms with a naïve voting scheme. The proposed model is evaluated on the brain connectivity samples of patients with Autism Spectrum Disorder. When the classification model is compared with another state-of-the-art study, the proposed method outperforms it. Thus, graph-based measures computed on brain connectivity networks might help to improve the diagnostic capability of in-silico methods. This study introduces a graph theory based classification model for diagnostic purposes that can be easily adapted for different neurological diseases.

Ertan Tolan, Zerrin Isik
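
One simple reading of a "naïve voting scheme" over three classifiers is a per-sample majority vote, sketched below. The abstract does not specify the scheme, so this interpretation, the function name `majority_vote`, and the label strings in the example are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier label lists into one label per sample.

    predictions[i][j] is the label that classifier i assigns to sample j.
    On a tie, the label predicted by the earliest classifier wins, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    combined = []
    for sample_labels in zip(*predictions):
        label, _count = Counter(sample_labels).most_common(1)[0]
        combined.append(label)
    return combined


if __name__ == "__main__":
    votes = [
        ["ASD", "control"],   # hypothetical output of classifier 1
        ["ASD", "ASD"],       # hypothetical output of classifier 2
        ["control", "control"],  # hypothetical output of classifier 3
    ]
    print(majority_vote(votes))
```

With three voters and two classes a tie cannot occur, so the tie rule only matters if the number of classifiers or classes changes.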

### Detect and Predict Melanoma Utilizing TCBR and Classification of Skin Lesions in a Learning Assistant System

In this paper, case-based reasoning is used as a problem-solving method in the development of DePicT Melanoma CLASS, a textual case-based system that detects and predicts melanoma using text information and image classification. Each case contains a disease description and possible recommendations as references (images or texts). The case description has an image gallery and a word association profile, which holds the association strengths between the stages/types of melanoma and their symptoms and characteristics (keywords from text references). In the retrieval and reuse process, the requested problem, treated as a new incoming case, is first matched against all collected cases; the solution of the most similar case is then selected and recommended to users. Support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are also used with the extracted features of skin lesions. A region growing method, initialized with seed points, is applied for the segmentation. DePicT Melanoma CLASS is tested on sample texts and 400 images from the ISIC archive dataset, including two classes of melanoma, and it achieves 63% accuracy for the overall system.

Sara Nasiri, Matthias Jung, Julien Helsper, Madjid Fathi

### On the Use of Betweenness Centrality for Selection of Plausible Trajectories in Qualitative Biological Regulatory Networks

The qualitative modeling approach is widely used to study the behavior of Biological Regulatory Networks. The approach uses directed graphs, also called state graphs, to represent system dynamics. As the number of genes increases, the complexity of the state graph increases exponentially. The identification of important trajectories and the isolation of more probable dynamics from less significant ones constitute an important problem in qualitative modeling of biological networks. In this work, we implement a parallel approach for the identification of important dynamics in qualitative models. Our implementation uses the concept of betweenness centrality. For parallelization, we used the Java-based library MPJ Express. We evaluate the performance of our implementation on the well-known case study of bacteriophage lambda. We demonstrate the effectiveness of our implementation by selecting important trajectories and correlating them with experimental data.
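
For readers unfamiliar with the measure, the following is a minimal sequential sketch of betweenness centrality on an unweighted directed graph, using Brandes' accumulation scheme. It is written in Python for brevity and is not the authors' parallel MPJ Express implementation; the adjacency-list input format and the function name are assumptions, and the graph is assumed to have no duplicate edges.

```python
from collections import deque
from typing import Dict, Hashable, List


def betweenness_centrality(graph: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, float]:
    """Unnormalized betweenness of every node in a directed, unweighted graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    score = {v: 0.0 for v in nodes}

    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Back-propagate pair dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


if __name__ == "__main__":
    print(betweenness_centrality({"a": ["b"], "b": ["c"], "c": []}))
```

Each iteration of the outer loop over source nodes is independent of the others, which is one natural place to distribute the work across processes.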