
## About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 14th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, CIBB 2017, held in Cagliari, Italy, in September 2017.
The 19 revised full papers presented were carefully reviewed and selected from 44 submissions. The papers deal with the application of computational intelligence to open problems in bioinformatics, biostatistics, systems and synthetic biology, medical informatics, and computational approaches to life sciences in general.

## Table of contents

### An Open-Source Tool for Managing Time-Evolving Variant Annotation

Abstract
During the past decade, genomics has drawn increasing attention, thanks to the introduction of fast and accurate sequencing strategies. Data accumulate quickly, and the amount of information to be managed and integrated is snowballing. While new variants are discovered every day, we still do not know enough about the human genome to fully understand all the implications they could have from a clinical point of view. When inherited diseases are considered, the clinical classification of variants may change over time as new discoveries are made. In this scenario, software solutions that help operators analyze and maintain constantly changing genomic data are relevant to modern molecular medicine. In this paper we present GLIMS (short for Genomics Laboratory Information Management System), an open-source laboratory information management system for genomic data that makes it possible to deal with time-evolving variant annotations. This solution answers the need of genomic laboratories to keep their knowledge about variants and annotations current, so as to provide patients with up-to-date reports. We illustrate the architecture of the GLIMS modules that are in charge of keeping the database of variants updated and reclassifying patients’ variants. Then, we demonstrate (using GLIMS) that variant clinical classifications are changing rapidly even in ClinVar, one of the best-known and most cited genomic databases, thus underlining the need for a tool that tracks changes over time.
Ilio Catallo, Eleonora Ciceri, Stefania Stenirri, Stefania Merella, Alberto Sanna, Maurizio Ferrari, Paola Carrera, Sauro Vicini

### Extracting Few Representative Reconciliations with Host Switches

Abstract
Phylogenetic tree reconciliation is the approach commonly used to investigate the coevolution of sets of organisms such as hosts and symbionts. Given a phylogenetic tree for each such set, respectively denoted by H and S, together with a mapping $$\phi$$ of the leaves of S to the leaves of H, a reconciliation is a mapping $$\varrho$$ of the internal vertices of S to the vertices of H which extends $$\phi$$ with some constraints.
Given a cost for each reconciliation, the number of most parsimonious reconciliations can be huge, even exponential in the size of the trees. Without further information, any biological interpretation of the underlying coevolution would require that all optimal solutions be enumerated and examined. This is, however, impossible without some sort of high-level view of the situation. One approach would be to extract a small number of representatives, based on some notion of similarity or of equivalence between the reconciliations.
In this paper, we define two equivalence relations that allow one to identify many reconciliations with a single one, thereby reducing their number. Extensive experiments indicate that the number of output solutions greatly decreases in general. By how much clearly depends on the constraints that are given as input.
Mattia Gastaldello, Tiziana Calamoneri, Marie-France Sagot

### A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction

Abstract
The rapidly growing volume of genomic data, including pathogens, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g. among viral outbreaks, is an important characterization of the life of a virus in its biological reservoir.
In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score to quantify the probability that a given alignment was not better than chance, and applied it to Human Papillomavirus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences was reduced from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers, indicating that k-mers provide sufficient specificity to characterize differences in sequences by their k-mer frequencies, allowing distances based on the k-mer frequencies to proxy for evolutionary distance. We constructed phylogenies for mtDNA coding sequences and for Ebola. Primate mtDNA coding region k-mer UPGMA phylogenies reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant ($$\le 1\times 10^{-5}$$). We characterized differences in selection pressure between coding and non-coding regions, and of selection in early cell cycle vs. late genes in Ebola. Coding versus non-coding regions showed evidence of purifying selection, while the early vs. late cell cycle proteins showed differences, with late cycle proteins resembling an influenza-like immunological response, noting that the g-proteins are among the late genes.
Filippo Utro, Daniel E. Platt, Laxmi Parida
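
The idea of using k-mer frequency distances as a proxy for evolutionary distance, as described in the abstract above, can be illustrated with a minimal sketch. The function names `kmer_frequencies` and `kmer_distance` are hypothetical, and the Euclidean distance is an assumption made for illustration; the abstract does not state which distance the authors used.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def kmer_distance(seq_a, seq_b, k=3):
    """Euclidean distance between two k-mer frequency vectors (assumed metric)."""
    fa = kmer_frequencies(seq_a, k)
    fb = kmer_frequencies(seq_b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

A matrix of such pairwise distances could then be passed to a clustering method such as UPGMA, which the abstract mentions for the primate mtDNA phylogenies.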

### Cancer Mutational Signatures Identification with Sparse Dictionary Learning

Abstract
Somatic DNA mutations are a characteristic of cancerous cells and are usually key in the origin and development of cancer. In the last few years, somatic mutations have been studied in order to understand which processes or conditions may generate them, with the purpose of developing prevention and treatment strategies. In this work we propose a novel sparse regularised method that aims at extracting mutational signatures from somatic mutations. We developed a pipeline that extracts the dataset from raw data and performs the analysis, returning the signatures and their relative usage frequencies. A thorough comparison between our method and the state-of-the-art procedure reveals that our pipeline can be used as an alternative without losing information, possibly gaining interpretability and precision.
Veronica Tozzo, Annalisa Barla

### Icing: Large-Scale Inference of Immunoglobulin Clonotypes

Abstract
Immunoglobulin (IG) clonotype identification is a fundamental open question in modern immunology. An accurate description of the IG repertoire is crucial to understand the variety within the immune system of an individual, potentially shedding light on the pathogenetic process. Intrinsic IG heterogeneity makes clonotype inference an extremely challenging task, both from a computational and a biological point of view. Here we present icing, a framework that makes it possible to reconstruct clonal families even in the case of highly mutated sequences. icing has a modular structure, and it is designed to be used with large next generation sequencing (NGS) datasets, a technology which allows the characterisation of large-scale IG repertoires. We extensively validated the framework with clustering performance metrics on the results in a simulated case. icing is implemented in Python, and it is publicly available under the FreeBSD licence at https://github.com/slipguru/icing.
Federico Tomasi, Margherita Squillario, Alessandro Verri, Davide Bagnara, Annalisa Barla

### Adenine: A HPC-Oriented Tool for Biological Data Exploration

Abstract
adenine is a machine learning framework designed for biological data exploration and visualization. Its goal is to help bioinformaticians achieve a first and quick overview of the main structures underlying their data. This software tool encompasses state-of-the-art techniques for missing value imputation, data preprocessing, dimensionality reduction and clustering. adenine has a scalable architecture which works seamlessly on single workstations as well as on high-performance computing facilities. adenine is capable of generating publication-ready plots along with quantitative descriptions of the results. In this paper we provide an example of exploratory analysis on a publicly available gene expression data set of colorectal cancer samples. The software and its documentation are available at https://github.com/slipguru/adenine under the FreeBSD license.
Samuele Fiorini, Federico Tomasi, Margherita Squillario, Annalisa Barla

### Disease–Genes Must Guide Data Source Integration in the Gene Prioritization Process

Abstract
One of the main issues in detecting the genes involved in the etiology of genetic human diseases is the integration of different types of available functional relationships between genes. Numerous approaches exploited the complementary evidence coded in heterogeneous sources of data to prioritize disease-genes, such as functional profiles or expression quantitative trait loci, but none of them to our knowledge posed the scarcity of known disease-genes as a feature of their integration methodology. Nevertheless, in contexts where data are unbalanced, that is, where one class is largely under-represented, imbalance-unaware approaches may suffer a strong decrease in performance. We claim that imbalance-aware integration is a key requirement for boosting performance of gene prioritization (GP) methods. To support our claim, we propose an imbalance-aware integration algorithm for the GP problem, and we compare it on benchmark data with other state-of-the-art integration methodologies.
Marco Frasca, Jean Fred Fontaine, Giorgio Valentini, Marco Mesiti, Marco Notaro, Dario Malchiodi, Miguel A. Andrade-Navarro

### Ensembling Descendant Term Classifiers to Improve Gene - Abnormal Phenotype Predictions

Abstract
The Human Phenotype Ontology (HPO) provides a standard categorization of the phenotypic abnormalities encountered in human diseases and of the semantic relationships between them. Quite surprisingly, the problem of the automated prediction of the association between genes and abnormal human phenotypes has been widely overlooked, even though this issue represents an important step toward the characterization of gene-disease associations, especially when no or very limited knowledge is available about the genetic etiology of the disease under study. We present a novel ensemble method able to capture the hierarchical relationships between HPO terms, and able to improve existing hierarchical ensemble algorithms by explicitly considering the predictions of the descendant terms of the ontology. In this way the algorithm exploits the information embedded in the most specific ontology terms that closely characterize the phenotypic information associated with each human gene. Genome-wide results obtained by integrating multiple sources of information show the effectiveness of the proposed approach.
Marco Notaro, Max Schubach, Marco Frasca, Marco Mesiti, Peter N. Robinson, Giorgio Valentini

### GP-Based Grammatical Inference for Classification of Amyloidogenic Sequences

Abstract
In this paper, several methods for the grammar induction problem are examined in the context of biological sequence analysis. In addition, a new method which generates noncircular context-free grammars is proposed. A computational experiment shows that the proposed evolutionary-inspired approach statistically outperforms other grammatical inference algorithms, with respect to classification quality, on the sequences from a real amyloidogenic dataset.
Wojciech Wieczorek, Olgierd Unold

### Estimation of Kinetic Reaction Constants: Exploiting Reboot Strategies to Improve PSO’s Performance

Abstract
The simulation and analysis of mathematical models of biological systems require a complete knowledge of the reaction kinetic constants. Unfortunately, these values are often difficult to measure, but they can be inferred from experimental data in a process known as Parameter Estimation (PE). In this work, we tackle the PE problem using Particle Swarm Optimization (PSO) coupled with three different reboot strategies, which aim to reinitialize particle positions to avoid local optima. In particular, we highlight the better performance of PSO coupled with the reboot strategies with respect to standard PSO. Finally, since the PE requires a huge number of simulations at each iteration of PSO, we exploit cupSODA, a GPU-powered deterministic simulator, which performs all simulations and fitness evaluations in parallel.
Simone Spolaor, Andrea Tangherloni, Leonardo Rundo, Paolo Cazzaniga, Marco S. Nobile
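
To make the idea of coupling PSO with a reboot strategy concrete, here is a minimal sketch. The abstract above does not describe the three reboot strategies, so the rule used here (re-initialize a particle whose personal best has not improved for `patience` iterations) is an assumption chosen for illustration, as are the function name `pso_with_reboot` and all parameter values. In a real parameter estimation setting, `fitness` would compare simulated and measured dynamics; here it is any callable to be minimized.

```python
import random


def pso_with_reboot(fitness, bounds, n_particles=20, n_iter=200,
                    patience=15, w=0.729, c1=1.49445, c2=1.49445, seed=0):
    """Minimize `fitness` over box `bounds`; reboot stalled particles (assumed rule)."""
    rng = random.Random(seed)
    dim = len(bounds)

    def random_position():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    pos = [random_position() for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    stall = [0] * n_particles
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    history = [gbest_val]  # best fitness after each iteration

    for _ in range(n_iter):
        for i in range(n_particles):
            # Standard velocity and position update, clipped to the bounds.
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i], stall[i] = pos[i][:], val, 0
            else:
                stall[i] += 1
            if val < gbest_val:
                gbest, gbest_val = pos[i][:], val
            if stall[i] >= patience:
                # Reboot: new random position, zero velocity, fresh personal best.
                pos[i] = random_position()
                vel[i] = [0.0] * dim
                pbest[i], pbest_val[i] = pos[i][:], fitness(pos[i])
                stall[i] = 0
                if pbest_val[i] < gbest_val:
                    gbest, gbest_val = pbest[i][:], pbest_val[i]
        history.append(gbest_val)
    return gbest, gbest_val, history
```

The global best is stored separately from the particles, so rebooting a stalled particle never discards the best solution found so far.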

### Haplotype and Repeat Separation in Long Reads

Abstract
Resolving the correct structure and succession of highly similar sequence stretches is one of the main open problems in genome assembly. For non-haploid genomes this includes determining the sequences of the different haplotypes. For all but the smallest genomes it also involves separating different repeat instances. In this paper we discuss methods for resolving such problems in third-generation long reads by classifying alignments between long reads according to whether they represent true or false read overlaps. The main problem in this context is the high error rate found in such reads, which greatly exceeds the variance between the similar regions we want to separate. Our methods can separate read classes stemming from regions with as little as $$1\%$$ difference.
German Tischler-Höhle

### Tumor Subclonal Progression Model for Cancer Hallmark Acquisition

Abstract
Recent advances in the methods for reconstruction of cancer evolutionary trajectories opened up the prospects of deciphering the subclonal populations and their evolutionary architectures within cancer ecosystems. An important challenge of the cancer evolution studies is how to connect genetic aberrations in subclones to a clinically interpretable and actionable target in the subclones for individual patients. In this study, our aim is to develop a novel method for constructing a model of tumor subclonal progression in terms of cancer hallmark acquisition using multiregional sequencing data. We prepare a subclonal evolutionary tree inferred from variant allele frequencies and estimate pathway alteration probabilities from large-scale cohort genomic data. We then construct an evolutionary tree of pathway alterations that takes into account selectivity of pathway alterations via selectivity score. We show the effectiveness of our method on a dataset of clear cell renal cell carcinomas.
Yusuke Matsui, Satoru Miyano, Teppei Shimamura

### GIMLET: Identifying Biological Modulators in Context-Specific Gene Regulation Using Local Energy Statistics

Abstract
The regulation of transcription factor activity dynamically changes across cellular conditions and disease subtypes. The identification of biological modulators contributing to context-specific gene regulation is one of the challenging tasks in systems biology, which is necessary to understand and control cellular responses across different genetic backgrounds and environmental conditions. Previous approaches for identifying biological modulators from gene expression data were restricted to capturing a particular type of three-way dependency among a regulator, its target gene, and a modulator; these methods cannot describe complex regulation structures, such as when multiple regulators, their target genes, and modulators are functionally related. Here, we propose a statistical method for identifying biological modulators by capturing multivariate local dependencies, based on energy statistics, which is a class of statistics based on distances. Subsequently, our method assigns a measure of statistical significance to each candidate modulator through a permutation test. We compared our approach with that of a leading competitor for identifying modulators, and illustrated its performance through both simulations and real data analysis. Our method, entitled genome-wide identification of modulators using local energy statistical test (GIMLET), is implemented in R ($$\ge$$3.2.2) and is available from GitHub (https://github.com/tshimam/GIMLET).
Teppei Shimamura, Yusuke Matsui, Taisuke Kajino, Satoshi Ito, Takashi Takahashi, Satoru Miyano
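
The permutation test mentioned in the abstract above follows a general recipe that can be sketched independently of GIMLET. The sketch below is not the GIMLET implementation (which is written in R): the test statistic used here, the absolute Pearson correlation, is only a stand-in for the local energy statistic, and the names `abs_pearson` and `permutation_p_value` are hypothetical.

```python
import math
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation; a stand-in statistic, not an energy statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_p_value(x, y, statistic=abs_pearson, n_perm=999, seed=0):
    """Estimate how often a shuffled pairing gives a statistic at least as large."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the pairing between x and y
        if statistic(x, y_perm) >= observed:
            hits += 1
    # The +1 terms count the observed pairing itself, so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)
```

Any dependence measure with the same call signature can be passed as `statistic`, which is why a distance-based statistic could be substituted without changing the test itself.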

### Structural Features of a DPPG Liposome Layer Adsorbed on a Rough Surface

Abstract
The development of drug delivery systems, sensors and other devices based on liposomes (small unilamellar lipid vesicles, SUVs) requires the adsorption of intact lipid structures onto solid surfaces in the first place. In this work, we report on the in situ investigation of the adsorption of liposomes of 1,2-dipalmitoyl-sn-glycero-3-[phospho-rac-(1-glycerol)] (sodium salt) (DPPG) onto a rough surface by neutron reflectivity. Rough surfaces are achieved by preparing polyelectrolyte layer-by-layer films, which act as soft polymer cushions. Neutron reflectivity measurements performed at the solid/D2O interface allow for the determination of the thickness of the adsorbed structures. The investigation proves that the liposomes dispersed in the liquid phase are generally adsorbed intact onto the cushion surface, confirming that the roughness of the latter is a variable to be taken into account if one intends to adsorb intact lipid structures. Liposome flattening is observed and explained by the attractive electrostatic interactions occurring between the negatively charged liposomes and the outermost, positively charged polyelectrolyte layer of the cushion. The measurements further demonstrate that the adsorbed liposomes are stable for several hours. These findings are fundamental for the development of devices based on immobilized but intact SUVs on sensor surfaces.
Maria Raposo, Andreia A. Duarte, Paulo J. Gomes, Paulo A. Ribeiro, Marli L. Moraes, Roland Steitz

### Chemical Exchanges and Actuation in Liposome-Based Synthetic Cells: Interaction with Biological Cells

Abstract
The development of new synthetic biology frontiers has led to scenarios where the embodied information-processing capability of biological organisms is implanted, in a minimalistic version, in liposome-based synthetic cells. These are cell-like systems of minimal complexity resembling biological cells. Although not yet alive, synthetic cells are useful for generating basic biological understanding, and can become interesting biotechnological tools. In 2012 we devised a research program aimed at the design and construction of synthetic cells capable of exchanging chemical signals with biological micro-organisms (in particular bacteria). Here we review the fundamental steps leading to this innovative research field and comment on the most relevant experimental results obtained by us and others.
Giordano Rampioni, Francesca D’Angelo, Alessandro Zennaro, Livia Leoni, Pasquale Stano

### A Nano Communication System for CTC Detection in Blood Vessels

Abstract
In this paper, we show a simulation scenario of a short section of a blood vessel, in which white blood cells, red blood cells, and platelets move as a consequence of collisions and the Hagen–Poiseuille law. In addition to these cells, we also consider the presence of circulating tumor cells (CTC) and of a receiver node that is able to detect the presence of CTC by using its surface receptors, which have affinity for the ligands present on the CTC surface.
This study aims at identifying potential optimal positions of CTC sensors within blood vessels in order to maximize the probability of a successful detection.
A simulation campaign has been performed with the BiNS2 simulation framework for several distances of the receiver node from the vessel axis. The results obtained show that CTCs tend to move towards the endothelium.
Luca Felicetti, Mauro Femminella, Gianluca Reali
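
For readers unfamiliar with the Hagen–Poiseuille law mentioned in the abstract above, the following sketch evaluates the textbook parabolic velocity profile for steady laminar flow of a Newtonian fluid in a rigid cylindrical tube. It is not taken from BiNS2; the function names are hypothetical, and treating blood as a Newtonian fluid is a simplifying assumption.

```python
import math


def poiseuille_velocity(r, radius, delta_p, viscosity, length):
    """Axial velocity at radial distance r from the tube axis (SI units)."""
    if not 0.0 <= r <= radius:
        raise ValueError("r must lie between 0 and the tube radius")
    # v(r) = delta_p * (R^2 - r^2) / (4 * mu * L): fastest on the axis, zero at the wall.
    return delta_p * (radius ** 2 - r ** 2) / (4.0 * viscosity * length)


def poiseuille_flow_rate(radius, delta_p, viscosity, length):
    """Volumetric flow rate Q = pi * delta_p * R^4 / (8 * mu * L)."""
    return math.pi * delta_p * radius ** 4 / (8.0 * viscosity * length)
```

Because the velocity falls to zero at the wall, the distance of a receiver from the vessel axis determines the local flow speed around it, which is one reason that distance is a natural parameter to vary in such a simulation.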

### Experimental Evidences Suggest High Between-Vesicle Diversity of Artificial Vesicle Populations: Results, Models and Implications

Abstract
In recent years, artificial cellular models for origins-of-life research and synthetic biology have been extensively studied. To this end, solute-filled lipid vesicles (liposomes) are widely used. Considerable evidence has been collected about the capture of water-soluble chemicals, the mechanism of vesicle self-reproduction, and the course of (bio)chemical reactions in the vesicle lumen. Among the several fascinating questions which emerged from these studies, here we focus on a peculiar feature, namely, the fact that a spontaneous heterogeneity of vesicle structure often emerges. In other words, vesicle populations created in the laboratory by classical batch methods include very ‘diverse’ vesicles with respect to size, morphology, and, importantly, solute content. The consequences of this between-vesicle diversity are briefly discussed.
Pasquale Stano, Roberto Marangoni, Fabio Mavelli

### Towards the Synthesis of Photo-Autotrophic Protocells

Abstract
In this contribution we discuss possible strategies to synthesize photo-autotrophic artificial protocells from scratch, following the semi-synthetic bottom-up approach. The main aim is to build artificial compartmentalized systems able to mimic living cell behavior in the transduction of light energy into chemical energy. Some preliminary results and future perspectives are presented and discussed.
Emiliano Altamura, Paola Albanese, Roberto Marotta, Pasquale Stano, Francesco Milano, Massimo Trotta, Fabio Mavelli

### Hierarchical Block Matrix Approach for Multi-view Clustering

Abstract
Scientists are facing two important challenges when investigating life processes. First, biological systems, from gene regulation to physiological mechanisms, are inherently multiscale. Second, complex disease data collection is an expensive process, and yet the analyses are presented in a rather empirical and sometimes simplistic way, completely missing the opportunity of uncovering patterns of predictive relationships and meaningful profiles. In this work, we propose a multi-view clustering methodology that, although quite general, could be used to identify patient subgroups, for different omic information, by studying the hierarchical structures of the patient data in each view and merging their topologies. We first demonstrate the ability of our method to identify hierarchical structures in synthetic data sets and then apply it to real multi-view multi-omic data sets. Our results, although preliminary, suggest that this methodology outperforms single-view clustering approaches and could open several directions for improvements.
Angela Serra, Maria Domenica Guida, Pietro Lió, Roberto Tagliaferri

### Backmatter
