2017 | Book

11th International Conference on Practical Applications of Computational Biology & Bioinformatics

Edited by: Florentino Fdez-Riverola, Mohd Saberi Mohamad, Miguel Rocha, Juan F. De Paz, Tiago Pinto

Publisher: Springer International Publishing

Book series: Advances in Intelligent Systems and Computing

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Biological and biomedical research are increasingly driven by experimental techniques that challenge our ability to analyse, process and extract meaningful knowledge from the underlying data. The impressive capabilities of next-generation sequencing technologies, together with novel and constantly evolving, distinct types of omics data technologies, have created an increasingly complex set of challenges for the growing fields of Bioinformatics and Computational Biology.
The analysis of the datasets produced and their integration call for new algorithms and approaches from fields such as Databases, Statistics, Data Mining, Machine Learning, Optimization, Computer Science and Artificial Intelligence. Clearly, Biology is more and more a science of information and requires tools from the computational sciences. In the last few years, we have seen the rise of a new generation of interdisciplinary scientists with a strong background in the biological and computational sciences.
In this context, the interaction of researchers from different scientific fields is, more than ever, of foremost importance in boosting the research efforts in the field and contributing to the education of a new generation of Bioinformatics scientists. The PACBB’17 conference was intended to contribute to this effort and promote this fruitful interaction, with a technical program that included 39 papers spanning many different sub-fields in Bioinformatics and Computational Biology. Further, the conference promoted the interaction of scientists from diverse research groups and with distinct backgrounds (computer scientists, mathematicians, biologists).

Table of Contents

Frontmatter

S2P: A Desktop Application for Fast and Easy Processing of 2D-Gel and MALDI-Based Mass Spectrometry Protein Data

2D-gel electrophoresis is widely used in combination with MALDI-TOF mass spectrometry to analyse the proteome of biological samples. It can be used to discover proteins that are differentially expressed between two groups (e.g. two disease conditions), thus yielding a set of potential biomarkers. Biomarker discovery requires a lot of data processing to prepare data for analysis or to merge data from different sources. This kind of work is usually done manually, which is highly time-consuming and distracts the operator or researcher from other important tasks. Moreover, carrying out this repetitive process manually rather than automatically is error-prone, affecting reliability and reproducibility. To overcome these drawbacks, S2P, an AIBench-based multiplatform desktop application, has been specifically created to process 2D-gel and MALDI-mass spectrometry protein identification-based data in a computer-aided manner. S2P is open source and free to all users at http://www.sing-group.org/s2p.

Hugo López-Fernández, Jose E. Araújo, Daniel Glez-Peña, Miguel Reboiro-Jato, Florentino Fdez-Riverola, José L. Capelo-Martínez

Multi-Enzyme Pathway Optimisation Through Star-Shaped Reachable Sets

This article studies the time evolution of multi-enzyme pathways. The non-linearity of the problem coupled with the infinite dimensionality of the time-dependent input usually results in a rather laborious optimization. Here we discuss how the optimization of the input enzyme concentrations might be efficiently reduced to a calculation of reachable sets. Under some general conditions, the original system has star-shaped reachable sets that can be derived by solving a partial differential equation. This method allows a thorough study and optimization of quite sophisticated enzymatic pathways with non-linear dynamics and possible inhibition. Moreover, optimal control synthesis based on reachable sets can be implemented and was tested on several simulated examples.

Stanislav Mazurenko, Jiri Damborsky, Zbynek Prokop

Automated Collection and Sharing of Adaptive Amino Acid Changes Data

When changes at a few amino acid sites are the target of selection, adaptive amino acid changes in protein sequences can be identified using maximum-likelihood methods based on models of codon substitution (such as codeml). Such methods have been used numerous times on a variety of organisms, but the time needed to collect the data and prepare the input files means that usually only tens or a couple of hundred coding regions are analyzed. The recent availability of flexible and easy-to-use computer applications to collect the relevant data (such as BDBM) and to infer positively selected amino acid sites (such as ADOPS) makes the whole process easier and quicker than before, but the lack of a batch option in ADOPS, which is reported here, still precluded the analysis of hundreds or thousands of sequence files. Given the interest in and possibility of running such large-scale projects, we also developed a database where ADOPS projects can be stored. Therefore, here we also present B+, which is both a data repository and a convenient interface for looking at the information contained in ADOPS projects without the need to download and unzip the corresponding ADOPS project file. The ADOPS projects available at B+ can also be downloaded, unzipped, and opened using the ADOPS graphical interface. The availability of such a database ensures the repeatability of results, promotes data reuse with significant savings in the time needed for preparing datasets, and allows effortless further exploration of the data contained in ADOPS projects.

Noé Vázquez, Cristina P. Vieira, Bárbara S. R. Amorim, André Torres, Hugo López-Fernández, Florentino Fdez-Riverola, José L. R. Sousa, Miguel Reboiro-Jato, Jorge Vieira

ROC632: An Overview

The present paper aims to analyze and explore the ROC632 package, specifying its main characteristics and functions. More specifically, the goal of this study is the evaluation of the effectiveness of the package and its strengths and weaknesses. This package was created in order to overcome the lack of information concerning incomplete time-to-event data, adapting the 0.632+ bootstrap estimator for the evaluation of time dependent ROC curves. By applying this package to a specific dataset (DLBCLpatients), it becomes possible to assess tangible data, determining if it is able to analyze complete and incomplete data efficiently and without bias.

Catarina Santos, Ana Cristina Braga

Processing 2D Gel Electrophoresis Images for Efficient Gaussian Mixture Modeling

In modern molecular biology, the most commonly used method to separate the proteins present in a complex sample is two-dimensional gel electrophoresis. Unfortunately, the quality of the gel image is reduced by the presence of a non-linear background signal, spikes, streaks and other artefacts. The main components of a gel image are protein spots. To properly distinguish spots, especially in overlapping regions, mixture modeling can be performed. Because of the many signal impurities, the estimation of model parameters is inadequate. In this study, using two fragments of a real gel image and a set of synthetic data, three background correction methods were compared in combination with four image filtering methods, and the quality of spot detection based on mixture modeling was checked. The presented results show that efficient modeling of 2D gel electrophoresis images must be preceded by proper background correction and noise filtering. A two-step Otsu algorithm was the best method for removing the background signal. No single filtering method was clearly best, but 2D matched filtering gave good results regardless of the background correction method used.

Michal Marczyk
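The Otsu thresholding mentioned in the abstract above can be illustrated with a minimal sketch. This is the plain single-pass Otsu method on integer intensities, not the two-step variant the paper evaluates, whose details are not given here. The function name `otsu_threshold` and the `levels` parameter are illustrative assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximises between-class variance.

    Pixels with intensity <= t form one class (e.g. background),
    the rest form the other. Plain Otsu, not the paper's two-step variant.
    """
    # Build the intensity histogram.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the lower class
    sum0 = 0    # sum of intensities in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break               # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Toy example: a dark background at intensity 10 and bright spots at 200.
threshold = otsu_threshold([10] * 50 + [200] * 50)
```

Ties are resolved in favour of the lowest threshold, so on the toy data any value between the two intensity groups would separate them equally well.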

Improving Document Prioritization for Protein-Protein Interaction Extraction Using Shallow Linguistics and Word Embeddings

Understanding biological processes, for example those associated with disease or pharmacological action, requires the analysis of large amounts of interconnected information. Protein interaction networks form part of this puzzle, and extracting this information from the scientific literature is an important but challenging task. In this work, we present a supervised classification approach for identifying and ranking literature documents that contain information regarding protein interactions. We studied the use of word embeddings together with simple chunking features, and show that the combination of these features with baseline bag-of-words features can lead to similar or even improved results when compared to the use of features based on deep linguistic parsing. When applied to the BioCreative III Article Classification Task dataset, our approach achieves an area under the precision-recall curve of 0.70 and a Matthews correlation coefficient of 0.56.

Sérgio Matos
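For readers unfamiliar with the Matthews correlation coefficient reported in the abstract above, the following minimal sketch computes it from binary confusion-matrix counts using the standard formula. The function name and the convention of returning 0.0 for an undefined denominator are assumptions for illustration, not taken from the paper.

```python
import math


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


perfect = matthews_corrcoef(tp=50, fp=0, tn=50, fn=0)
chance = matthews_corrcoef(tp=25, fp=25, tn=25, fn=25)
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, which makes it informative on imbalanced document collections.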

K-Means Clustering with Infinite Feature Selection for Classification Tasks in Gene Expression Data

In bioinformatics and clinical research, microarray technology has been widely used to distinguish between normal and tumour samples in cancer datasets. However, the high dimensionality of gene expression data affects the classification accuracy of an experiment. Thus, feature selection is needed to select informative genes and remove non-informative genes. Some feature selection methods, however, ignore the interaction between genes. Therefore, similar genes are clustered together and dissimilar genes are clustered in other groups. To provide higher classification accuracy, this research proposed k-means clustering and infinite feature selection for identifying informative genes in the selected subset. The approach has been applied to colorectal cancer and small round blue cell tumor datasets, and it obtained higher classification accuracy than previous work.

Muhammad Akmal Remli, Kauthar Mohd Daud, Hui Wen Nies, Mohd Saberi Mohamad, Safaai Deris, Sigeru Omatu, Shahreen Kasim, Ghazali Sulong

Classification of Colorectal Cancer Using Clustering and Feature Selection Approaches

Accurate cancer classification and prediction of responses to treatment are important in clinical cancer research, since cancer acts as a family of gene-based diseases. Microarray technology has been widely developed to measure changes in gene expression levels under normal and experimental conditions. Gene expression data are normally high dimensional and characterized by small sample sizes. Thus, feature selection is needed to find the smallest number of informative genes and to improve the classification accuracy and the biological interpretability of the results. Because some feature selection methods neglect the interactions among genes, clustering is used to group similar genes together. In addition, the quality of the selected data can determine the effectiveness of the classifiers. This research proposed clustering and feature selection approaches to classify the gene expression data of colorectal cancer. A feature selection approach based on centroid clustering provides higher classification accuracy compared with other approaches.

Hui Wen Nies, Kauthar Mohd Daud, Muhammad Akmal Remli, Mohd Saberi Mohamad, Safaai Deris, Sigeru Omatu, Shahreen Kasim, Ghazali Sulong

Development of Text Mining Tools for Information Retrieval from Patents

Biomedical literature is composed of an ever-increasing number of publications in natural language. Patents are a relevant fraction of those, being important sources of information due to all the curated data from the granting process. However, their unstructured data makes searching for information a challenging task. To overcome this, biomedical text mining (BioTM) creates methodologies to search and structure that data. Several BioTM techniques can be applied to patents. Among those, Information Retrieval is the process by which relevant data is obtained from collections of documents. In this work, a patent pipeline was developed and integrated into @Note2, an open-source computational framework for BioTM. This integration allows further BioTM tools to be run over the patent documents, including Information Extraction processes such as Named Entity Recognition or Relation Extraction.

Tiago Alves, Rúben Rodrigues, Hugo Costa, Miguel Rocha

How Can Photo Sharing Inspire Sharing Genomes?

People are usually aware of the privacy risks of publishing photos online, but these risks are less evident when sharing human genomes. Modern photos and sequenced genomes are both digital representations of real lives. They contain private information that may compromise people’s privacy, and yet their highest value is most often achieved only when they are shared with others. In this work, we present an analogy between the privacy aspects of sharing photos and sharing genomes, which clarifies the privacy risks of the latter for the general public. Additionally, we illustrate an alternative informed model for sharing genomic data according to the privacy-sensitivity level of each portion. This article is a call to arms for collaborative work between geneticists and security experts to build more effective methods to systematically protect privacy, whilst promoting the accessibility and sharing of genomes.

Vinicius V. Cogo, Alysson Bessani, Francisco M. Couto, Margarida Gama-Carvalho, Maria Fernandes, Paulo Esteves-Verissimo

An App Supporting the Self-management of Tinnitus

Tinnitus is an annoying ringing in the ears, in varying shades and intensities. Tinnitus can affect a patient’s overall health and social well-being (e.g., sleep problems, trouble concentrating, anxiety, depression and inability to work). Usually, the diagnostic procedure for tinnitus passes through three steps, i.e., audiological examination, psychoacoustic measurement, and disability evaluation. All steps are performed by physicians, using dedicated hardware/software and administering questionnaires. The paper reports on the results of a project that has been running for one year and whose aim is to directly support patients in this diagnostic procedure. In particular, it reports on an Android app that controls an ad-hoc developed device and automates both the execution of the audiometric examinations and the administration of the questionnaires that measure the disability induced by the tinnitus.

Chamoso Pablo, De La Prieta Fernando, Eibenstein Alberto, Tizio Angelo, Vittorini Pierpaolo

Anthropometric Data Analytics: A Portuguese Case Study

Large amounts of information are systematically generated throughout the course of scientific research and progress. In our case, observations representing the Portuguese population within the central-southern region of Portugal were collected throughout various foetal autopsy procedures. Gestational age (GA) and the measured distances and weights of numerous anthropometric features and organs, respectively, were recorded per singleton (24 variables in total). This work seeks to elaborate on the accuracy of different foetal parameters in terms of GA estimation, making use of principal component analysis (PCA) and regression techniques. We created a dataset of 450 foetuses, ranging from 13 to 42 weeks of age, to compute both PCA and regression models. Initial exploratory analysis shed light on which variables are most explanatory in terms of foetal development, and are thus most likely suitable for predictive roles. We produced clusters of models, based on coefficient of determination (R2) values, by comparing the squared sum of residuals between models (significance level α = 0.05). Models comprising linear combinations of different variables exhibited significantly higher values of R2 (p-value ≤ 0.05) when compared to single-variable models. Across all regressions, crown-heel length (CHL), crown-rump length (CRL), and foot length (FL) are consistently present within the cluster of best predictors of gestational age. Depending on the type of regression analysis applied, body weight (Body) and hand length (HL) also fall into the same category.

António Barata, Lucília Carvalho, Francisco M. Couto

Reverse Inference in Symbolic Systems Biology

Cell dynamics is intrinsically concurrent, since many different biochemical reactions might take place simultaneously in a cell. Productive symbolic mathematical models of cell biology can be developed by modeling such biochemical reactions with rewrite rules. Analyses and predictions of biological facts can be obtained from such models. The authors have previously published several approaches for searching along cellular signaling networks. In this paper, we introduce a novel reverse inference system by applying narrowing techniques. Moreover, we propose a new general architecture which allows an extendible set of tools for direct and reverse inference by using rewriting logic.

Beatriz Santos-Buitrago, Adrián Riesco, Merrill Knapp, Gustavo Santos-García, Carolyn Talcott

Skin Temperature Monitoring to Avoid Foot Lesions in Diabetic Patients

Foot temperature monitoring is of great importance in diabetic patients, as they are prone to complications such as peripheral neuropathy and vascular insufficiency. In recent years, the study of different non-invasive procedures to monitor health indicators has been growing, due to advances in mobile devices, micro-sensors, and wireless sensors. Health monitoring systems are used by medical staff and also by patients when they are out of the hospital, in their personal environment. This paper presents preliminary work to identify the specific points on the feet where the temperature sensors should be positioned. We have carried out a statistical analysis of data obtained by a thermal camera from healthy people.

A. Queiruga-Dios, J. Bullón Pérez, A. Hernández Encinas, J. Martín-Vaquero, A. Martínez Nova, J. Torreblanca González

Multidimensional Feature Selection and Interaction Mining with Decision Tree Based Ensemble Methods

This paper demonstrates the capability of detecting strong synthetic benchmark feature interactions in a set of mixed categorical and continuous variables using a modified version of the Monte Carlo Feature Selection (MCFS) algorithm. MCFS’s original way of detecting feature interactions, which relies on analysing the structure of trained decision trees, is compared with our modified approach, which consists of a series of variable permutations combined with a decomposition of a feature’s total effect into a main effect and interaction effects. A comparison with unmodified MCFS, which by default handles only classification problems using C4.5 decision trees, shows that the new approach is slightly more robust. Furthermore, the decomposition approach is flexible, allowing different types of models to be plugged into MCFS. This opens a way to handle high-throughput supervised feature selection and interaction mining problems for classification, regression and censored survival decision vectors.

Lukasz Krol, Joanna Polanska

A Normalisation Strategy to Optimally Design Experiments in Computational Biology

In this work we describe a new methodology to improve the predictive capabilities of dynamic models when parameters differ by orders of magnitude. The main idea is to normalise the unknown model parameters before solving the classical problem of optimal experimental design based on the Fisher information matrix. The normalisation improves the relative confidence intervals of the estimated parameters and the conditioning of the Fisher matrix, especially for those criteria aiming to decorrelate the model parameters. Using the so-called core predictions, we show how the new approach improves the final model’s predictive capabilities in two respects: predictions are closer to the real dynamics and have better confidence intervals. We illustrate the concepts using two toy examples, linear and non-linear in their parameters. Finally, we test the performance of the normalisation in a model simulating the bacterial SOS response. This pathway remains highly relevant to work towards a predictive model of antimicrobial resistance.

Míriam R. García, Antonio A. Alonso, Eva Balsa-Canto

Mitosis Detection in Breast Cancer Using Superpixels and Ensemble Classifiers

Determining the severity and potential aggressiveness of breast cancer is an important step in determining the treatment options for a patient. Mitotic activity is one of the main components in breast cancer severity grading. Currently, mitosis counting is a laborious, error-prone task done manually by a pathologist. This paper presents a novel approach for automatic mitosis detection, where promising candidates are selected from a superpixel segmentation of the image and classified using an ensemble classifier created from a selection from a pool of different color spaces and different feature vectors.

César A. Ortiz Toro, Consuelo Gonzalo Martín, Angel García Pedrero, Alejandro Rodriguez Gonzalez, Ernestina Menasalvas

Reproducibility of Finding Enriched Gene Sets in Biological Data Analysis

The introduction of high-throughput measurement methods into molecular biology triggered the development of algorithms for detecting disorders in complex signalling systems, such as pathways or gene ontologies. In recent years many new solutions have appeared, but the results obtained with these techniques are ambiguous. In this work, five different algorithms for pathway enrichment analysis were compared using six microarray datasets covering cases with the same disease. The number of enriched pathways at different significance levels and the false positive rate of finding enriched pathways were estimated, and the reproducibility of the results between datasets was checked. The best performance was obtained for the PLAGE method. However, taking into consideration the biological knowledge about the analyzed disease condition, many findings may be false positives. Of the other methods, the GSVA algorithm gave the most reproducible results across the tested datasets, which was also validated in biological repositories. Similarly good outcomes were given by the GSEA method. ORA and PADOG gave poor sensitivity and reproducibility, which stands in contrast to previous research.

Joanna Zyla, Michal Marczyk, Joanna Polanska

Towards Trustworthy Predictions of Conversion from Mild Cognitive Impairment to Dementia: A Conformal Prediction Approach

Predicting progression from a stage of Mild Cognitive Impairment (MCI) to Alzheimer’s disease is a major pursuit in current dementia research. As a result, many prognostic models have emerged with the goal of supporting clinical decisions. Despite these efforts, the lack of a reliable assessment of the uncertainty of each prediction has hampered their application in practice. It is paramount for clinicians to know how much they can rely upon the prediction made for a given patient, in order to adjust treatments to the patient based on that information. In this exploratory study, we evaluated the Conformal Prediction approach on the task of making predictions with precise levels of confidence. Conformal prediction showed promising results. Using high confidence levels has the drawback of leaving a large number of MCI patients without a prognosis (the classifier is not confident enough to give a single class). When using forced predictions, conformal predictors achieved classification performances as good as standard classifiers, with the advantage of complementing each prediction with a confidence value.

Telma Pereira, Sandra Cardoso, Dina Silva, Alexandre de Mendonça, Manuela Guerreiro, Sara C. Madeira
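The trade-off described in the abstract above, between high confidence levels and patients left without a single-class prognosis, can be sketched with a small label-conditional conformal classifier. This is a generic illustration and not the authors' implementation: the nonconformity scores, the label names `converter` and `stable`, and all function names are assumptions.

```python
def conformal_p_value(calibration_scores, test_score):
    """Share of calibration nonconformity scores at least as large as
    the test score, counting the test example itself (+1 smoothing)."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def p_values(calibration_by_label, test_scores_by_label):
    """One p-value per candidate label (label-conditional calibration)."""
    return {
        label: conformal_p_value(calibration_by_label[label], score)
        for label, score in test_scores_by_label.items()
    }


def prediction_set(calibration_by_label, test_scores_by_label, confidence):
    """All labels that cannot be rejected at significance 1 - confidence."""
    epsilon = 1.0 - confidence
    pvals = p_values(calibration_by_label, test_scores_by_label)
    return {label for label, p in pvals.items() if p > epsilon}


def forced_prediction(calibration_by_label, test_scores_by_label):
    """Single label with the largest p-value, regardless of confidence."""
    pvals = p_values(calibration_by_label, test_scores_by_label)
    return max(pvals, key=pvals.get)


# Hypothetical nonconformity scores (higher = less typical for the label).
calibration = {
    "converter": [0.1, 0.2, 0.3, 0.4],
    "stable": [0.1, 0.2, 0.3, 0.4],
}
patient = {"converter": 0.05, "stable": 0.9}

moderate = prediction_set(calibration, patient, confidence=0.75)
strict = prediction_set(calibration, patient, confidence=0.9)
forced = forced_prediction(calibration, patient)
```

In this toy case the moderate confidence level yields a single label, while the stricter level returns both labels, which corresponds to a patient for whom no single-class prognosis is given. The forced prediction always returns one label.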

Topological Sequence Segments Discriminate Between Class C GPCR Subtypes

G protein-coupled receptors (GPCRs) are eukaryotic cell membrane proteins with a key role as extracellular signal transmitters. While GPCRs embrace a wide and heterogeneous super-family of proteins, our interest in this study is in Class C, which is of great relevance to pharmacology. The scarcity of knowledge about their full 3-D crystal structure makes the use of their primary amino acid sequences important for analysis. In this paper, we systematically analyze whether segments of the receptor sequences are able to discriminate between the different class C GPCR subtypes according to their topological location in the extracellular, transmembrane or intracellular domain. For this, we build on previous research showing that the use of the extracellular N-terminus domain on its own for this classification task entailed only a minor decrease in subtype discrimination when compared to the complete sequence. We use Support Vector Machine-based classification models to assess the subtype-discriminating power of the topological segments.

Caroline König, René Alquézar, Alfredo Vellido, Jesús Giraldo

QmihR: Pipeline for Quantification of Microbiome in Human RNA-seq

The huge amount of genomic and transcriptomic data obtained to characterize human diversity can also be exploited to indirectly gather information on the human microbiome. Here we present the pipeline QmihR, designed to identify and quantify the abundance of known microbiome communities and to search for new/rare pathogenic species in RNA-seq datasets. We applied QmihR to 36 RNA-seq tumor tissue samples from Ukrainian gastric carcinoma patients available in TCGA, in order to characterize their microbiome and check the efficiency of the pipeline. The microbes present in the samples were in accordance with published data from other European datasets, and the independent BLAST evaluation of microbiome-aligned reads confirmed that the assigned species presented the highest BLAST match-hits. QmihR is available at GitHub (https://github.com/Pereira-lab/QmihR).

Bruno Cavadas, Joana Ferreira, Rui Camacho, Nuno A. Fonseca, Luisa Pereira

Improving Prognostic Prediction from Mild Cognitive Impairment to Alzheimer’s Disease Using Genetic Algorithms

Alzheimer’s disease is becoming a global epidemic. Its impact is devastating for patients, their families and the economy. As such, it is important to build good prognostic models that can predict conversion to dementia so that treatment measures can be taken. In this work, we applied a genetic algorithm to choose the most relevant neuropsychological and demographic features for prognostic prediction. The results show improvements over other feature selection methods, with our model being able to predict conversion to dementia with an AUC and sensitivity of 88%. Moreover, we found that with only 7 features it is possible to achieve good classification results. These results could help physicians to adjust treatment and select which exams should be performed regularly to increase efficiency in clinical practice.

Francisco L. Ferreira, Sandra Cardoso, Dina Silva, Manuela Guerreiro, Alexandre de Mendonça, Sara C. Madeira

Novel Method of Identifying DNA Methylation Fingerprint of Acute Myeloid Leukaemia

Finding new statistical approaches to high-throughput data analysis is currently a very active topic. Such data need dedicated methods and algorithms of analysis because of the huge number of features, and often also because of the small number of samples. Methylation data are also special because of dependencies between features and their neighbourhood. There is a need for a novel, data-driven algorithm for these data owing to the wide variety of distributions across data sets. The purpose of this method is the detection of regions with different levels of demethylation. From the biological point of view, the most important genome regions are TSS (transcription start site) regions. Hypermethylation of this part of a gene leads to repression and thus stops gene expression. This phenomenon often occurs in cancer and impairs a number of molecular processes in the cell. The proposed algorithm is applied to data from AML patients in comparison with healthy controls. By combining statistical methods and mathematical modelling, it enables the detection of demethylated regions of DNA and their classification as low, medium or high demethylated.

Agnieszka Cecotka, Joanna Polanska

Metadata Analyser: Measuring Metadata Quality

Scientific research is increasingly dependent on publicly available information and data sharing. So far, the best practice to ensure that data is accessible and shareable has been to deposit it in public repositories. However, these repositories often fail to implement mechanisms that measure data quality, which could improve the discoverability of existing data and contribute to its future integration. In light of this, we present Metadata Analyser, a tool that measures metadata quality. It assesses the quality of metadata by considering the proportion of terms actually linked to ontology concepts, as well as the specificity of the terms used in the metadata. Metadata Analyser was applied to Metabolights, a real-world repository of metabolomics data, and the results show that the tool successfully implements the proposed measures, that there is indeed a lack of effort in the annotation task, and that our tool can be used to improve this situation. Metadata Analyser’s frontend is available at http://masterweb-metadataanalyser.rhcloud.com.

Bruno Inácio, João D. Ferreira, Francisco M. Couto

Vascular Contraction Model Based on Multi-agent Systems

This paper presents a first approximation to the simulation of vascular smooth muscle cells following an agent-based simulation approach. This simulation incorporates mathematical models that describe the behaviour of these cells, which are used by the agents in order to emulate vascular contraction. A first tool, implemented in NetLogo, is provided to run the proposed simulation.

J. A. Rincon, Guerra-Ojeda Sol, V. Julian, C. Carrascosa

Study of the Epigenetic Signals in the Human Genome

Epigenetics can be defined as changes in the genome that are inherited during cell division, but without direct modification of the DNA sequence. These genomic changes are supported by three major epigenetic mechanisms: DNA methylation, histone modification and small RNAs. Different epigenetic marks regulate gene transcription, and some of them, when altered, can trigger various diseases such as cancer. This work focuses on the epigenetic signals in the human genome, studying the dependency between the nucleotide word context and the occurrence of epigenomic marking. We based our study on histone epigenomes available in the NIH Roadmap Epigenomics Mapping Consortium database, which contains various types of cells and tissues. We compared the genomic contexts of epigenetic marking among chromosomes and among epigenomes. We included a control scenario: the DNA sequence regions without epigenetic marking. We identified significant differences between the context occurrence of control and epigenetic regions. The genomic words in epigenetic marking regions present a significant association with chromosome and histone modification type.

Susana Ferreira, Vera Afreixo, Gabriela Moura, Ana Tavares

Cloud-Assisted Read Alignment and Privacy

Thanks to the rapid advances in sequencing technologies, genomic data is now being produced at an unprecedented rate. To adapt to this growth, several algorithms and paradigm shifts have been proposed to increase the throughput of the classical DNA workflow, e.g. by relying on the cloud to perform CPU intensive operations. However, the scientific community raised an alarm due to the possible privacy-related attacks that can be executed on genomic data. In this paper we review the state of the art in cloud-based alignment algorithms that have been developed for performance. We then present several privacy-preserving mechanisms that have been, or could be, used to align reads at an incremental performance cost. We finally argue for the use of risk analysis throughout the DNA workflow, to strike a balance between performance and protection of data.

Maria Fernandes, Jérémie Decouchant, Francisco M. Couto, Paulo Esteves-Verissimo

On the Role of Inverted Repeats in DNA Sequence Similarity

In this paper, we propose a computational approach to quantify inverted repeats. This is important because the presence of inverted repeats in genomic data is known to be associated with certain chromosomal rearrangements. First, we present a reference-based relative compression method, which employs statistical characteristics of the genomic data. Then, to determine the similarity between genomic sequences, we use the normalized relative compression measure, which is lightweight in terms of computational time and memory. Testing this approach on various species, including human, chimpanzee, gorilla, chicken, turkey and archaea genomes, we unveil previously unreported results that may support several evolutionary insights.

Morteza Hosseini, Diogo Pratas, Armando J. Pinho

An Ensemble Approach for Gene Selection in Gene Expression Data

Feature/gene selection is a major research area in the study of gene expression data, generally dealing with the classification of diseases or disease subtypes and the identification of biomarkers related to a type of disease. In this context, this paper proposes an ensemble approach to gene selection for classification tasks on gene expression datasets. The proposal is a four-stage approach to gene filtering. Each stage performs a different gene filtering task: data processing, noise removal, ensemble gene selection, and the application of wrapper methods to reach the end result, a small subset of informative genes. Our proposal has been assessed on two different datasets of the same disease (pancreatic ductal adenocarcinoma), for which good results have been achieved in comparison with other gene selection methods. Hence, the proposed strategy has proven reliable with respect to other approaches.

José A. Castellanos-Garzón, Juan Ramos, Daniel López-Sánchez, Juan F. de Paz

Dissimilar Symmetric Word Pairs in the Human Genome

In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. We focus our study on the complete human genome and its repeat-masked version.

Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo
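
The comparison described in the abstract above can be illustrated with a minimal Python sketch. It computes the inter-word distance distribution of a word and of its reverse complement in a DNA string. The abstract does not specify the authors' new dissimilarity measure, so the total variation distance is used here purely as a stand-in; all function names are hypothetical.

```python
from collections import Counter

# Translation table for the DNA complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def inter_word_distances(sequence, word):
    """Distances between start positions of consecutive occurrences
    of `word` in `sequence` (overlapping occurrences are counted)."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_distribution(distances):
    """Relative frequency of each observed distance."""
    counts = Counter(distances)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}


def total_variation(p, q):
    """Stand-in dissimilarity (NOT the authors' measure): half the
    L1 distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in support)


def symmetric_pair_dissimilarity(sequence, word):
    """Dissimilarity between the distance distributions of a word
    and of its reverse complement."""
    p = distance_distribution(inter_word_distances(sequence, word))
    q = distance_distribution(
        inter_word_distances(sequence, reverse_complement(word))
    )
    return total_variation(p, q)
```

Ranking all words of a given length by `symmetric_pair_dissimilarity` would surface the most dissimilar symmetric pairs under this stand-in measure; a different measure can be substituted by replacing `total_variation`.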

A Critical Evaluation of Automatic Atom Mapping Algorithms and Tools

Identifying the atoms that change position in chemical reactions is important knowledge within the field of Metabolic Engineering. It can lead to new advances at different levels, from the reconstruction of metabolic networks to the classification of chemical reactions, through the identification of the atomic changes inside a reaction. The Atom Mapping approach was initially developed in the 1960s but has recently seen important advances, being used in diverse biological and biotechnological studies. The main methodologies used for atom mapping are the Maximum Common Substructure and the Linear Optimization methods, both of which require computational know-how and powerful resources to run the underlying tools. In this work, we assessed a number of previously implemented atom mapping frameworks, and built a framework capable of managing the different data inputs and outputs, as well as the mapping process provided by each of these third-party tools. We evaluated the admissibility of the atom maps calculated by different algorithms, and also assessed whether different approaches returned equivalent atom maps for the same chemical reaction.

Nuno Osório, Paulo Vilaça, Miguel Rocha

Substitutional Tolerant Markov Models for Relative Compression of DNA Sequences

Referential compression is one of the fundamental operations for storing and analyzing DNA data. The models that incorporate relative compression, a special case of referential compression, are being steadily improved, particularly those based on Markov models. In this paper, we propose a new model, the substitutional tolerant Markov model (STMM), which can be used in cooperation with regular Markov models to improve compression efficiency. We assessed its impact on synthetic and real DNA sequences, showing a substantial improvement in compression while only slightly increasing the computation time. In particular, it shows high efficiency in modeling species that diverged less than 40 million years ago.

Diogo Pratas, Morteza Hosseini, Armando J. Pinho

Biomedical Word Sense Disambiguation with Word Embeddings

There is a growing need for automatic extraction of information and knowledge from the increasing amount of biomedical and clinical data produced, particularly in textual form. Natural language processing addresses this need, helping in tasks such as information extraction and information retrieval. Word sense disambiguation is an important part of this process, being responsible for assigning the proper concept to an ambiguous term. In this paper, we present results from machine learning and knowledge-based algorithms applied to biomedical word sense disambiguation. For the supervised machine learning algorithms, we used word embeddings, calculated from the full MEDLINE literature database, as global features and compared the results to the use of local unigram and bigram features. For the knowledge-based method, we represented the textual definitions of biomedical concepts from the UMLS database as word embedding vectors, and combined this with concept associations derived from the MeSH term co-occurrences. Both the machine learning and the knowledge-based results indicate that word embeddings are informative and improve biomedical word sense disambiguation accuracy. Applied to the reference MSH WSD data set, our knowledge-based approach achieves 85.1% disambiguation accuracy, which is higher than some previously proposed approaches that do not use machine-learning strategies.

Rui Antunes, Sérgio Matos

Classification Tools for Carotenoid Content Estimation in Manihot esculenta via Metabolomics and Machine Learning

Cassava genotypes (Manihot esculenta Crantz) with high pro-vitamin A activity have been identified as a strategy to reduce the prevalence of vitamin A deficiency. The color variability of cassava roots, which can vary from white to red, is related to the presence of several carotenoid pigments. The present study shows how CIELAB color measurement on cassava root tissue can be used as a non-destructive and very fast technique to quantify the levels of carotenoids in cassava root samples, avoiding the use of more expensive analytical techniques for compound quantification, such as UV-visible spectrophotometry and HPLC. For this, we used machine learning techniques, associating the colorimetric data (CIELAB) with the data obtained by UV-vis and HPLC, to obtain prediction models of carotenoids for this type of biomass. The best values of R2 (above 90%) were observed for the predictive variable TCC determined by UV-vis spectrophotometry. When we tested the machine learning models using the CIELAB values as inputs, for the total carotenoid content quantified by HPLC, the Partial Least Squares (PLS), Support Vector Machines, and Elastic Net models presented the best values of R2 (above 40%) and Root-Mean-Square Error (RMSE). For the carotenoid quantification by UV-vis spectrophotometry, the R2 (around 60%) and RMSE values (around 6.5) are more satisfactory; Ridge regression and Elastic Net showed the best results. It can be concluded that the colorimetric technique (CIELAB), associated with UV-vis/HPLC and predictive statistical analysis through machine learning, can predict the content of total carotenoids in these samples with good precision and accuracy.

Rodolfo Moresco, Telma Afonso, Virgílio G. Uarrota, Bruno Bachiega Navarro, Eduardo da C. Nunes, Miguel Rocha, Marcelo Maraschin

UV-Vis Spectrophotometry and Chemometrics as Tools for Recognition of the Biochemical Profiles of Organic Banana Peels (Musa sp.) According to the Seasonality in Southern Brazil

Banana (Musa sp.) has received wide interest in popular and scientific medicine because of its rich composition in bioactive metabolites, e.g., phenolic compounds, found in notable concentrations in its peel. Banana peel is a residue that is under-exploited by industry. Thus, with the intention of finding a use for this by-product in the health care or cosmetics industries, we evaluated its aqueous extract (AE) as a source of bioactive phenolic compounds, aiming to apply them in future studies of biological activities. To that end, samples of banana peels were chemically profiled throughout the year to identify the best harvest time of this biomass regarding its phenolic composition. We used additional information on the chemical heterogeneity of the samples determined by seasonality, through a set of analytical and climatic data, to build chemometric models supported by bioinformatics tools. PCA and HCA analyses showed that low temperatures, normally observed in winter, strongly modulate banana metabolism, leading to increased amounts of phenolic compounds and improving the antioxidant activity of the banana peel AE. The samples collected during the winter months showed a similar profile and a relatively high concentration of phenolic compounds with potential for future studies of biological activities.

Susane Lopes, Rodolfo Moresco, Luiz Augusto Martins Peruch, Miguel Rocha, Marcelo Maraschin

Influence of Solar Radiation on the Production of Secondary Metabolites in Three Rice (Oryza sativa) Cultivars

Rice (Oryza sativa L.) is one of the most produced and consumed cereals worldwide, and its importance is highlighted mainly in developing countries, where it plays a strategic economic and social role. Due to the importance of rice in the diet, its composition and nutritional characteristics are directly related to the health of the population. In rice production systems, some climatic factors are determinants of good crop performance, inducing the biosynthesis of primary and secondary metabolites. The present study determined the metabolic profiles, through UV-visible spectrophotometry, of leaf samples of three rice cultivars (Marques – white, Ônix – black, and Rubi – red pericarp) throughout the rice's vegetative stages in two experimental periods, from September to December 2015 and from January to April 2016. Solar radiation was recorded throughout the experimental period. UV-vis spectrophotometric techniques were applied to the organosolvent extracts of leaf samples, and the quantitative results for certain metabolites, e.g., chlorophylls, carotenoids, phenolics, flavonoids, and sugars, as well as the antioxidant activity, were analyzed with chemometrics tools. The results showed that the biochemical parameters carotenoids, chlorophylls and sugars are more affected by the intensity of the radiation than phenolics and flavonoids, and that these alterations may be detected through statistical analysis of biochemical concentrations and UV-vis spectra.

Eva Regina Oliveira, Ester Wickert, Fernanda Ramlov, Rodolfo Moresco, Larissa Simão, Bruno B. Navarro, Claudia Bauer, Débora Cabral, Miguel Rocha, Marcelo Maraschin

Cryfa: A Tool to Compact and Encrypt FASTA Files

NGS (next-generation sequencing) is bringing the need to efficiently handle large volumes of patient data while complying with privacy laws, for example through secure protocols that ensure the confidentiality of patients' DNA. Although there are multiple file representations for genomic data, the FASTA format is perhaps the most widely used. As far as we know, FASTA encryption is being addressed with general-purpose encryption methods, without exploring a compact representation. In this paper, we propose Cryfa, a new fast encryption method to securely store FASTA files in a compact form. The main differences between a general encryption approach and Cryfa are the reduction of storage, by up to approximately three times, without compromising security, and the possibility of integration with pipelines. The core of the encryption method uses a symmetric approach, AES (Advanced Encryption Standard). The Cryfa implementation is freely available, under the GPLv3 license, at https://github.com/pratas/cryfa.

Diogo Pratas, Morteza Hosseini, Armando J. Pinho

An Automated Colourimetric Test by Computational Chromaticity Analysis: A Case Study of Tuberculosis Test

This paper presents an investigation into a novel approach for an automated universal colourimetric test by chromaticity analysis. This work particularly focuses on how a well-adjusted balance between computational complexity and biochemical analysis can reduce the associated cost and overcome the limits of conventional chemical practice. The proposed research goal encompasses the following criteria: anytime, anywhere access, low cost, rapid detection, and better sensitivity, specificity and accuracy. Our method obtains the amount of colour change for each instance by delta E calculation. The system can provide the result in any ambient condition from the trajectory of colour change using Euclidean distance in LAB colour space. The strategy is verified on plasmonic ELISA based diagnosis of tuberculosis (TB). TB detection by plasmonic ELISA is a challenging, demanding and time-consuming diagnosis. By completing the computation in real time, we circumvent this obstacle, delivering the TB diagnosis in less than 15 min.

Marzia Hoque Tania, K. T. Lwin, Kamal AbuHassan, Noremylia Mohd Bakhori, Umi Zulaikha Mohd Azmi, Nor Azah Yusof, M. A. Hossain
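
The delta E calculation mentioned in the abstract above can be sketched as a plain Euclidean distance between two CIELAB triples, which is the standard CIE76 definition. The sketch below assumes colours have already been converted to LAB; the image capture, colour conversion and decision thresholds of the authors' system are not described in the abstract, and the function names are hypothetical.

```python
import math


def delta_e_cie76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance between two
    (L*, a*, b*) triples in CIELAB space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))


def colour_change_trajectory(readings):
    """Delta E of each reading relative to the first (baseline)
    reading; `readings` is a sequence of (L*, a*, b*) triples."""
    baseline = readings[0]
    return [delta_e_cie76(baseline, reading) for reading in readings]
```

For example, `colour_change_trajectory([(50, 0, 0), (50, 3, 4), (50, 6, 8)])` gives the colour change of a sample over three illustrative time points relative to its starting colour.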

Characterization of the Chemical Composition of Banana Peels from Southern Brazil Across the Seasons Using Nuclear Magnetic Resonance and Chemometrics

Banana peels are a source of important bioactive compounds, such as phenolics, carotenoids and biogenic amines, among others. For industrial usage of that by-product, a certain homogeneity of its chemical composition is required, a trait affected by (a)biotic ecological factors. This study therefore aimed to investigate the chemical composition of banana peels, to gain insights into possible metabolic changes caused by the seasons in southern Brazil. For this purpose, a Nuclear Magnetic Resonance (NMR)-based metabolic profiling strategy was adopted, followed by chemometrics analysis, using the specmine package for the R environment. The results show that the different seasons can, in fact, influence the metabolic composition, namely the levels of metabolites extracted from the banana peels. The analytical approach adopted here, i.e., NMR-based metabolomics coupled to chemometrics analysis, appears to allow the chemical heterogeneity of banana peels over the harvest seasons to be identified, making it possible to obtain standardized extracts for further technological use.

Sara Cardoso, Marcelo Maraschin, Luiz Augusto Martins Peruch, Miguel Rocha, Aline Pereira

Erratum to: Multidimensional Feature Selection and Interaction Mining with Decision Tree Based Ensemble Methods

Lukasz Krol, Joanna Polanska

Backmatter

Title: 11th International Conference on Practical Applications of Computational Biology & Bioinformatics
Edited by: Florentino Fdez-Riverola
Mohd Saberi Mohamad
Miguel Rocha
Juan F. De Paz
Tiago Pinto
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-60816-7
Print ISBN: 978-3-319-60815-0
DOI: https://doi.org/10.1007/978-3-319-60816-7