About This Book

This book constitutes the refereed proceedings of the Brazilian Symposium on Bioinformatics, BSB 2019, held in Fortaleza, Brazil, in October 2019.

The 9 revised full papers and 3 short papers were carefully reviewed and selected from 22 submissions. The papers address a broad range of current topics in computational biology and bioinformatics.



Full Papers


On Clustering Validation in Metagenomics Sequence Binning

In clustering, one of the most challenging aspects is validation, whose objective is to evaluate how good a clustering solution is. Sequence binning is a clustering task in metagenomic data analysis; the challenge is essentially to group together sequences belonging to the same genome. As a clustering problem, it requires proper validation criteria for the discovered partitions. In sequence binning, precision, recall, and the F-measure index (external validation) are normally used as benchmarks. In practice, however, the (sub-)optimal number of clusters is unknown, so these metrics might be biased toward an overestimated "ground truth". When reference information about the genomes is not available, how can one evaluate the quality of the bins resulting from a clustering solution? To answer this question, we empirically study both quantitative (internal indexes) and qualitative aspects (biological soundness) while evaluating clustering solutions to the sequence binning problem. Our experimental study indicates that the number of clusters estimated by binning algorithms does not have a major impact on the biological soundness of the discovered clusters: sub-optimal bins of high quality (greater than 90%) were identified in both rich and poor clustering partitions. Qualitative validation is therefore essential for proper evaluation of a sequence binning solution. Internal indexes should only be used in combination with qualitative ones, as a trade-off between the number of partitions and the biological soundness of their respective bins.
Paulo Oliveira, Kleber Padovani, Ronnie Alves
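The external validation mentioned in the abstract can be sketched as follows: for each true genome, take the best per-bin F-measure and average these, weighted by genome size (an illustrative Python sketch of the standard metric, not the authors' code):

```python
from collections import Counter

def binning_f_measure(true_labels, bin_labels):
    """Weighted F-measure of a binning against known genome labels.

    For each true genome g, take the best F-score over all bins, then
    average weighted by genome size (a common external index for
    sequence binning)."""
    genomes = Counter(true_labels)          # sequences per true genome
    bins = Counter(bin_labels)              # sequences per predicted bin
    joint = Counter(zip(true_labels, bin_labels))  # genome/bin co-occurrence
    total = len(true_labels)
    score = 0.0
    for g, n_g in genomes.items():
        best = 0.0
        for b, n_b in bins.items():
            n_gb = joint[(g, b)]
            if n_gb == 0:
                continue
            precision = n_gb / n_b
            recall = n_gb / n_g
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_g / total) * best
    return score
```

A perfect binning scores 1.0; mixing two genomes evenly across two bins scores 0.5.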

Genome Assembly Using Reinforcement Learning

Reinforcement learning (RL) aims to build intelligent agents able to act optimally, after a training process, to solve a given goal task in an autonomous and non-deterministic fashion. It has been successfully employed in several areas; however, few RL-based approaches to genome assembly have been reported, especially ones considering real input datasets. De novo genome assembly is a crucial step in many genome projects, but due to its high complexity, the output of state-of-the-art assemblers is still insufficient to help researchers answer all of their scientific questions properly. Hence, the development of better assemblers is desirable, and perhaps necessary, and preliminary studies suggest that RL has the potential to solve this computational task. In this sense, this paper presents an empirical analysis to evaluate this hypothesis, particularly at larger scales, through performance assessment along with time and space complexity analysis of a theoretical approach to the assembly problem proposed by [2] using the RL algorithm Q-learning. Our analysis shows that, although space and time complexities are limiting issues at scale, RL is a viable alternative for solving the DNA fragment assembly problem.
Roberto Xavier, Kleber Padovani de Souza, Annie Chateau, Ronnie Alves
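The Q-learning machinery the paper builds on can be illustrated with tabular updates on a toy chain environment, where states stand in for partial solutions (a minimal, self-contained sketch, not the formulation of [2]):

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     eps=0.5, seed=0):
    """Tabular Q-learning on a toy chain MDP: from each state the agent
    moves left (action 0) or right (action 1); reaching the last state
    yields reward 1. Illustrates the Q-update rule
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda x: Q[s][x])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy moves right in every non-terminal state, with values decaying geometrically (by gamma) with distance to the reward.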

GeNWeMME: A Network-Based Computational Method for Prioritizing Groups of Significant Related Genes in Cancer

Identifying significant mutations in cancer is a challenging problem in cancer genomics, and computational methods for it have been developed in recent years. In this work, we present a flexible computational method named GeNWeMME (Gene Network + Weighted Mutations + Mutual Exclusivity). Our method uses an extensive biological basis for prioritizing groups of significant and related genes in cancer, considering mutation data, mutation types, gene interaction networks, and mutual exclusivity patterns. Each of these aspects can be weighted according to the objective of the analysis, allowing cancer genomics professionals to tune the method to their needs. We tested our method on four types of cancer, where it identified known cancer genes and suggested others for further biological validation.
Jorge Francisco Cutigi, Adriane Feijo Evangelista, Adenilso Simao
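A common way to quantify the mutual exclusivity pattern mentioned in the abstract is the coverage of a gene set divided by the sum of per-gene mutation counts (a standard score used in the field, not necessarily the exact weighting in GeNWeMME):

```python
def mutual_exclusivity_score(mutations, gene_set):
    """mutations: {gene: set of mutated patient ids}.
    Returns a score in (0, 1]: patients covered by the gene set divided
    by the sum of per-gene coverages. A score of 1 means no patient is
    mutated in more than one gene of the set (perfect exclusivity)."""
    if not gene_set:
        return 0.0
    covered = set().union(*(mutations[g] for g in gene_set))
    total = sum(len(mutations[g]) for g in gene_set)
    return len(covered) / total if total else 0.0
```

Two genes mutated in disjoint patient sets score 1.0; two genes mutated in exactly the same patients score 0.5.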

MDR SurFlexDock: A Semi-automatic Webserver for Discrete Receptor-Ensemble Docking

In current computational biology, docking is a popular tool used to find the best fit of a ligand relative to its molecular receptor in forming a complex. However, most tools do not take the flexibility of the receptor into account, due to computational cost. As a result, the conformational changes caused by induced fit are ignored in exploratory docking experiments. In this context, and to improve the predictive capacity of docking, a good strategy is to simulate the flexibility of the receptor using key conformations obtained by mixing crystallography and computer simulations, a technique known as ensemble docking. Here, we present MDR SurFlexDock, a web tool that improves docking experiments by computing a discrete, but representative, ensemble of contact surfaces of the receptor through clustering of molecular simulation trajectories, in order to simulate the intrinsic flexibility of the ligand-contacting surface. The results of the interaction of each receptor-compound complex are presented in a concise tabular format to allow rapid analysis of compounds when ranking them by inhibition constant (Ki). MDR SurFlexDock can be valuable for docking against new receptors obtained by homology modelling, for extensive analysis of different chemotypes on proteins with little structural information, and for fast characterization of binding capacities on contact surfaces with poor structural information or optimized only for a specific ligand. MDR SurFlexDock is freely available as a web service at http://biocomp.uenf.br:81.
João Luiz de Almeida Filho, Jorge Hernandez Fernandez

Venom Gland Peptides of Arthropods from the Brazilian Cerrado Biome Unveiled by Transcriptome Analysis

Animal venoms are rich sources of pharmacologically active molecules. Less than 10% of arthropod venom components have been characterized so far, reinforcing the importance of prospective studies. The Cerrado, in the Midwest Region of Brazil, is the second-largest biome in Brazil, presenting a vast biodiversity of arthropod species with venom glands. In this scenario, in a project called "Inovatoxin", active principles present in the venom of three representative arthropods from this region were characterized structurally and functionally, using prospective proteomic and transcriptomic strategies. High-Throughput Sequencing (HTS) is among the strategies providing the raw material to help identify bioactive peptides present in these arthropods' venom. This work proposes a workflow that allowed the annotation of a total of 230 venom peptides from the Brazilian arthropods: the spider Acanthoscurria paulensis, the social wasp Polybia sp., and the scorpion Tityus fasciolatus. Along with these results, abundant data on the metabolism of the three species were also obtained. These results extend knowledge of venoms, contributing new perspectives on rational therapeutic measures to treat accidents with these animals, as well as on academic and biotechnological applications.
Giovanni M. Guidini, Waldeyr M. C. da Silva, Thalita S. Camargos, Caroline F. B. Mourão, Priscilla Galante, Tainá Raiol, Marcelo M. Brígido, Maria Emília M. T. Walter, Elisabeth N. F. Schwartz

Block-Interchange Distance Considering Intergenic Regions

Genome Rearrangement (GR) is a field of computational biology that uses conserved regions within two genomes as a source of information for comparison purposes. This branch of genomics uses the order in which these regions appear to infer evolutionary scenarios and to compute distances between species, while usually neglecting non-conserved DNA sequences. This paper sheds light on this matter and proposes models that use both conserved and non-conserved sequences as sources of information. The questions that arise are how classic GR algorithms should be adapted and how much we would pay, in terms of complexity, to have this feature. Advances on these questions help in measuring the advantages of including such an approach in GR algorithms. We propose to represent non-conserved regions by their lengths and apply this idea to a genome rearrangement problem called "Sorting by Block-Interchanges". The problem is an interesting choice from the theory-of-computation viewpoint because it is one of the few problems in the field that is solvable in polynomial time and whose algorithm has a small number of steps. That said, we present a 2-approximation algorithm for this problem, along with data structures and formal definitions that may be generalized to other problems in the GR field considering intergenic regions.
Ulisses Dias, Andre Rodrigues Oliveira, Klairton Lima Brito, Zanoni Dias
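The block-interchange operation itself, the rearrangement event sorted by the paper's 2-approximation, swaps two non-overlapping blocks of a permutation of conserved regions. A minimal sketch, here omitting the intergenic-length bookkeeping the paper adds:

```python
def block_interchange(perm, i, j, k, l):
    """Swap the blocks perm[i:j] and perm[k:l], where 0 <= i < j <= k < l.
    The material between the two blocks (perm[j:k]) stays in place."""
    assert 0 <= i < j <= k < l <= len(perm), "blocks must be disjoint and ordered"
    return perm[:i] + perm[k:l] + perm[j:k] + perm[i:j] + perm[l:]
```

For example, exchanging block [1] with block [4, 5] in [1, 2, 3, 4, 5] yields [4, 5, 2, 3, 1]. In the paper's model each gap between consecutive blocks would additionally carry an intergenic-region length.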

K-mer Mapping and RDBMS Indexes

K-mer mapping, an internal process of de novo NGS genome fragment assembly methods, constitutes a computational challenge due to its high main-memory consumption. We present a study of index-based methods to deal with this problem in an RDBMS environment. We propose an ad hoc I/O cost model and analyze the performance of hash and B-tree index structures. Furthermore, we present a novel approach for a hash-based index that takes into account the notion of minimum substrings. An actual RDBMS implementation, evaluated on a sugarcane dataset, shows that one can obtain considerable performance gains while reducing main-memory requirements.
Elvismary Molina de Armas, Paulo Cavalcanti Gomes Ferreira, Edward Hermann Haeusler, Maristela Terto de Holanda, Sérgio Lifschitz
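The minimum-substring idea behind the proposed hash index can be sketched as follows: k-mers sharing their lexicographically smallest m-length substring (their minimizer) land in the same bucket, so consecutive overlapping k-mers tend to hash together (an illustrative Python sketch, not the paper's RDBMS implementation):

```python
def minimizer(kmer, m):
    """Lexicographically smallest m-length substring of a k-mer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def bucket_kmers(sequence, k, m):
    """Group all k-mers of `sequence` by their minimizer. Overlapping
    k-mers often share a minimizer, which keeps bucket counts low and
    reduces index size compared to one hash entry per k-mer."""
    buckets = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        buckets.setdefault(minimizer(kmer, m), []).append(kmer)
    return buckets
```

For instance, the 4-mers of "GATTACA" fall into just two buckets, keyed by "AT" and "AC".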

A Clustering Approach to Identify Candidates to Housekeeping Genes Based on RNA-seq Data

Housekeeping genes (HKGs) are essential for gene expression studies performed through Reverse Transcription quantitative Polymerase Chain Reaction (RT-qPCR). These genes are involved in the basic cellular processes essential for cell maintenance, survival, and function; thus, HKGs should be expressed in all cells of an organism regardless of tissue type, cell state, or cell condition. High-throughput technologies, including RNA sequencing (RNA-seq), are used to study and identify such genes: RNA-seq allows the measurement of gene expression profiles in a target tissue or an isolated cell. Moreover, machine learning methods are routinely applied in genomics to enable the interpretation of large datasets, including those related to gene expression. This study reports a new machine learning based approach to identify candidate HKGs in silico from RNA-seq gene expression data. The approach enabled the identification of stable HKG candidates in RNA-seq data from Corynebacterium pseudotuberculosis. These genes showed stable expression under different stress conditions, as well as low variation indexes and fold changes. Furthermore, some of these genes were already reported in the literature as HKGs or HKG candidates for the same or other bacterial organisms, which reinforces the accuracy of the proposed method. We present a novel approach, based on the K-means algorithm, internal metrics, and machine learning methods, that can identify stable housekeeping genes from gene expression data with high accuracy and efficiency.
Edian F. Franco, Dener Maués, Ronnie Alves, Luis Guimarães, Vasco Azevedo, Artur Silva, Preetam Ghosh, Jefferson Morais, Rommel T. J. Ramos
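The K-means-based selection can be sketched as clustering genes by their expression variability across conditions and keeping the lowest-variability cluster (a simplified one-dimensional illustration of the idea, not the authors' pipeline):

```python
def coeff_variation(values):
    """Coefficient of variation: standard deviation over mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return (var ** 0.5) / mean if mean else float("inf")

def kmeans_1d(points, k, iters=50):
    """Plain K-means on scalar values; centroids seeded at quantiles."""
    pts = sorted(points)
    centroids = [pts[len(pts) * (2 * i + 1) // (2 * k)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids

def stable_gene_candidates(expression, k=2):
    """expression: {gene: [expression under condition 1, 2, ...]}.
    Cluster genes by coefficient of variation and return those in the
    lowest-CV cluster as housekeeping candidates."""
    genes = list(expression)
    cvs = [coeff_variation(expression[g]) for g in genes]
    labels, centroids = kmeans_1d(cvs, k)
    low = min(range(k), key=lambda c: centroids[c])
    return [g for g, lab in zip(genes, labels) if lab == low]
```

Genes with nearly flat profiles across conditions end up in the low-variation cluster, while strongly condition-dependent genes are filtered out.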

Predicting Cancer Patients’ Survival Using Random Forests

The increasing amount of data available on the web, coupled with the demand for useful information, has sparked growing interest in knowledge discovery in large information systems, especially biomedical ones. Health institutions operate in an environment that generates thousands of health records about patients, and such databases can be the source of a wealth of information. For instance, they can be used to study factors that contribute to the incidence of a pathology and thereby determine patient profiles at the earliest stage of the disease. Such information can be extracted with the help of machine learning methods, which are capable of dealing with large amounts of data in order to make predictions. These methods offer an opportunity to translate new data into palpable information and thus allow earlier diagnosis and more precise treatment options. To explore the potential of these methods, we use a database of cancer patient records made publicly available by the Oncocentro Foundation of São Paulo, containing historical clinical information from cancer patients over the past 20 years. In this paper we present an initial investigation towards the goal of improving prognosis and thereby increasing the chances of survival among cancer patients. The Random Forest classification model was employed in our analysis and proved to be a suitable prediction tool for our purpose. With it, we intend to provide means for the design of predictive, preventive, and personalized treatments, as well as to assist in the decision-making process around the disease.
Camila Takemoto Bertolini, Saul de Castro Leite, Fernanda Nascimento Almeida
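The Random Forest idea, bootstrap-aggregated trees with majority voting, can be illustrated with decision stumps in place of full trees (a pared-down, self-contained sketch; the paper itself uses full Random Forest models on clinical data):

```python
import random

def fit_stump(X, y):
    """Best single-feature threshold split, scored by training accuracy.
    Returns (feature, threshold, label_if_below, label_if_above)."""
    best, best_acc = (0, X[0][0], 0, 1), -1.0
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for left, right in ((0, 0), (0, 1), (1, 0), (1, 1)):
                pred = [left if row[f] <= t else right for row in X]
                acc = sum(p == label for p, label in zip(pred, y)) / len(y)
                if acc > best_acc:
                    best_acc, best = acc, (f, t, left, right)
    return best

def predict_stump(stump, row):
    f, t, left, right = stump
    return left if row[f] <= t else right

def random_forest(X, y, n_trees=15, seed=0):
    """Bagging: each stump is trained on a bootstrap resample of the
    data; prediction is the majority vote across stumps."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    def predict(row):
        votes = [predict_stump(s, row) for s in forest]
        return max(set(votes), key=votes.count)
    return predict
```

On a toy dataset separable by a single threshold, the ensemble recovers the decision boundary even though individual stumps see different resamples.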

Extended Abstracts


Searching in Silico Novel Targets for Specific Coffee Rust Disease Control

The coffee industry has been threatened by production losses due to rust disease since 1850. Coffee leaf rust (CLR) has important social and economic impacts, and its control is still a challenge. The CLR pathogen is the fungus Hemileia vastatrix Berkeley & Broome (Basidiomycota, order Pucciniales), currently controlled by spraying non-specific anti-fungal chemicals. Advances in molecular biology and bioinformatics may allow the identification of new targets and environmentally safe strategies for controlling CLR. Genomic and transcriptomic data are available for H. vastatrix, allowing the search for new proteins to achieve better disease control. We used a dataset of 34,242 sequences from the fungal genome and transcriptome, with a filtering strategy based on protein annotation, structure, and subcellular localization, to select three essential proteins related to steroid synthesis, cell membrane, and cell wall metabolism. This short paper reports an ongoing study towards the development of new molecules that might be validated and contribute to new products that are specific and ecologically friendly.
Jonathan D. Lima, Bernard Maigret, Diana Fernandez, Jennifer Decloquement, Danilo Pinho, Erika V.S. Albuquerque, Marcelo O. Rodrigues, Natalia F. Martins

A Domain Framework Approach for Quality Feature Analysis of Genome Assemblies

The genome assembly research area has evolved quickly, adapting to new sequencing technologies and modern computational environments. Many assembler software systems exist, following multiple approaches; however, at the end of the process, the assembly quality can always be questioned. Once an assembly is accomplished, one may generate quality features for its qualification. Nonetheless, these features do not directly explain the assembly quality; they only list quantitative descriptions of the assembly. This work proposes GAAF (Genome Assembly Analysis Framework), a domain framework for feature analysis in the post-assembly process. GAAF works with distinct species, assemblers, and features, and its goal is to enable data interpretation and assembly quality evaluation.
Guilherme Borba Neumann, Elvismary Molina de Armas, Fernanda Araujo Baiao, Ruy Luiz Milidiu, Sergio Lifschitz

Identifying Schistosoma mansoni Essential Protein Candidates Based on Machine Learning

The essentiality of proteins is a valuable characteristic in research related to the development of new drugs. Research on neglected diseases particularly profits from this characteristic, given the lack of investment in the search for new drugs. Among the neglected diseases, we can highlight schistosomiasis, caused by the organism Schistosoma mansoni. This organism is a major cause of infections in humans, and only one drug for its treatment is recommended by the World Health Organization, which raises concern about the development of drug resistance. In this context, the present work aims to identify S. mansoni essential protein candidates. The methodology uses a machine learning approach and leverages knowledge of protein essentiality in model organisms. Experimental results show that the Random Forest algorithm achieved the best performance in predicting protein essentiality in S. mansoni compared to the other evaluated algorithms.
Francimary P. Garcia, Gustavo Paiva Guedes, Kele Teixeira Belloze

