
2023 | Book

Advances in Bioinformatics and Computational Biology

16th Brazilian Symposium on Bioinformatics, BSB 2023, Curitiba, Brazil, June 13–16, 2023, Proceedings

About this book

This book constitutes the proceedings of the 16th Brazilian Symposium on Bioinformatics on Advances in Bioinformatics and Computational Biology, BSB 2023, which took place in Curitiba, Brazil, in June 2023.

The 11 full papers and 3 short papers presented in this volume were carefully reviewed and selected from 24 submissions. The papers focus on bioinformatics, computational biology, biological databases, biological networks, cheminformatics, evolutionary genomics, computational proteomics, systems biology, drug design, genomics, machine learning applications in bioinformatics, metagenomics, molecular docking and modeling, molecular evolution and phylogenetics, protein structure and modeling, proteomics, transcriptomics, single-cell analysis, and workflows in bioinformatics.

Table of Contents

Frontmatter
Block Interchange and Reversal Distance on Unbalanced Genomes
Abstract
One method for inferring the evolutionary distance between two organisms is to find the rearrangement distance, which is defined as the minimum number of genome rearrangements required to transform one genome into the other. Rearrangements that do not alter the genome content are known as conservative. Examples of such rearrangements include: reversal, which reverts a segment of the genome; transposition, which exchanges two consecutive blocks; block interchange (BI), which exchanges two blocks at any position in the genome; and double cut and join (DCJ), which cuts two different pairs of adjacent blocks and joins them in a different manner. Initially, works in this area involved comparing genomes that shared the same set of conserved blocks. Nowadays, researchers are investigating unbalanced genomes (genomes with a distinct set of genes), which requires the use of non-conservative rearrangements such as insertions and deletions (indels). In cases where there are no repeated blocks and the genomes have the same set of blocks, the BI Distance and the Reversal Distance have polynomial-time algorithms, while the complexity of the BI and Reversal Distance problem remains unknown. In this study, we investigate the BI and Indel Distance and the BI, Reversal, and Indel Distance on genomes with different gene content and no repeated genes. We present 2-approximation algorithms for each problem using a variant of the breakpoint graph structure.
Alexsandro Oliveira Alexandrino, Gabriel Siqueira, Klairton Lima Brito, Andre Rodrigues Oliveira, Ulisses Dias, Zanoni Dias
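
As a rough illustration of the conservative rearrangements named in the abstract above, the following sketch applies a reversal and a block interchange to an unsigned permutation and counts breakpoints. It is not the authors' 2-approximation algorithm and does not handle indels or unbalanced genomes; the function names and the 0-based, inclusive index convention are assumptions made for this example.

```python
def reversal(genome, i, j):
    """Reverse the segment genome[i..j] (0-based, inclusive)."""
    return genome[:i] + genome[i:j + 1][::-1] + genome[j + 1:]


def block_interchange(genome, i, j, k, l):
    """Exchange the blocks genome[i..j] and genome[k..l], with i <= j < k <= l."""
    if not (0 <= i <= j < k <= l < len(genome)):
        raise ValueError("blocks must be non-empty, ordered, and non-overlapping")
    return (genome[:i] + genome[k:l + 1] + genome[j + 1:k]
            + genome[i:j + 1] + genome[l + 1:])


def breakpoints(genome):
    """Count adjacent pairs that are not consecutive integers.

    The permutation is extended with 0 at the front and n + 1 at the end,
    as is usual for unsigned permutations.
    """
    extended = [0] + list(genome) + [len(genome) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)


genome = [1, 4, 3, 2, 5]
sorted_by_reversal = reversal(genome, 1, 3)
sorted_by_bi = block_interchange(genome, 1, 1, 3, 3)
```

In this toy case a single reversal or a single block interchange sorts the permutation, and the breakpoint count drops from 2 to 0.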
circTIS: A Weighted Degree String Kernel with Support Vector Machine Tool for Translation Initiation Sites Prediction in circRNA
Abstract
Recent studies discovered that peptides generated from the translation of circRNAs participate in several biological processes, many related to human diseases. Researchers have observed that initiation of translation in circRNAs frequently occurs from non-AUG start codons. However, most existing computational tools for translation initiation site (TIS) prediction consider only the canonical AUG start codon. Thus, we developed a new methodology for predicting TIS AUG and near-cognates, considering the circularization of ORFs occurring in circRNAs. Initially, we used the weighted degree string kernel to create a data representation of the circRNA sequence fragments around possible TIS. Next, we applied a support vector machine to calculate a score representing the potential of the sequence fragment to contain an actual TIS. We used datasets from annotated TIS on circRNAs sequences to train and test our methodology. The first experiment showed that the sequence fragment length is the best value for the kernel’s degree hyperparameter. Next, we investigated the most suitable sequence fragment length. Finally, we compared our methodology with three tools, TITER, TIS Predictor, and TIS Transformer. For TIS AUG prediction, circTIS obtained an AUROC of 98.64%, while TITER, TIS Predictor, and TIS Transformer obtained 78.97%, 78.39%, and 81.3%, respectively. For the TIS near-cognate prediction, our method obtained an AUROC equal to 96.84%, while TITER, TIS Predictor, and TIS Transformer got 81.37%, 72.68%, and 66.33%, respectively. We implemented our methodology in the circTIS tool, freely available at https://github.com/denilsonfbar/circTIS.
Denilson Fagundes Barbosa, Liliane Santana Oliveira, André Yoshiaki Kashiwabara
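
For readers unfamiliar with the weighted degree string kernel mentioned in the abstract above, here is a minimal sketch of its standard textbook form for two equal-length sequences: position-wise matches of every k-mer length up to the degree are counted and weighted by beta_k = 2(d - k + 1) / (d(d + 1)). This is an assumption about the general kernel, not the circTIS implementation; in particular it ignores the circularization of ORFs, and the function name is illustrative.

```python
def wd_kernel(x, y, degree):
    """Weighted degree kernel between two equal-length sequences.

    For each k in 1..degree, count positions where the k-mers of x and y
    are identical, and weight the count by 2 * (degree - k + 1) / (degree * (degree + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for k in range(1, degree + 1):
        beta = 2 * (degree - k + 1) / (degree * (degree + 1))
        matches = sum(1 for i in range(len(x) - k + 1) if x[i:i + k] == y[i:i + k])
        total += beta * matches
    return total


# Two toy fragments that differ only in the last nucleotide.
similarity = wd_kernel("AUGC", "AUGG", degree=2)
```

Longer shared k-mers at the same position raise the score, which is why the choice of degree relative to the fragment length matters.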
Evaluating the Molecular—Electronic Structure and the Antiviral Effect of Functionalized Heparin on Graphene Oxide Through Ab Initio Computer Simulations and Molecular Docking
Abstract
In antiviral studies, heparin is widely used against the SARS-CoV-2 virus. In this study, computer simulations were performed to understand the role of heparin in a possible blockade of the spike protein binding with the human cell receptor. Another molecule, graphene oxide (GO), was functionalized to interact and bind with heparin to achieve an increase in binding affinity with the spike protein. In the first stage, the electronic and chemical interactions between the molecules were analyzed through ab initio simulations using the SIESTA (Spanish Initiative for Electronic Simulations with Thousands of Atoms) software. Next, we evaluated the interaction between molecules together and separately in the spike protein target through molecular docking simulations using the AutoDock Vina software. The results were relevant because GO functionalized with heparin exhibited an increase in affinity energy to the spike protein. This affinity indicated a possible increase in antiviral activity. This increase will be verified in the future through in vitro tests. Experimental tests on the synthesis and morphology of the material preliminarily indicate a good interaction between molecules and absorption of heparin by GO. This phenomenon confirmed the results of first principles simulations.
André Flores dos Santos, Mirkos Ortiz Martins, Mariana Zancan Tonel, Solange Binotto Fagan
Make No Mistake! Why Do Tools Make Incorrect Long Non-coding RNA Classification?
Abstract
Long non-coding RNAs (lncRNAs) play important roles in various biological processes, and their accurate identification is essential for understanding their functions and potential therapeutic applications. In a previous study, we assessed the impact of short- and long-read sequencing technologies on the computational identification of long non-coding RNAs in human and plant data. We provided evidence of where and how approaches to lncRNA classification could be improved. In this follow-up study, we investigate the sequences misclassified by five machine learning tools for lncRNA classification in humans to understand the reasons behind the failures of the tools. Our analysis suggests that the primary cause for the failures of these tools is the overlap of two coding regions by lncRNAs, similar to a chimeric sequence. Furthermore, we emphasize the need to view genes as transcriptional units, as the transcript will define the gene function. These insights underscore the need for further refinement and improvement of these tools to enhance their accuracy and reliability in lncRNA prediction and classification, ultimately contributing to a better understanding of the role of lncRNAs in various biological processes and potential therapeutic applications.
Alisson G. Chiquitto, Lucas Otávio L. Silva, Liliane Santana Oliveira, Douglas S. Domingues, Alexandre R. Paschoal
Spectrum-Based Statistical Methods for Directed Graphs with Applications in Biological Data
Abstract
Graphs often model complex phenomena in diverse fields, such as social networks, connectivity among brain regions, or protein-protein interactions. However, standard computational methods are insufficient for empirical network analysis due to randomness. Thus, a natural solution would be the use of statistical approaches. A recent paper by Takahashi et al. suggested that the graph spectrum is a good fingerprint of the graph’s structure. They developed several statistical methods based on this feature. These methods, however, rely on the distribution of the eigenvalues of the graph being real-valued, which is false when graphs are directed. In this paper, we extend their results to directed graphs by analyzing the distribution of complex eigenvalues instead. We show the strength of our methods by performing simulations on artificially generated groups of graphs and finally show a proof of concept using concrete biological data obtained by Project Tycho.
Victor Chavauty Villela, Eduardo Silva Lira, André Fujita
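
A small sketch of why the spectrum of a directed graph need not be real, as the abstract above points out: the adjacency matrix of a directed 3-cycle is not symmetric, and a complex cube root of unity is one of its eigenvalues. The example checks this directly with the standard library; the graph and the helper names are illustrative and are not taken from the paper.

```python
import cmath

# Adjacency matrix of the directed cycle 0 -> 1 -> 2 -> 0
# (A[i][j] = 1 when there is an edge from i to j).
A = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


# Candidate eigenvalue: a primitive cube root of unity, which is not real.
omega = cmath.exp(2j * cmath.pi / 3)
v = [1, omega, omega ** 2]

# Check the eigenvalue equation A v = omega v up to floating-point error.
Av = mat_vec(A, v)
is_eigenpair = all(abs(lhs - omega * rhs) < 1e-12 for lhs, rhs in zip(Av, v))
```

The undirected version of the same cycle has a symmetric adjacency matrix and therefore only real eigenvalues, which is the case the earlier real-valued methods rely on.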
Feature Selection Investigation in Machine Learning Docking Scoring Functions
Abstract
The in silico evaluation of small molecules (ligands) and receptors (proteins) interactions is of great importance, especially in Drug Design. This is one of the principal computational methodologies that can be incorporated into the process of proposing new drugs, with the aim of reducing the high financial costs and time involved. In this context, molecular docking is a computer simulation procedure used to predict the best conformation and orientation of a ligand in the binding site of a target protein. These docking algorithms evaluate the protein-ligand complex interactions using scoring functions (SF). SF computationally quantify the complex binding affinity and can be divided into categories according to the methodology applied in their development: Physics-based, Empirical, Knowledge-based and Machine Learning. Machine Learning (ML) scoring functions train the SF considering features obtained from known protein-ligand complexes and experimental affinities. These SF rely heavily on the set of attributes that are used to train them. Thus, in this work, we use PCA, ANOVA and Random Forest to investigate how these feature selection methods impact the performance of three Machine Learning scoring functions trained with Support Vector Machines, Elastic Net Regularization and Neural Networks algorithms. The results show that Neural Networks can greatly benefit from Feature selection performed by Random Forests but not from ANOVA and PCA. The conclusions are that Feature selection can improve the results of regression and in this study Neural Networks combined with Random Forest is the best option.
Maurício Dorneles Caldeira Balboni, Oscar Emilio Arrua, Adriano V. Werhli, Karina dos Santos Machado
Using Natural Language Processing for Context Identification in COVID-19 Literature
Abstract
The COVID-19 pandemic led to an unprecedented volume of articles published in scientific journals with possible strategies and technologies to contain the disease. Academic papers summarize the main findings of scientific research, which are vital for decision-making, especially regarding health data. However, due to the technical language used in this type of manuscript, its understanding becomes complex for professionals who do not have a greater affinity with scientific research. Thus, building strategies that improve communication between health professionals and academics is essential. In this paper, we show a semi-automated approach to analyze the scientific literature through natural language processing using as a basis the results collected by the “Scientific Evidence Panel on Pharmacological Treatment and Vaccines – COVID-19” proposed by the Brazilian Ministry of Health. After manual curation, we obtained an accuracy of 0.64, precision of 0.74, recall of 0.70, and F1 score of 0.72 for the analysis of the using-context of technologies, such as treatments or medicines (i.e., we evaluated if the keyword was used in a positive or negative context). Our results demonstrate how machine learning and natural language processing techniques can greatly help understand data from the literature, taking into account the context. Additionally, we present a proposal for a scientific panel called SimplificaSUS, which includes evidence taken from scientific articles evaluated through machine learning and natural language processing methods.
Frederico Carvalho, Diego Mariano, Marcos Bomfim, Giovana Fiorini, Luana Bastos, Ana Paula Abreu, Vivian Paixão, Lucas Santos, Juliana Silva, Angie Puelles, Alessandra Silva, Raquel Cardoso de Melo-Minardi
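
The F1 score reported in the abstract above can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. A minimal sketch (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_f1 = round(f1_score(0.74, 0.70), 2)
```

The result rounds to 0.72, which agrees with the F1 score stated in the abstract.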
A Framework for Inference and Selection of Cell Signaling Pathway Dynamic Models
Abstract
Properly modeling the dynamics of cell signaling pathways requires several steps, such as selecting a subset of chemical reactions, mapping them into a mathematical model that deals with the communication of the pathway with the remainder of the cell (e.g., systems of universal differential equations, UDEs), inferring model parameters, and selecting the best model based on experimental data. However, this entire process can be extremely laborious and time-consuming for many researchers, as they often have to access different and complicated tools to achieve this goal. To address the challenges associated with this process in a more efficient way, we propose a framework that provides a streamlined approach tailored for universal differential equation (UDE)-based cell signaling pathway modeling. The open-source, free framework (github.com/Dynamic-Systems-Biology/BSB-2023-Framework) combines parameter inference algorithms, model selection techniques, and data importation from public repositories of biochemical reactions into a single tool. We provide an example of the usage of the proposed framework in a Julia Jupyter notebook. We expect that this streamlined approach will enable researchers to design improved cell signaling pathway models more easily, which may lead to new insights and discoveries in the study of biological mechanisms.
Marcelo Batista, Fabio Montoni, Cristiano Campos, Ronaldo Nogueira, Hugo A. Armelin, Marcelo S. Reis
Intentional Semantics for Molecular Biology
Abstract
This article presents an intentional semantics, using Object Petri Nets (OPNs), to assign activity to each biological molecule and complex, such as mRNA, tRNA, ribosomes, and protein synthesis. The work differs from traditional uses of Petri Nets in Biology and Chemistry for being a bottom-up and general semantics and not only a formalization of some molecular biological phenomenon. Assigning activities to every molecule and the difference between biological function and activity is also a conceptual contribution of this work. To illustrate our semantics, we set to tRNA, mRNA, ribosome, and the protein transcription molecular complex the respective activities expressed by OPNs.
Edward H. Haeusler, Bruno Cuconato, Luiz A. Glatzl, Maria L. Guateque, Diogo M. Vieira, Elvismary M. de Armas, Fernanda Baião, Marcos Catanho, Antonio B. de Miranda, Sergio Lifschitz
transcAnalysis: A Snakemake Pipeline for Differential Expression and Post-transcriptional Modification Analysis
Abstract
The transcAnalysis pipeline is a comprehensive tool that allows the analysis of transcriptome data. The pipeline allows for analysis of differential expression, alternative splicing, lncRNA, and RNA editing, with a specific focus on A-to-I editing mediated by the ADAR protein. This type of RNA editing is widespread and can significantly affect gene regulation and function. The results from these analyses are integrated, and the events are associated with each gene. The pipeline also integrates results that can help correlate gene expression and post-transcriptional events. This allows for a comprehensive understanding of the functional impact and provides insight into the biological processes and pathways associated with these events. One of the significant advantages of the transcAnalysis pipeline is its ability to perform all these analyses with a single command using the Snakemake package. This feature simplifies the analysis process and makes it accessible to researchers with limited bioinformatics expertise. Its user-friendly ability to perform multiple analyses with a single command makes it an ideal choice for researchers looking to analyze transcriptome data.
Pedro H. A. Barros, Waldeyr M. C. Silva, Marcelo M. Brigido
Peptide-Protein Interface Classification Using Convolutional Neural Networks
Abstract
Peptides are short chains of amino acid residues linked through peptide bonds, whose potential to act as protein inhibitors has contributed to the advancement of rational drug design. Indeed, understanding the interactions between proteins and peptides is potentially helpful for several biotechnological applications. However, it is not a trivial task since peptides can adopt different conformations when interacting with proteins. In this paper, we develop a classification model for protein-peptide interfaces using a convolutional neural network and distance maps. To evaluate our proposal, we performed two case studies classifying protein-peptide interfaces based on peptide sequences and receptor classes. Additionally, we compared the distance map approach with a graph-based structural signatures approach. We aim to find out if a convolutional neural network could classify peptides just from the patterns of distances in these maps. In conclusion, graph-based methods were slightly superior in almost all comparisons performed. However, distance map-based signature methods achieved better results for some classes, such as classifying hormones, membranes, and viral proteins. These results shed light on the potential use of distance maps for classifying protein-peptide interfaces. Nevertheless, more experiments may be needed to explore this use.
Lucas Moraes dos Santos, Diego Mariano, Luana Luiza Bastos, Alessandra Gomes Cioletti, Raquel Cardoso de Melo Minardi
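
To make the idea of a distance map in the abstract above concrete, the sketch below builds one as a matrix of pairwise Euclidean distances between two sets of 3D points, for example one representative atom per peptide residue against one per receptor residue. This is an assumption about the general construction, not the authors' exact preprocessing; the coordinates are made up for illustration.

```python
import math


def distance_map(coords_a, coords_b):
    """Return the matrix of Euclidean distances between two coordinate lists.

    Row i, column j holds the distance between coords_a[i] and coords_b[j].
    """
    return [[math.dist(p, q) for q in coords_b] for p in coords_a]


# Hypothetical coordinates (in angstroms) for 2 peptide and 3 receptor residues.
peptide = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
receptor = [(0.0, 4.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]

dmap = distance_map(peptide, receptor)
```

The resulting matrix can be treated like a single-channel image, which is what makes it a natural input for a convolutional neural network.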
A Power Law Semantic Similarity from Gene Ontology
Abstract
Currently, massive amounts of data are generated in the most diverse areas of knowledge, such as bioinformatics, requiring the analysis and summarization of this data for its understanding. Semantic similarity can be seen as an approach that considers the features of objects in a context in order to establish the similarity or dissimilarity of these objects. The Gene Ontology (GO) has been widely employed as a source of features in the estimation of semantic similarity between its terms. Several methods have been proposed in the literature for estimating semantic similarity from GO. However, the methods are based on parametric distributions or arbitrarily defined parameters that do not consider the distribution of GO data. In this context, this work presents a data-driven method for estimating the semantic similarity from GO terms that exploits the power-law distribution. A set of five metabolic pathways was considered for the evaluation of the proposed method and compared with some of the principal methods in the literature. The results showed the adequacy of the proposed method in the estimation of semantic similarities and that it produced more compact gene clusters among all the methods adopted and with an adequate distance between them, leading to clusters more assertive and less susceptible to errors. The proposed method is freely available at https://github.com/EricIto/plawss.
Eric Augusto Ito, Fábio Fernandes da Rocha Vicente, Luiz Filipe Protasio Pereira, Fabricio Martins Lopes
Gene Networks Inference by Reinforcement Learning
Abstract
Gene Regulatory Networks inference from gene expression data is an important problem in the systems biology field, involving the estimation of gene-gene indirect dependencies and the regulatory functions among these interactions to provide a model that explains the gene expression dataset. The main goal is to comprehend the global molecular mechanisms underlying diseases for the development of medical treatments and drugs. However, such a problem is considered an open problem, since it is difficult to obtain a satisfactory estimation of the dependencies given a very limited number of samples subject to experimental noise. Many gene network inference methods exist in the literature, some of which use heuristics or model-based algorithms to find interesting networks that explain the data by codifying whole networks as solutions. However, in general, these models are slow, not scalable to real-sized networks (thousands of genes), or require many parameters, the knowledge of a specialist, or a large number of samples to be feasible. Reinforcement Learning is an adaptable goal-oriented approach that does not require large labeled datasets and many parameters; can give good quality solutions in a feasible execution time; and can work automatically without the need of a specialist for a long time. Therefore, we here propose a way to adapt Reinforcement Learning to the Gene Regulatory Networks inference domain in order to get networks with quality comparable to one achieved by exhaustive search, but in much smaller execution time. Our experimental evaluation shows that our proposal is promising in learning and successfully finding good solutions across different tasks automatically in a reasonable time. However, scalability to networks with thousands of genes remains a limitation of our RL approach due to excessive memory consumption, although we foresee some possible improvements that could deal with this limitation in future versions of our proposed method.
Rodrigo Cesar Bonini, David Correa Martins-Jr
Exploring Identifiability in Hybrid Models of Cell Signaling Pathways
Abstract
Various processes, including growth, proliferation, migration, and death, mediate the activity of a cell. To better understand these processes, dynamic modeling can be a helpful tool. First-principle modeling provides interpretability, while data-driven modeling can offer predictive performance using models such as neural networks, but at the expense of the understanding of the underlying biological processes. A hybrid model that combines both approaches might mitigate the limitations of each of them alone; nevertheless, to this end one needs to tackle issues such as model calibration and identifiability. In this paper, we report a methodology to address these challenges that makes use of universal differential equation (UDE)-based hybrid modeling, where a partially known, ODE-based, first-principle model is combined with a feedforward neural network-based, data-driven model. We used a synthetic signaling network composed of 38 chemical species and 51 reactions to generate simulated time series for those species, and then defined twelve of those reactions as a partially known first-principle model. A UDE system was defined with the latter and calibrated with the data simulated with the whole network. Initial results showed that this approach could identify the missing communication of the partially known first-principle model with the remainder of the network. Therefore, we expect that this type of hybrid modeling might become a powerful tool to assist in the investigation of underlying mechanisms in cellular systems.
Ronaldo N. Sousa, Cristiano G. S. Campos, Willian Wang, Ronaldo F. Hashimoto, Hugo A. Armelin, Marcelo S. Reis
Backmatter
Metadata
Title
Advances in Bioinformatics and Computational Biology
Edited by
Marcelo S. Reis
Raquel C. de Melo-Minardi
Copyright Year
2023
Electronic ISBN
978-3-031-42715-2
Print ISBN
978-3-031-42714-5
DOI
https://doi.org/10.1007/978-3-031-42715-2