
2022 | Book

Computational Intelligence Methods for Bioinformatics and Biostatistics

17th International Meeting, CIBB 2021, Virtual Event, November 15–17, 2021, Revised Selected Papers

edited by: Davide Chicco, Angelo Facchiano, Erica Tavazzi, Enrico Longato, Martina Vettoretti, Anna Bernasconi, Simone Avesani, Paolo Cazzaniga

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

This book constitutes revised selected papers from the 17th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, CIBB 2021, which was held virtually during November 15–17, 2021.

The 19 papers included in these proceedings were carefully reviewed and selected from 26 submissions, and they focus on bioinformatics, computational biology, health informatics, cheminformatics, biotechnology, biostatistics, and biomedical imaging.

Table of contents

Frontmatter
Chemical Neural Networks and Synthetic Cell Biotechnology: Preludes to Chemical AI
Abstract
Synthetic Biology and Artificial Intelligence are two relevant fields in modern science. Together with Robotics, they either serve practical purposes or can be used to model organisms’ features and behaviors. The recent Synthetic Biology advancements in the so-called “synthetic cells” area allow the construction of cell-like systems with nontrivial complexity, paving the way to a novel direction: the realization of chemical artificial intelligence. One possible path foresees the “installation” of chemical versions of artificial intelligence devices in synthetic cells. In this article we present this new scenario, focusing on chemical mechanisms and systems that are topologically organized as neural networks, highlighting their possible role in synthetic cell biotechnology. Future directions, challenges and requirements, as well as epistemological interpretations are also briefly discussed.
Pasquale Stano
Development of Bayesian Network for Multiple Sclerosis Risk Factor Interaction Analysis
Abstract
Extensive dataset availability for neurological disease, such as multiple sclerosis (MS), has led to new methods of risk assessment and disease course prediction, such as using machine learning and other statistical methods. However, many of these methods cannot properly capture complex relationships between variables that affect results of odds ratios unless independence between risk factors is assumed. This work addresses this limitation using a Bayesian network (BN) approach to MS risk assessment that incorporates data from UK Biobank with a counterfactual model, which includes causal knowledge of dependencies between variables. We present the results of more traditional Bayesian measurements such as necessity and sufficiency, along with odds ratios for each of the risk factors in the model. The greatest risk is produced by the genetic factor DRB15 (2.7 OR) but smoking, vitamin D levels, and childhood obesity may also play a role in MS development. Further data collection, especially in infectious mononucleosis in the population, is needed to provide a more accurate measure of risk.
Morghan Hartmann, Norman Fenton, Ruth Dobson
Real-Time Automatic Plankton Detection, Tracking and Classification on Raw Hologram
Abstract
Digital holography is an imaging process that encodes the 3D information of objects into a single intensity image. In recent years, this technology has been used to detect and count various microscopic objects and has been applied in submersible equipment to monitor the in situ distribution of plankton. To count and classify plankton, conventional methods require a holographic reconstruction step to decode the hologram before identifying the objects. However, this iterative and time-consuming step must be performed at each frame of a video, which makes it difficult to support real-time processing. We propose a real-time object detection based approach that simultaneously performs the detection, classification and counting of all plankton within videos of raw holograms. Experiments show that our pipeline based on YOLOv5 and SORT is fast (44 FPS) and can accurately detect and identify the plankton among 13 classes (97.6% mAP@0.5, 92% MOTA). Our method can be implemented to detect and count other microscopic objects in raw holograms.
Romane Scherrer, Rodrigue Govan, Thomas Quiniou, Thierry Jauffrais, Hugues Lemonnier, Sophie Bonnet, Nazha Selmaoui-Folcher
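The plankton-tracking abstract above reports a MOTA score. As a reminder of what that figure summarizes, here is a minimal sketch of the commonly used CLEAR MOT definition, MOTA = 1 − (FN + FP + IDSW) / GT, with counts summed over all frames. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, ground_truth: int) -> float:
    """Multiple Object Tracking Accuracy from counts summed over all frames."""
    if ground_truth <= 0:
        raise ValueError("ground_truth must be a positive object count")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / ground_truth


# Hypothetical totals for a video: 1000 annotated objects,
# 40 missed, 30 spurious detections, 10 identity switches.
example_mota = mota(false_negatives=40, false_positives=30,
                    id_switches=10, ground_truth=1000)
print(f"MOTA = {example_mota:.2%}")
```

Because all three error types are divided by the number of ground-truth objects, MOTA can become negative when a tracker makes more errors than there are objects.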
The First in-silico Model of Leg Movement Activity During Sleep
Abstract
We developed the first model simulator of leg movement activity during sleep. We designed and calibrated a phenomenological model on control subjects not showing significant periodic leg movements (PLM). To test a single generator hypothesis behind PLM (a single pacemaker possibly resulting from two or more interacting spinal/supraspinal generators), we added a periodic excitatory input to the control model. We describe the onset of a movement in one leg as the firing of a neuron integrating physiological excitatory and inhibitory inputs from the central nervous system, while the duration of the movement was drawn in accordance with statistical evidence. The period and the intensity of the periodic input were calibrated on a dataset of subjects showing PLM (mainly restless legs syndrome patients). Despite its many simplifying assumptions, the strongest being the stationarity of the neural processes during night sleep, the model simulations are in remarkable agreement with the polysomnographically recorded data.
Matteo Italia, Andrea Danani, Fabio Dercole, Raffaele Ferri, Mauro Manconi
Transfer Learning and Magnetic Resonance Imaging Techniques for the Deep Neural Network-Based Diagnosis of Early Cognitive Decline and Dementia
Abstract
Combining neuroimaging technologies and deep networks has gained considerable attention over the last few years. Instead of training deep networks from scratch, transfer learning methods have allowed retraining deep networks, which were already trained on massive data repositories, using a smaller dataset from a new application domain, and have demonstrated high performance in several application areas. In the context of a diagnosis of neurodegenerative disorders, this approach can potentially lessen the dependence of the training process on large neuroimaging datasets, and reduce the length of the training, validation, and testing process on a new dataset. To this end, the paper investigates transfer learning of deep networks, which were trained on ImageNet data, for the diagnosis of dementia. The designed networks are modifications of the AlexNet and VGG16 Convolutional Neural Networks (CNNs) and are retrained to classify Mild Cognitive Impairment (MCI), Alzheimer’s disease (AD) and normal patients using Diffusion Tensor Imaging (DTI) and Magnetic Resonance Imaging (MRI) data. An empirical evaluation using DTI and MRI data from the ADNI database supports the potential of transfer learning methods in the detection of early degenerative changes in the brain. Diagnosis of AD was achieved with an accuracy of 99.75% and a 0.995 Matthews correlation coefficient (MCC) score using transfer learning of VGG models retrained on DTI scans. Early cognitive decline was predicted with an accuracy of 93.88% and an MCC equal to 0.8602 by VGG models processing MRI data. The proposed models can be used as additional tools to support a quick and efficient diagnosis of MCI, AD and other neurodegenerative disorders.
Nitsa J. Herzog, George D. Magoulas
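The transfer-learning abstract above reports both accuracy and the Matthews correlation coefficient (MCC). The sketch below shows the standard binary form of the MCC computed from confusion-matrix counts, and why it is reported alongside accuracy. It is an illustration only: the paper classifies three groups and may use a multiclass variant, and the counts here are hypothetical.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, not taken from the paper.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)

# Imbalanced case: 90 positives, 10 negatives, 9 of the negatives misclassified.
skewed_accuracy = (90 + 1) / 100
skewed_mcc = matthews_corrcoef(tp=90, tn=1, fp=9, fn=0)
print(f"accuracy = {skewed_accuracy:.2f}, MCC = {skewed_mcc:.2f}")
```

In the imbalanced case accuracy stays high while the MCC drops sharply, because the MCC uses all four cells of the confusion matrix.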
Improving Bacterial sRNA Identification By Combining Genomic Context and Sequence-Derived Features
Abstract
Bacterial small non-coding RNAs (sRNAs) are ubiquitous regulatory RNAs involved in controlling several cellular processes by targeting multiple mRNAs. The large diversity of sRNAs in terms of their length, sequence, and function poses a challenge for computational sRNA prediction. There are several bacterial sRNA prediction tools. Most of them use sequence-derived features or rely on phylogenetic conservation. Recently, a new sRNA predictor (sRNARanking) showed that using genomic context features outperformed methods based on sequence-derived features. Here we comparatively assessed the effect of using sequence-derived features together with genomic context features for computational sRNA prediction and generated a new model sRNARanking v2 with increased predictive performance in terms of the area under the precision-recall curve (AUPRC). sRNARanking v2 is available at:
Mohammad Sorkhian, Megha Nagari, Moustafa Elsisy, Lourdes Peña-Castillo
High-Dimensional Multi-trait GWAS By Reverse Prediction of Genotypes Using Machine Learning Methods
Abstract
Multi-trait genome-wide association studies (GWAS) use multi-variate statistical methods to identify associations between genetic variants and multiple correlated traits simultaneously, and have higher statistical power than independent univariate analyses of traits. Reverse regression, where genotypes of genetic variants are regressed on multiple traits simultaneously, has emerged as a promising approach to perform multi-trait GWAS in high-dimensional settings where the number of traits exceeds the number of samples. We analyzed different machine learning methods (ridge regression, naive Bayes/independent univariate, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate methods. We found that genotype prediction performance, in terms of root mean squared error (RMSE), made it possible to distinguish between genomic regions with high and low transcriptional activity. Moreover, model feature coefficients correlated with the strength of association between variants and individual traits, and were predictive of true trans-acting expression quantitative trait loci (trans-eQTL) target genes, with complementary findings across methods.
Muhammad Ammar Malik, Adriaan-Alexander Ludl, Tom Michoel
A Non-Negative Matrix Tri-Factorization Based Method for Predicting Antitumor Drug Sensitivity
Abstract
Large annotated cell line collections have been proven to enable the prediction of drug response in the pre-clinical setting. We present an enhancement of the Non-Negative Matrix Tri-Factorization method, which allows the integration of different data types for the prediction of missing associations. To test our method we retrieved a dataset from the Cancer Cell Line Encyclopedia (CCLE), containing the connections among cell lines and drugs by means of their IC50 values, and we integrated it by linking cell lines to their respective tissue of origin and genomic profile. We performed two kinds of experiments: a) prediction of missing values in the matrix, b) prediction of the complete drug profile of a new cell line, demonstrating the validity of the method in both scenarios.
Carolina Testa, Sara Pidò, Pietro Pinoli
A Rule-Based Approach for Generating Synthetic Biological Pathways
Abstract
Deep learning has recently enabled many advances for computer vision applications in image recognition, localization, segmentation, and understanding. However, applying deep learning models to a wider variety of domains is often limited by available labeled data. To address this problem, conventional approaches supplement more samples by augmenting existing datasets. However, these up-sampling methods usually only create derivations of the source images. To supplement with unique examples, we introduce an approach for generating purely synthetic data for object detection on biological pathway diagrams, which describe a series of molecular interactions leading to a certain biological function based on a set of rules and domain knowledge. Our method iteratively generates each pathway relationship uniquely. These realistic replicas improve the generalization significantly across a variety of settings. The code is available at https://github.com/JRunner97/Pathway_Data_Synthesis.
Joshua Thompson, Haoyu Dong, Kai Liu, Fei He, Mihail Popescu, Dong Xu
Machine Learning Classifiers Based on Dimensionality Reduction Techniques for the Early Diagnosis of Alzheimer’s Disease Using Magnetic Resonance Imaging and Positron Emission Tomography Brain Data
Abstract
Machine learning techniques have become more attractive and widely used for medical image processing purposes. In particular, the diagnosis of neurodegenerative diseases has recently emerged as a potential field of application for these methods. Comparing the performance of a single algorithm across different study contexts can be biased, which usually leads to incorrect results. In this context, this study compares the performance of different machine learning techniques, identifying their main trends and their application for the diagnosis of Alzheimer’s disease (AD). We present a computer-aided diagnosis system for the early diagnosis of AD by analyzing brain data from the OASIS dataset. Principal component analysis (PCA) and the uniform manifold approximation and projection (UMAP) technique have been evaluated on the magnetic resonance imaging and positron emission tomography images as feature selection techniques. After that, the features are fed into nine machine learning models, namely Support vector machine (SVM), Artificial neural networks, Decision trees, Random Forests, Discriminant analysis, Regression analysis, Naive Bayes, k-Nearest neighbors, and Ensemble learning. The performance of the proposed classifiers is investigated using the confusion matrix. In addition, area under the curve, Matthews correlation coefficient, accuracy, and F1-score metrics are calculated from this matrix. Our results indicate that the SVM-PCA/UMAP schemes provide a significant advantage over the other classifiers. Moreover, they are more efficient than the baseline model based on the voxels-as-features reference feature extraction approach.
Lilia Lazli
Text Mining Enhancements for Image Recognition of Gene Names and Gene Relations
Abstract
The volume of the biological literature has been increasing fast, which leads to a rapid growth of biological pathway figures included in the related biological papers. Each pathway figure encompasses rich biological information, consisting of gene names and gene relations. However, manual curation of pathway figures requires tremendous time and labor. While leveraging advanced image understanding models may accelerate the curation process, the accuracy of these models still needs improvement. Since each pathway figure is associated with a paper, most of the gene names and gene relations in a pathway figure also appear in the related paper text, where we can utilize text mining to improve the image recognition results. In this paper, we applied a fuzzy match method to detect gene names with different “gene dictionaries,” as well as gene co-occurrence in the plain text for suggesting gene relations. We have demonstrated that the performance of image understanding for both gene name recognition and gene relation extraction can be improved with the help of text mining methods. All the data and code are available at GitHub (https://github.com/lyfer233/Text-Mining-Enhancements-for-Image-Recognition-of-Gene-Names-and-Gene-Relations).
Yijie Ren, Fei He, Jing Qu, Yifan Li, Joshua Thompson, Mark Hannink, Mihail Popescu, Dong Xu
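The text-mining abstract above describes fuzzy matching of recognized gene names against gene dictionaries. A minimal sketch of that general idea follows, assuming Python's standard-library `difflib` as the matcher. The paper's actual matching method and dictionaries are not described on this page, so `GENE_DICTIONARY`, `correct_gene_name`, and the 0.7 cutoff are hypothetical choices for illustration.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

# Hypothetical, tiny stand-in for a real gene dictionary.
GENE_DICTIONARY = ["TP53", "BRCA1", "CDKN2A", "EGFR", "KRAS", "MTOR"]


def correct_gene_name(ocr_text: str, dictionary: Sequence[str],
                      cutoff: float = 0.7) -> Optional[str]:
    """Return the dictionary entry closest to an OCR token, or None if nothing is close."""
    # Compare case-insensitively, but return the dictionary's own spelling.
    lookup = {name.upper(): name for name in dictionary}
    matches = get_close_matches(ocr_text.upper(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


# Typical OCR confusions: "6" for "G", "Z" for "2"; "Figure" is not a gene name.
for token in ["E6FR", "cdknZa", "Figure"]:
    print(token, "->", correct_gene_name(token, GENE_DICTIONARY))
```

Short gene symbols leave little room for error, so a cutoff that is too permissive can map a token to the wrong gene; in practice the threshold would need tuning against curated examples.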
Sentence Classification to Detect Tables for Helping Extraction of Regulatory Interactions in Bacteria
Abstract
The biomedical knowledge about transcriptional regulation in bacteria is rapidly published in scientific articles, so keeping biological databases up to date by manual curation is nearly impossible. Despite the efforts in biomedical text mining, there are still challenges in extracting regulatory interactions (RIs) between transcription factors and genes from text documents. One of them arises from text extraction from PDF files. We have observed that the extraction of RIs from text lines that come from tables of the original PDF article produces false positives. Here, we address the problem of automatically separating these text lines from those that are regular sentences by using automatic classification. Our best model was a Support Vector Classifier trained with n-grams of characters of tags of parts of speech, numbers, symbols, punctuation, brackets, and hyphens. Despite significantly imbalanced data, our classifier achieved a positive class F1-score of 0.87. Our best classifier will eventually be coupled to a preprocessing pipeline for the automatic generation of transcriptional regulatory networks of bacteria by discarding text lines that come from tables of the original PDF.
Dante Sepúlveda, Joel Rodríguez-Herrera, Alfredo Varela-Vega, Axel Zagal Norman, Carlos-Francisco Méndez-Cruz
RF-Isolation: A Novel Representation of Structural Connectivity Networks for Multiple Sclerosis Classification
Abstract
Magnetic Resonance Imaging (MRI) is one of the tools used to identify structural and functional changes caused by multiple sclerosis, and by processing MR images, connectivity networks can be obtained. The analysis of structural connectivity networks of multiple sclerosis patients usually employs network-derived metrics, which are computed independently for each subject. We propose a novel representation of connectivity networks that is extracted from a model trained on the whole multiple sclerosis population: RF-Isolation. RF-Isolation is a vector encoding the disconnection of each region of interest with respect to all other regions. This feature can be easily captured by isolation-based outlier detection methods. We therefore reformulate the task as an outlier detection problem and propose a novel approach, called MS-ProxIF, based on a variant of Isolation Forest, a Random Forest-based outlier detection system, from which the representation is extracted. We test the representation via a set of classification experiments, involving 79 subjects, 55 of which suffer from multiple sclerosis. In particular, we compare favourably to the most used network-derived metrics in multiple sclerosis.
Antonella Mensi, Simona Schiavi, Maria Petracca, Nicole Graziano, Alessandro Daducci, Matilde Inglese, Manuele Bicego
Summarizing Global SARS-CoV-2 Geographical Spread by Phylogenetic Multitype Branching Models
Abstract
Using available phylogeographical data of 3585 SARS-CoV-2 genomes we attempt to provide a global picture of the virus’s dynamics in terms of directly interpretable parameters. To this end we fit a hidden state multistate speciation and extinction model to a pre-estimated phylogenetic tree with information on the place of sampling of each strain. We find that even with such coarse-grained data the dominating transition rates exhibit weak similarities with the most popular, continent-level aggregated, airline passenger flight routes.
Hao Chi Kiang, Krzysztof Bartoszek, Sebastian Sakowski, Stefano Maria Iacus, Michele Vespe
Explainable AI Models for COVID-19 Diagnosis Using CT-Scan Images and Clinical Data
Abstract
The pandemic of COVID-19 has had a significant impact on global health and is becoming a major international concern. Fortunately, early detection helped decrease its number of deaths. Artificial Intelligence (AI) and Machine Learning (ML) techniques are entering a new era, where the main objective is no longer to assist experts in decision-making but to improve and increase their capabilities, and this is where interpretability comes in. This study aims to address one of the biggest hurdles that AI faces today, which is public trust and acceptance due to its black-box strategy. In this paper, we use a deep Convolutional Neural Network (CNN) on chest computed tomography (CT) image data and Support Vector Machine (SVM) and Random Forest (RF) on clinical symptoms data (Bio-data) to diagnose patients positive for COVID-19. Our objective is to present Explainable AI (XAI) models by using the Local Interpretable Model-agnostic Explanations (LIME) technique to identify patients positive for the virus in an interpretable way. The results are promising and outperformed the state of the art. The CNN model reached an Accuracy and F1-Score of 96% on CT-scan images, and SVM outperformed RF with Accuracy of 90% and Specificity of 91% on Bio-data. The interpretable results of XAI-Img-Model and XAI-Bio-Model show that LIME explanations help to understand how SVM and CNN black box models behave in making their decision after being trained on different types of COVID-19 dataset. This can significantly increase trust and help experts understand and learn new patterns for the current pandemic.
Aicha Boutorh, Hala Rahim, Yassmine Bendoumia
The Need of Standardised Metadata to Encode Causal Relationships: Towards Safer Data-Driven Machine Learning Biological Solutions
Abstract
In this paper, we discuss the importance of considering causal relations in the development of machine learning solutions to prevent factors hampering the robustness and generalisation capacity of the models, such as induced biases. This issue often arises when the algorithm decision is affected by confounding factors. In this work, we argue that the integration of research assumptions as causal relationships can help identify potential confounders. Together with metadata information, it can enable meta-comparison of data acquisition pipelines. We call for standardised meta-information practices as a crucial step for proper machine learning solutions development, validation, and data sharing. Such practices include detailing the data acquisition process, aiming for automatic integration of causal relationships and actionable metadata.
Beatriz Garcia Santa Cruz, Carlos Vega, Frank Hertel
Deep Recurrent Neural Networks for the Generation of Synthetic Coronavirus Spike Protein Sequences
Abstract
With the advent of deep learning techniques for text generation comes the possibility of generating fully simulated or synthetic genomes. For this study, the dataset of interest is that of coronaviruses. Coronaviridae are a family of positive-sense RNA viruses capable of infecting humans and animals. These viruses usually cause mild to moderate upper respiratory tract infection; however, they can also cause more severe symptoms, gastrointestinal and central nervous system diseases. The viruses are capable of flexibly adapting to new environments, hence health threats from coronavirus are constant and long-term. Immunogenic spike proteins are glycoproteins found on the surface of Coronaviridae particles that mediate entry to host cells. The aim of this study was to train deep learning neural networks to produce simulated spike protein sequences, which may be able to aid in knowledge and/or vaccine design by creating alternative possible spike sequences that could arise from zoonotic sources in future. Deep learning recurrent neural networks (RNN) were trained to provide computer-simulated coronavirus spike protein sequences in the style of previously known sequences and examine their characteristics. The deep generative model was created as a recurrent neural network employing text embedding and gated recurrent unit layers in TensorFlow Keras. Training used a dataset of alpha, beta, gamma, and delta coronavirus spike sequences. In a set of 100 simulated sequences, all 100 had their most significant BLAST matches to Spike proteins in searches against the NCBI non-redundant dataset (NR) and possessed the expected Pfam domain matches. Simulated sequences from the neural network may be able to guide us with future prospective targets for vaccine discovery in advance of a potential novel zoonosis.
Lisa C. Crossman
Recent Dimensionality Reduction Techniques for High-Dimensional COVID-19 Data
Abstract
We are going through the last years of the COVID-19 pandemic, where almost the entire research community has focused on the challenges that constantly arise. From the computational and mathematical perspective, we have to deal with a dataset with ultra-high volume and ultra-high dimensionality in several experimental studies. An indicative example is DNA sequencing technologies, which offer a more realistic picture of human diseases at the molecular biology level. However, these technologies produce data with high complexity and ultra-high dimensionality. On the other hand, dimensionality reduction techniques are the first choice to address this complexity, revealing the hidden data structure in the original multidimensional space. Also, such techniques can improve the efficiency of machine learning tasks such as classification and clustering. Towards this direction, we study the behavior of seven well-known and cutting-edge dimensionality reduction techniques tailored for RNA-sequencing data. Along with the study of the effect of these algorithms, we propose the extension of the Random projection and Geodesic distance t-Stochastic Neighbor Embedding (RGt-SNE) algorithm, a recent t-Stochastic Neighbor Embedding (t-SNE) improvement. We suggest a new distance criterion for the kernel matrix construction. Our results show the potential of the proposed algorithm and, at the same time, highlight the complexity of the COVID-19 data, which are not separable, creating a significant challenge that the Machine Learning field will have to face.
Ioannis L. Dallas, Aristidis G. Vrahatis, Sotiris K. Tasoulis, Vassilis P. Plagianakos
Soft Brain Ageing Indicators Based on Light-Weight LeNet-Like Neural Networks and Localized 2D Brain Age Biomarkers
Abstract
In recent years, there have been several proposed applications based on Convolutional Neural Networks (CNN) to neuroimaging data analysis and explanation. Traditional pipelines require several processing steps for feature extraction and ageing biomarker detection. However, modern deep learning strategies based on transfer learning and gradient-based explanations (e.g., Grad-Cam++) can provide a more powerful and reliable framework for automatic feature mapping, further identifying 3D ageing biomarkers. Despite the existence of several 3D CNN methods, we show that a LeNet-like 2D-CNN model trained on T1-weighted MRI images can be used to predict brain biological age in a classification task and, by transfer learning, in a regression task. In addition, automatic averaging and aligning of 2D-CNN gradient-based images is applied and shown to improve its biological meaning. The proposed model predicts soft biological brain ageing indicators with a six-class-balanced accuracy of ≈70% by using the anagraphic age of 1100 healthy subjects in comparison to their brain scans.
Francesco Bardozzo, Mattia Delli Priscoli, Andrea Gerardo Russo, Davide Crescenzi, Ugo Di Benedetto, Fabrizio Esposito, Roberto Tagliaferri
Backmatter
Metadata
Title
Computational Intelligence Methods for Bioinformatics and Biostatistics
edited by
Davide Chicco
Angelo Facchiano
Erica Tavazzi
Enrico Longato
Martina Vettoretti
Anna Bernasconi
Simone Avesani
Paolo Cazzaniga
Copyright year
2022
Electronic ISBN
978-3-031-20837-9
Print ISBN
978-3-031-20836-2
DOI
https://doi.org/10.1007/978-3-031-20837-9