Computational Advances in Bio and Medical Sciences
13th International Conference, ICCABS 2025, Atlanta, GA, USA, January 12–14, 2025, Revised Selected Papers
- 2026
- Book
- Editors
- Mohammed Alser
- Mukul S. Bansal
- Yury Khudyakov
- Serghei Mangul
- Ion I. Mandoiu
- Marmar R. Moussa
- Murray Patterson
- Sanguthevar Rajasekaran
- Pavel Skums
- Shibu Yooseph
- Alexander Zelikovsky
- Book Series
- Lecture Notes in Computer Science
- Publisher
- Springer Nature Switzerland
About this book
This book constitutes the refereed proceedings of the 13th International Conference on Computational Advances in Bio and Medical Sciences, ICCABS 2025, held in Atlanta, GA, USA, during January 12–14, 2025.
The 26 full papers presented in these proceedings were carefully reviewed and selected from 75 submissions. ICCABS aims to bring together researchers, scientists, and students from academia, laboratories, and industry to discuss recent advances in computational techniques and applications in the areas of biology, medicine, and drug discovery.
Table of Contents
Frontmatter
A Benchmarking Study of Random Projections and Principal Components for Dimensionality Reduction Strategies in Single Cell Analysis
Mohamed Abdelnaby, Marmar R. Moussa
Abstract: Principal Component Analysis (PCA) has long been a cornerstone in dimensionality reduction for high-dimensional data, including single-cell RNA sequencing (scRNA-seq). However, PCA’s performance typically degrades with increasing data size, and the method can be sensitive to outliers and assumes linearity. Recently, Random Projection (RP) methods have emerged as promising alternatives, addressing some of these limitations. This study systematically and comprehensively evaluates PCA and RP approaches, including Singular Value Decomposition (SVD) and randomized SVD, alongside Sparse and Gaussian Random Projection algorithms, with a focus on computational efficiency and downstream analysis effectiveness. We benchmark performance using multiple scRNA-seq datasets including labeled and unlabeled publicly available datasets. We apply Hierarchical Clustering and Spherical K-Means clustering algorithms to assess downstream clustering quality. For labeled datasets, clustering accuracy is measured using the Hungarian algorithm and Mutual Information. For unlabeled datasets, the Dunn Index and Gap Statistic capture cluster separation. Across both dataset types, the Within-Cluster Sum of Squares (WCSS) metric is used to assess variability. Additionally, locality preservation is examined, with RP outperforming PCA in several of the evaluated metrics. Our results demonstrate that RP not only surpasses PCA in computational speed but also rivals and, in some cases, exceeds PCA in preserving data variability and clustering quality. By providing a thorough benchmarking of PCA and RP methods, this work offers valuable insights into selecting optimal dimensionality reduction techniques, balancing computational performance, scalability, and the quality of downstream analyses.
Resistance Genes are Distinct in Protein-Protein Interaction Networks According to Drug Class and Gene Mobility
Nazifa Ahmed Moumi, Connor L. Brown, Shafayat Ahmed, Peter J. Vikesland, Amy Pruden, Liqing Zhang
Abstract: With growing calls for increased surveillance of antibiotic resistance as an escalating global health threat, improved bioinformatic tools are needed to track antibiotic resistance genes (ARGs) across One Health domains. Most studies to date profile ARGs using sequence homology, but such approaches provide limited information about the broader context or function of the ARG in bacterial genomes. Here we introduce a new pipeline, PPI-ARG-finder, for identifying ARGs in genomic data that employs machine learning analysis of Protein-Protein Interaction Networks (PPINs) as a means to improve predictions of ARGs while also providing vital information about the genetic context, such as gene mobility. A random forest model was trained to effectively differentiate between ARGs and nonARGs and was validated using the PPINs of ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter cloacae), which represent urgent threats to human health because they tend to be multi-antibiotic resistant. The pipeline exhibited robustness in discriminating ARGs from nonARGs, achieving an average area under the precision-recall curve of 88%. We further identified that the neighbors of ARGs, i.e., genes connected to ARGs by only one edge, were disproportionately associated with mobile genetic elements, which is consistent with the understanding that ARGs tend to be more mobile compared to randomly sampled genes in the PPINs. PPI-ARG-finder showcases the utility of PPINs in discerning distinctive characteristics of ARGs within a broader genomic context and in differentiating ARGs from nonARGs through network-based attributes and interaction patterns.
DuoHash: Fast Hashing of Spaced Seeds with Application to Spaced K-mers Counting
Leonardo Gemin, Cinzia Pizzi, Matteo Comin
Abstract: Alignment-free genomic sequence analysis has facilitated high-throughput processing within numerous bioinformatics workflows. A central task in alignment-free applications is hashing k-mers, commonly used for indexing, querying, and fast similarity searches. Recently, spaced seeds (specialized patterns designed to accommodate errors or mutations) have increasingly replaced k-mers, enhancing sensitivity in various applications. However, spaced seed hashing is computationally intensive, introducing significant delays. This paper addresses the challenge of efficient spaced seed hashing and presents DuoHash, a framework that enables the efficient computation of several hash functions. Our experimental results demonstrate that the proposed method substantially outperforms existing algorithms, achieving speedups of up to 11x. To illustrate practical utility, we further applied DuoHash to the problem of spaced k-mers counting. The code of DuoHash is available at https://github.com/CominLab/DuoHash/.
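For readers unfamiliar with the spaced seed hashing task named in the abstract above, the following Python sketch may help. It is a naive baseline written for illustration, not the DuoHash method: the function name `spaced_seed_hashes`, the 2-bit nucleotide encoding, and the example mask are all assumptions. A mask of `1` (care) and `0` (don't care) characters slides along the sequence, and only the care positions contribute to each window's hash.

```python
# Illustrative only: a naive spaced-seed hash, NOT the DuoHash algorithm.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding


def spaced_seed_hashes(seq, mask):
    """Return one integer hash per window of `seq`.

    Only positions where `mask` holds '1' are packed into the hash,
    so a mismatch at a '0' position leaves the hash unchanged.
    """
    care = [i for i, m in enumerate(mask) if m == "1"]
    hashes = []
    for start in range(len(seq) - len(mask) + 1):
        h = 0
        for i in care:
            h = (h << 2) | ENCODE[seq[start + i]]  # append 2 bits per care position
        hashes.append(h)
    return hashes


print(spaced_seed_hashes("ACGTAC", "1101"))  # [7, 24, 45]
```

Each window here is rehashed from scratch, so the cost per window grows with the number of care positions. That recomputation is the kind of overhead efficient spaced seed hashing schemes aim to reduce.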
Unsupervised Learning for Tertiary Structure Prediction of Protein Molecules: Systematic Review
Kazi Lutful Kabir
Abstract: Tertiary structures of molecules represent high-dimensional data containing spatial information of hundreds (even thousands) of atoms. Unsupervised learning techniques can be applied to such spatial data to uncover hidden organizations that can be subjected to further evaluation. Such techniques have already been employed in a number of relevant applications, e.g., tracking the conformational changes in a set of biomolecular structures, detecting biologically active tertiary structures from computed structures of proteins, and analyzing molecular dynamics simulations of peptides. This paper presents a comprehensive review of clustering techniques for tertiary (3D) molecular structure data, focusing on protein molecules. The article systematically organizes and analyzes the existing approaches in terms of data representation, methodology, proximity measure, and evaluation metric. It also highlights key open challenges and proposes future research directions to advance this domain.
Fast and Succinct Compression of k-mer Sets with Plain Text Representation of Colored de Bruijn Graphs
Enrico Rossignolo, Matteo Comin
Abstract: A fundamental operation in computational genomics is the reduction of input sequences into their constituent k-mers. Designing space-efficient ways to represent a k-mer collection is essential to improve the scalability of bioinformatics analyses. A widely used approach involves converting the k-mer set into a de Bruijn graph and then producing a compact plain text representation by identifying the minimum path cover. In this article, we present USTAR-CR, a novel algorithm for compressing multiple k-mer sets. USTAR-CR leverages node connectivity principles in the colored de Bruijn graph for a more compact plain text representation, combined with an efficient encoding of k-mer colors. We tested USTAR-CR on real read datasets and compared it with the state-of-the-art GGCAT. USTAR-CR demonstrated superior performance in terms of compression, requiring less memory and being significantly faster (up to 51x). USTAR-CR is available at https://github.com/enricorox/USTAR-CR.
Enhancing Protein Side Chain Packing Using Rotamer Clustering and Machine Learning
Mohammed Alamri, Mohammad Al Sallal, Kamal Al Nasr, Muhammad Akbar, Ahmad Jad Allah
Abstract: Side chain prediction (packing) is a challenging and significant part of predicting a protein’s three-dimensional structure. This area of research is important because of its various applications in protein design. In recent years, many methodologies and techniques have been crafted for side chain prediction, such as DLPacker, FASPR, SCWRL4 and OPUS-Rota4. In this research, we address the problem from a different perspective. We employed a machine learning model to predict the side chain packing of protein molecules given only the Cα trace. We analyzed 32,000 protein molecules to extract important geometrical features that can distinguish between different orientations of side chain rotamers. We designed and implemented a Random Forest model to tackle this problem. Given the accuracy of existing state-of-the-art approaches, our model represents an improvement. The results of our experiment show that Random Forest is highly effective, achieving a total average accuracy of 73.7% for proteins and 73.3% for individual amino acids.
Can Language Models Reason About ICD Codes to Guide the Generation of Clinical Notes?
Ivan Makohon, Jian Wu, Bintao Feng, Yaohang Li
Abstract: The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients’ assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering them. Manually writing clinical notes may take a considerable amount of time, increasing patients’ waiting time and possibly delaying diagnoses. Large language models (LLMs), such as GPT-3, possess the ability to generate news articles that closely resemble human-written ones. We investigate the usage of Chain-of-Thought (CoT) prompt engineering to improve the LLM’s response in clinical note generation. In our prompts, we incorporate International Classification of Diseases (ICD) codes and basic patient information along with similar clinical case examples to investigate how LLMs can effectively formulate clinical notes. We tested our CoT prompt technique on six clinical cases from the CodiEsp test dataset using GPT-4 as our LLM, and our results show that it outperformed the standard zero-shot prompt.
Link Prediction in Disease-Disease Interactions Network Using a Hybrid Deep Learning Model
Ashwag Altayyar, Li Liao
Abstract: Discovering disease-disease associations based on the underlying biological mechanisms is an essential biomedical task in modern biology, as understanding these relationships will assist biologists in discovering the pathogenesis, diagnosis, and intervention of human diseases. Recently, deep learning on graphs and graph neural networks have achieved promising performance in modeling complex biological structures and learning compact representations of interconnected data. Inspired by the success of graph neural networks in learning subgraph representations, we propose a novel framework, SNN-VGA, designed to predict potential disease comorbid pairs. We first model disease-associated genes as subgraphs in the protein-protein interactions network and learn disentangled disease module representations using a subgraph neural network model. The learned embeddings are leveraged by the variational graph auto-encoder to predict disease comorbidity in the disease-disease interactions network. Empirical results from a benchmark dataset demonstrate that our method performs competitively compared with the state-of-the-art model, with an AUROC of 0.96.
Model Selection for Sparse Microbial Network Inference Using Variational Approximation
Shibu Yooseph
Abstract: Microbial communities are often composed of taxa from different taxonomic groups. The associations among the constituent members in a microbial community play an important role in determining the functional characteristics of the community, and these associations can be modeled using an edge weighted graph (microbial network). A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa abundance in each sample. Motivated by microbiome studies that involve a large number of samples collected across a range of study parameters, here we consider the computational problem of identifying the number of microbial networks underlying the observed sample-taxa abundance matrix. Specifically, we consider the problem of determining the number of sparse microbial networks in this setting. We use a mixture model framework to address this problem, and present formulations to model both count data and proportion data. We propose several variational approximation based algorithms that allow the incorporation of the sparsity constraint while estimating the number of components in the mixture model. We evaluate these algorithms on a large number of simulated datasets generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
Haplotype-Based Parallel PBWT for Biobank Scale Data
Kecong Tang, Ahsan Sanaullah, Degui Zhi, Shaojie Zhang
Abstract: Durbin’s positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity of O(MN) for reporting all versus all haplotype matches in a population panel with M haplotypes and N variant sites. However, even this may still be too slow when the number of haplotypes reaches millions. To further reduce the run time, this paper introduces a parallel version of the PBWT algorithms for all versus all haplotype matching, called HP-PBWT (haplotype-based parallel PBWT). HP-PBWT executes the PBWT in parallel by splitting a haplotype panel into blocks of haplotypes. HP-PBWT algorithms achieve parallelization for PBWT construction, reporting all versus all L-long matches, and reporting all versus all set-maximal matches while maintaining memory efficiency. HP-PBWT has an \(O((\frac{M}{T}+T)N)\) time complexity in PBWT construction, and \(O((\frac{M}{T}+T +c^*)N)\) time complexity for reporting all versus all L-long matches and reporting all versus all set-maximal matches, where T is the number of threads and \(c^*\) is the maximum number of matches (of length L or maximum divergence value for L-long matches and set-maximal matches respectively) per haplotype per site. HP-PBWT achieves a 4-fold speed-up on UK Biobank genotyping array data with 30 threads in the IO-included benchmarks. When applied to a dataset of 8 million randomized haplotypes (random binary strings of equal length) in the IO-excluded benchmarks, it can achieve a 22-fold speed-up with 60 cores on the Amazon EC2 server. With further hardware optimization, HP-PBWT is expected to handle billions of haplotypes efficiently.
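To make the O(MN) sequential baseline in the abstract above concrete, here is a minimal Python sketch of the positional prefix array update at the core of Durbin-style PBWT construction. It is a single-threaded illustration, not HP-PBWT: the function name `pbwt_prefix_arrays` and the toy panel are assumptions, and divergence arrays and match reporting are omitted.

```python
# Illustrative only: sequential PBWT prefix-array construction, NOT HP-PBWT.
def pbwt_prefix_arrays(haplotypes):
    """Return the haplotype ordering after each variant site.

    After site k, haplotypes are sorted by their reversed prefixes
    ending at k. Each site needs one stable 0/1 partition of the
    previous ordering, which is O(M), giving O(MN) overall.
    """
    num_haps = len(haplotypes)
    num_sites = len(haplotypes[0]) if num_haps else 0
    order = list(range(num_haps))  # initial ordering: input order
    arrays = []
    for k in range(num_sites):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones  # stable partition keeps earlier sort order
        arrays.append(order)
    return arrays


panel = [[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]  # 4 haplotypes, 3 sites
print(pbwt_prefix_arrays(panel))
# [[0, 2, 1, 3], [2, 3, 0, 1], [0, 1, 2, 3]]
```

Because the ordering at each site depends on the ordering at the previous site, the loop over sites is inherently sequential. Parallel schemes therefore have to divide work along the haplotype axis instead, which matches the abstract's description of splitting the panel into blocks of haplotypes.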
Mammo-Bench: A Large-Scale Benchmark Dataset of Mammography Images
Gaurav Bhole, S. Suba, Nita Parekh
Abstract: Breast cancer remains a significant global health concern, and machine learning algorithms and computer-aided detection systems have shown great promise in enhancing the accuracy and efficiency of mammography image analysis. However, there is a critical need for large benchmark datasets for training deep learning models for breast cancer detection. In this work we developed Mammo-Bench, a large-scale benchmark dataset of mammography images, by collating data from seven well-curated resources, viz., DDSM, INbreast, KAU-BCMD, CMMD, CDD-CESM, DMID, and RSNA Screening Dataset. To ensure consistency across images from diverse sources while preserving clinically relevant features, a preprocessing pipeline that includes breast segmentation, pectoral muscle removal, and intelligent cropping is proposed. The dataset consists of 74,436 high-quality mammographic images from 26,500 patients across 7 countries and is, to the best of our knowledge, one of the largest open-source mammography databases. To show the efficacy of training on the large dataset, the performance of the ResNet101 architecture was evaluated on Mammo-Bench and the results compared with those obtained by training independently on a few member datasets and an external dataset, VinDr-Mammo. An accuracy of 78.8% (with data augmentation of the minority classes) and 77.8% (without data augmentation) was achieved on the proposed benchmark dataset, compared to the other datasets, for which accuracy varied from 25% to 69%. Notably, improved prediction of the minority classes is observed with the Mammo-Bench dataset. These results establish baseline performance and demonstrate Mammo-Bench's utility as a comprehensive resource for developing and evaluating mammography analysis systems.
MetaEdit: Computational Identification of RNA Editing in Microbiomes
Arpit Mehta, Vitalii Stebliankin, Kalai Mathee, Giri Narasimhan
Abstract: RNA editing is a pivotal post-transcriptional mechanism that plays a critical role in the regulation of some genes by altering their mRNA sequences, thereby influencing the resulting protein sequence, structure, and the functional and cellular responses. While extensively studied in eukaryotes, its significance and prevalence in prokaryotic microbiomes remain underexplored. Given the crucial role of microbiomes in various biological processes and their potential impact on human health and disease, understanding RNA editing within these communities could reveal new insights into microbial gene regulation and adaptation. The lack of studies to detect RNA editing in microbiomes motivates the need for developing bioinformatic strategies to bridge this research gap. This study introduces MetaEdit, a computational tool designed to detect RNA editing in bacterial microbiomes. We apply MetaEdit to metatranscriptomic and metagenomic datasets to identify and characterize RNA editing events in the human gut microbiome. Our results demonstrate the presence of RNA editing in Escherichia coli and provide a foundation for future investigations into the functional implications of RNA editing in microbiomes. Our findings are supported by previously reported research but need validation with laboratory experiments. The developed pipeline is generic and can be applied to find RNA editing in any sequencing datasets containing both metagenomic and metatranscriptomic data.
Availability: Pipeline is available from https://biorg.cs.fiu.edu/metaedit/.
Supplementary information: None.
Drug-Centric Prior Improves Drug Response Modeling in Partially Overlapping Pharmacogenomic Screens
Dharani Thirumalaisamy, Sunil K. Joshi, Stephen E. Kurtz, Tania Q. Vu, Jeffrey W. Tyner, Mehmet Gönen, Olga Nikolova
Abstract: With the accumulation of large-scale genomic data such as whole-genome RNA sequencing, copy number, and mutation profiles for tens of thousands of samples, associated with screening thousands of small molecules and other perturbagens, arises the question of how to best leverage partially overlapping datasets generated at different facilities. As research groups across the world continue to generate drug screens of variable size and quality, the need for approaches that can learn from such partially overlapping experiments and improve the signal to noise ratio emerges with increasing importance. We present an application of a Bayesian group factor analysis model, where we employ a drug-centric prior to transfer information about drugs screened in the same samples in multiple datasets. We show that joint models leveraging partially overlapping pharmacogenomic datasets from the Broad and Sanger institutes can overall improve drug signature identification.
Improving Inter-helical Residue Contact Prediction in α-Helical Transmembrane Proteins Using Structural Neighborhood Crowdedness Information
Aman Sawhney, Li Liao
Abstract: Residue contact maps are a useful compressed representation that can be used as constraints for structural modeling, but can also help identify inter-helical binding sites and are hence effective on their own. In this work, we hypothesize that crowdedness around a target residue pair influences whether it is a contact point. We developed two measures of crowdedness in a residue’s 3D neighborhood: bin counts, defined in terms of relative residue distance, and residue contact number for inter-helical TM proteins, defined as the number of residues within a specified relative distance. Since unsupervised language models such as MSA transformer, trained on millions of sequences, are very accurate but also complementary to our approach, we combined the MSA transformer score with our proposed features to assess the impact of crowdedness on residue contact prediction. We found that crowdedness measures can in fact increase the upper bound performance by at least 7.65% average precision in cross validation experiments and by at least 11.59% average precision in held-out experiments. Further, we developed a method to “transfer” this information when ground truth crowdedness measures are unavailable. Our approach outperformed MSA transformer by at least 1.15% average precision in cross validation experiments and 1.85% average precision in held-out experiments.
Explaining Protein Folding Networks Using Integrated Gradients and Attention Mechanisms
Rukmangadh Sai Myana, Sumit Kumar Jha
Abstract: Protein folding prediction models like AlphaFold and ColabFold have revolutionized structural biology by providing accurate protein structures. However, these models present challenges when it comes to understanding how they arrive at their decisions. In this paper, we propose the application of Explainable AI (XAI) techniques, specifically Integrated Gradients and Attention Mechanisms, to elucidate the decision-making process of these complex networks. We conduct computational experiments to evaluate the effectiveness of these methods and discuss potential implications for the field.
Computationally Reconstructing the Evolution of Cancer Progression Risk
Kefan Cao, Russell Schwartz
Abstract: Understanding the evolution of cancer in its early stages is critical to identifying key drivers of cancer progression and developing better early diagnostics or prophylactic treatments. Early cancer is difficult to observe, though, since it is generally asymptomatic until extensive genetic damage has accumulated. In this study, we develop a computational approach to infer how once-healthy cells enter into and become committed to a pathway of aggressive cancer. We accomplish this through a strategy of using tumor phylogenetics to look backwards in time to earlier stages of tumor development combined with machine learning to infer how progression risk changes over those stages. We apply this paradigm to point mutation data from a set of cohorts from the Cancer Genome Atlas (TCGA) to formulate models of how progression risk evolves from the earliest stages of tumor growth, as well as how this evolution varies within and between cohorts. The results suggest general mechanisms by which risk develops as a cell population commits to aggressive cancer, but with significant variability between cohorts and individuals. These results imply limits to the potential for earlier diagnosis and intervention while also providing grounds for hope in extending these beyond current practice.
Cancer Diseases Classification with Sparse Neural Networks: An Information-Theoretic Approach
Zahra Jandaghi, Sixiang Zhang, Xiuzhen Huang, Liming Cai
Abstract: Machine learning is indispensable for biomedical data modeling and classification. Tasks involving large, high-dimensional datasets are nevertheless computationally intensive and approximation methods are often sought to scale down the volume of raw data or model size without compromising substantial information embedded within the data. However, previous approximation methods have yielded mixed results and have yet to establish a clear framework linking feature selection and model sparsification. In this paper, we present an information-theoretic approach for cancer classification by addressing two prominent questions in data model approximation: how to identify a minimal set of critical features in cancer microarray data and how to design sparse neural networks that are effective and efficient for cancer classification. Our study highlights a key connection between these two challenges. In particular, we introduce a mutual information (MI)-based method to select a highly informative subset of genes from extensive microarray gene expression data. Each selected subset of genes, up to two orders of magnitude smaller than the original gene set, demonstrates superior performance in cancer classification compared to the full dataset. Additionally, the MI-based method enables the design of sparsified neural networks that consistently maintain or even improve classification performance compared to fully connected networks. Our test results reveal that sparsified networks selectively retain connections to the critical genes identified by the MI-based filtering method, effectively ignoring contributions from irrelevant genes.
Epistatic Density of Viral Variants in Acute and Chronic HCV Patients
Alina Nemira, Akshay Juyal, Pavel Skums, Alexander Zelikovsky
Abstract: RNA viruses exhibit high mutation rates due to the lack of proofreading mechanisms during replication, leading to diverse intra-host viral populations. Variants with higher fitness tend to dominate the population due to enhanced transmissibility and immune escape. Fitness of viral variants depends on individual SNVs and epistatic links between pairs of SNVs as well as competition with other viral variants within the population. Recent machine learning methods have successfully predicted emerging COVID-19 variants based on epistatic SNV links, implying that SNV links contribute to fitness of viral variants. We define the epistatic density of a viral variant as the number of positively linked SNV pairs between mutated positions in its genome. We computed epistatic density of intra-host Hepatitis C Virus (HCV) populations sampled from 85 chronic and 28 acute patients with HCV 1a genotypes. On average, epistatic density was higher in chronic patients than in acute cases. Additionally, the epistatic density distributions are more irregular and choppy in acute populations. Finally, we applied the epistatic density properties to distinguish between intra-host populations of chronic and acute HCV patients.
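The definition of epistatic density given in the abstract above (the number of positively linked SNV pairs between mutated positions) can be read as a simple counting problem. The Python sketch below follows that reading. The function name `epistatic_density`, the input format, and the example positions are assumptions; the abstract does not say how positive links are inferred, so they are taken here as a given input.

```python
from itertools import combinations


# Illustrative only: counts linked pairs per the abstract's definition.
# How positive links are detected is outside this sketch (assumed given).
def epistatic_density(mutated_positions, positive_links):
    """Count pairs of mutated positions that are positively linked.

    mutated_positions: iterable of genome positions mutated in the variant.
    positive_links: iterable of (pos_a, pos_b) pairs, treated as unordered.
    """
    links = {frozenset(pair) for pair in positive_links}
    positions = sorted(set(mutated_positions))
    return sum(
        1 for a, b in combinations(positions, 2) if frozenset((a, b)) in links
    )


# Hypothetical variant with three mutated positions; two of its pairs are linked.
print(epistatic_density([3, 7, 12], [(3, 7), (12, 7), (5, 12)]))  # 2
```

The link `(5, 12)` is ignored because position 5 is not mutated in this hypothetical variant, and `(12, 7)` counts even though it is listed in reverse order.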
Applying Genetic Algorithm with Saltations to MAX-3SAT
Ryan Alomair, Hafsa Farooq, Daniel Novikov, Akshay Juyal, Alexander Zelikovsky
Abstract: Punctuated equilibrium, the pattern of rapid, significant mutational change, had not been observed in real time until the SARS-CoV-2 viral variants emerged with multiple mutations occurring together. Using epistasis (the circumstance in which the effect of one gene is influenced by the presence of one or more other genes) as a framework to understand this phenomenon, we can capture the relationships between different combinations of mutations in a network where each node is an individual mutation and each edge represents the interaction between two mutations, allowing us to effectively model the fitness landscape of viral variants. In exploring these relationships, it has been found that dense subgraphs within the network correspond to emerging saltation. We refer to this as an evolutionary jump and incorporate it into a genetic algorithm (GA + EJ), which can uncover high-fitness regions seemingly distant from the variant(s) from which they were originally derived. We applied it to the MAX-3SAT problem and found improvement for satisfiable problem instances with 600 variables and 2550 clauses, as well as 100 variables and 429 clauses.
Computing Gram Matrix for SMILES Strings Using RDKFingerprint and Sinkhorn-Knopp Algorithm
Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson
Abstract: SMILES (Simplified Molecular Input Line Entry System) strings are widely used to represent molecular structures in cheminformatics and drug discovery. However, effectively transforming these string-based representations into meaningful numerical features for machine learning remains a significant challenge due to the complex, non-Euclidean nature of molecular structures. Traditional fingerprint-based and deep learning approaches often struggle with scalability, interpretability, or computational efficiency. Our approach leverages the Morgan Fingerprint to generate molecular feature representations, followed by a pairwise kernel function to compute a structured similarity matrix. We then refine this matrix using the Sinkhorn-Knopp algorithm, ensuring it satisfies probabilistic constraints. To reduce dimensionality, we apply Kernel Principal Component Analysis (PCA), producing compact embeddings suitable for downstream machine learning tasks. We conduct a comprehensive empirical evaluation of the proposed method, assessing it on drug subcategory prediction (a classification task) and on solubility AlogPS (“aqueous solubility and octanol/water partition coefficient”) prediction (a regression task) using the benchmark SMILES string dataset. The outcomes show that the proposed method outperforms baseline methods in supervised analysis and has potential uses in molecular design and drug discovery. By integrating kernel-based learning with probabilistic refinement, our method offers a promising alternative to existing cheminformatics techniques.
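The Sinkhorn-Knopp refinement step mentioned in the abstract above can be sketched in a few lines of Python. This is a generic textbook version, not the authors' implementation: it assumes a square matrix with strictly positive entries, reads "probabilistic constraints" as rows and columns each summing to 1 (a doubly stochastic matrix), and uses a hypothetical 3x3 kernel matrix rather than real fingerprint similarities.

```python
# Illustrative only: generic Sinkhorn-Knopp scaling, not the paper's code.
def sinkhorn_knopp(matrix, iterations=200):
    """Alternately normalize rows and columns of a positive square matrix.

    For strictly positive entries the iterates converge to a doubly
    stochastic matrix (every row and every column sums to 1).
    The input is copied, so the caller's matrix is left unchanged.
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        for row in m:  # row normalization
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):  # column normalization
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m


# Hypothetical pairwise similarity (kernel) values for three molecules.
kernel = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.4], [0.1, 0.4, 1.0]]
scaled = sinkhorn_knopp(kernel)
print([round(sum(row), 6) for row in scaled])  # each row sum is close to 1
```

A fixed iteration count keeps the sketch short. A practical version would more likely stop once the row and column sums fall within a chosen tolerance of 1.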
Enhancing Privacy Preservation and Reducing Analysis Time with Federated Transfer Learning in Digital Twins-Based Computed Tomography Scan Analysis
Avais Jan, Qasim Zia, Murray Patterson
Abstract: The application of Digital Twin (DT) technology and Federated Learning (FL) has great potential to change the field of biomedical image analysis, particularly for Computed Tomography (CT) scans. This paper presents Federated Transfer Learning (FTL) as a new Digital Twin-based CT scan analysis paradigm. FTL uses pre-trained models and knowledge transfer between peer nodes to solve problems such as data privacy, limited computing resources, and data heterogeneity. The proposed framework allows real-time collaboration between cloud servers and Digital Twin-enabled CT scanners while protecting patient identity. We apply the FTL method to a heterogeneous CT scan dataset and assess model performance using convergence time, model accuracy, precision, recall, F1 score, and confusion matrix. It performs better than conventional FL and Clustered Federated Learning (CFL) methods in precision, accuracy, recall, and F1-score. The technique is beneficial in settings where the data is not independently and identically distributed (non-IID), and it offers reliable, efficient, and secure solutions for medical diagnosis. These findings highlight the possibility of using FTL to improve decision-making in digital twin-based CT scan analysis, secure and efficient medical image analysis, promote privacy, and open new possibilities for applying precision medicine and smart healthcare systems.
Improved Graph-Based Antibody-Aware Epitope Prediction with Protein Language Model-Based Embeddings
Mansoor Ahmed, Sarwan Ali, Avais Jan, Imdad Ullah Khan, Murray Patterson
Abstract: The accurate identification of B-cell epitopes is critical in antibody design, diagnostics, and immunotherapies. Many in silico approaches have recently been proposed to predict epitopes, but these approaches struggle primarily because of the variational and conformational nature of epitopes. However, deep learning-based approaches have recently shown great promise in achieving better performance at the epitope prediction task. In this paper, we employ a graph convolutional network (GCN) coupled with pre-trained protein language model (PLM)-based embeddings for epitope prediction on a benchmark antibody-specific epitope prediction (AsEP) dataset. We explore the use of different PLM-embedding methods on the epitope prediction task and show that the choice of PLM embeddings impacts the performance. Specifically, we find that antibody-specific PLMs such as AntiBERTy and general PLMs such as ProtTrans and ESM-2 for antigens provide improved epitope prediction performance with an AUCROC of 0.65, precision of 0.28, and recall of 0.46. The source code is available at: https://github.com/mansoor181/walle-pp.git
Leveraging RNA LLMs for 3D Structure Prediction via Data Augmentation
Sixiang Zhang, Harish Anand, Liming Cai
Abstract
Ribonucleic acid (RNA) is a complex macromolecule essential for living organisms to function in cells. Understanding its three-dimensional (3D) structure is critical for elucidating its cellular roles. However, computational prediction of RNA 3D structures remains a significant challenge due to the vast conformational space that RNA molecules can adopt. Although machine learning, particularly deep learning-based methods, has recently gained traction, the lack of a large dataset of native RNA structures for training has kept these methods from achieving the desired performance. In this study, we leverage pre-trained RNA large language models to predict RNA 3D conformations directly from input RNA sequences. Specifically, we introduce data augmentation techniques to address the issue of data scarcity in RNA 3D structures. The present paper focuses on predicting backbone conformations to evaluate the effectiveness of our method. Preliminary results demonstrate promising accuracy, with predicted structures achieving an average RMSD of 3.85 Å against native 3D structures in the PDB, a 50% reduction in error compared to predictions made without the data augmentation method.
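The RMSD figure quoted above is the root-mean-square deviation between corresponding atom positions in the predicted and native structures. The sketch below shows the basic formula only, and assumes the two coordinate sets are already superimposed and listed in the same atom order; structure-comparison pipelines usually perform an optimal superposition first, and the abstract does not say which procedure the authors used. The function name `rmsd` and the coordinates are illustrative.

```python
import math


def rmsd(predicted, native):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes the structures are already aligned and atoms are in matching order.
    """
    if len(predicted) != len(native) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (px, py, pz), (nx, ny, nz) in zip(predicted, native):
        squared_sum += (px - nx) ** 2 + (py - ny) ** 2 + (pz - nz) ** 2
    return math.sqrt(squared_sum / len(predicted))


# Hypothetical two-atom backbone fragment, coordinates in angstroms.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
nat = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
value = rmsd(pred, nat)  # sqrt((9 + 16) / 2)
```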
EfficientNet in Digital Twin-Based Cardiac Arrest Prediction and Analysis
Qasim Zia, Avais Jan, Zafar Iqbal, Muhammad Mumtaz Ali, Mukarram Ali, Murray Patterson
Abstract
Cardiac arrest is one of the biggest global health problems, and early identification and management are key to improving the patient's prognosis. In this paper, we propose a novel framework that combines an EfficientNet-based deep learning model with a digital twin system to improve the early detection and analysis of cardiac arrest. We use EfficientNet, with its compound scaling, to learn the features of cardiovascular images. In parallel, the digital twin creates a realistic and individualized model of the patient's cardiovascular system based on data received from Internet of Things (IoT) devices attached to the patient, which supports continuous assessment of the patient and of the impact of possible treatment plans. Our experiments show that the proposed system is both highly accurate in its predictions and efficient. Combining deep learning with digital twin (DT) technology opens the possibility of an active, individualized approach to predicting cardiac disease.
AmpliconHunter: A Scalable Tool for PCR Amplicon Prediction from Microbiome Samples
Rye Howard-Stone, Ion I. Măndoiu
Abstract
Sequencing of PCR amplicons generated using degenerate primers (typically targeting a region of the 16S ribosomal gene) is widely used in metagenomics to profile the taxonomic composition of complex microbial samples. To reduce taxonomic biases in primer selection, it is important to conduct in silico PCR analyses of the primers against large collections of up to millions of bacterial genomes. However, existing in silico PCR tools have impractical running times for analyses of this scale. In this paper, we introduce AmpliconHunter, a highly scalable in silico PCR package distributed as an open-source command-line tool and publicly available through a user-friendly web interface at https://ah1.engr.uconn.edu/. AmpliconHunter implements an accurate nearest-neighbor model for melting temperature calculations, allowing for primer-template hybridization with mismatches, along with three complementary methods for estimating off-target amplification. By taking advantage of multi-core parallelism and SIMD operations available on modern CPUs, the AmpliconHunter web server can complete in silico PCR analyses of commonly used degenerate primer pairs against the 2.4M genomes in the latest AllTheBacteria collection in as few as 6–7 h.
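To illustrate what matching a degenerate primer against a template involves, the sketch below scans a sequence for sites where an IUPAC-coded primer binds with at most a given number of mismatches. It is a naive illustration only, not AmpliconHunter's algorithm: it ignores melting temperature, reverse-complement strands, and the parallel and SIMD optimizations described above. The function name `find_primer_sites` and the example sequences are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_primer_sites(primer, template, max_mismatches=0):
    """Return (start, mismatches) for each template position the primer matches.

    A template base counts as a match when it is one of the bases allowed by
    the primer's IUPAC code at that position.
    """
    primer = primer.upper()
    template = template.upper()
    hits = []
    for start in range(len(template) - len(primer) + 1):
        mismatches = 0
        for offset, code in enumerate(primer):
            if template[start + offset] not in IUPAC[code]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, abandon this start position
        else:
            # Inner loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits


# Hypothetical short primer and template, for illustration only.
exact = find_primer_sites("GTGYCAGC", "AAGTGCCAGCTT")
fuzzy = find_primer_sites("GTGYCAGC", "AAGTGACAGCTT", max_mismatches=1)
```

This scan costs time proportional to the product of primer and template length per genome, which hints at why collections of millions of genomes call for the optimizations the abstract mentions.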
Neuromorphic Spiking Neural Network Based Classification of COVID-19 Spike Sequences
Taslim Murad, Prakash Chourasia, Sarwan Ali, Avais Jan, Murray Patterson
Abstract
The volume of available SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) data has grown exponentially since the COVID-19 pandemic, opening the door to research on the virus's behavior. Researchers have conducted various studies, such as genomic surveillance, to gain a deeper understanding of the virus so that efficient prevention mechanisms can be developed. However, the unstable nature of the virus (rapid mutations, multiple hosts, etc.) creates challenges in designing analytical systems for it. Therefore, we propose a neural network (NN)-based mechanism to perform an efficient analysis of SARS-CoV-2 data, as NNs exhibit generalized behavior upon training. Moreover, rather than using the full-length genome of the virus, we apply our method to its spike region, as this region is known to have predominant mutations and is used to attach to the host cell membrane. In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed-length numerical representation and then uses a Neuromorphic Spiking Neural Network to classify those sequences. We compare the performance of our method with various baselines using real-world SARS-CoV-2 spike sequence data and show that our method achieves higher predictive accuracy than recent baselines.
Backmatter
- Title
- Computational Advances in Bio and Medical Sciences
- Editors
- Mohammed Alser
- Mukul S. Bansal
- Yury Khudyakov
- Serghei Mangul
- Ion I. Mandoiu
- Marmar R. Moussa
- Murray Patterson
- Sanguthevar Rajasekaran
- Pavel Skums
- Shibu Yooseph
- Alexander Zelikovsky
- Copyright Year
- 2026
- Publisher
- Springer Nature Switzerland
- Electronic ISBN
- 978-3-032-02489-3
- Print ISBN
- 978-3-032-02488-6
- DOI
- https://doi.org/10.1007/978-3-032-02489-3