
Computational Advances in Bio and Medical Sciences

13th International Conference, ICCABS 2025, Atlanta, GA, USA, January 12–14, 2025, Revised Selected Papers

  • 2026
  • Book

About this book

This book constitutes the refereed proceedings of the 13th International Conference on Computational Advances in Bio and Medical Sciences, ICCABS 2025, held in Atlanta, GA, USA, on January 12–14, 2025. The 26 full papers presented in these proceedings were carefully reviewed and selected from 75 submissions. ICCABS aims to bring together researchers, scientists, and students from academia, laboratories, and industry to discuss recent advances in computational techniques and applications in biology, medicine, and drug discovery.

Table of Contents

Frontmatter
A Benchmarking Study of Random Projections and Principal Components for Dimensionality Reduction Strategies in Single Cell Analysis
Abstract
Principal Component Analysis (PCA) has long been a cornerstone in dimensionality reduction for high-dimensional data, including single-cell RNA sequencing (scRNA-seq). However, PCA’s performance typically degrades with increasing data size, can be sensitive to outliers, and assumes linearity. Recently, Random Projection (RP) methods have emerged as promising alternatives, addressing some of these limitations. This study systematically and comprehensively evaluates PCA and RP approaches, including Singular Value Decomposition (SVD) and randomized SVD, alongside Sparse and Gaussian Random Projection algorithms, with a focus on computational efficiency and downstream analysis effectiveness. We benchmark performance using multiple scRNA-seq datasets including labeled and unlabeled publicly available datasets. We apply Hierarchical Clustering and Spherical K-Means clustering algorithms to assess downstream clustering quality. For labeled datasets, clustering accuracy is measured using the Hungarian algorithm and Mutual Information. For unlabeled datasets, the Dunn Index and Gap Statistic capture cluster separation. Across both dataset types, the Within-Cluster Sum of Squares (WCSS) metric is used to assess variability. Additionally, locality preservation is examined, with RP outperforming PCA in several of the evaluated metrics.
Our results demonstrate that RP not only surpasses PCA in computational speed but also rivals and, in some cases, exceeds PCA in preserving data variability and clustering quality. By providing a thorough benchmarking of PCA and RP methods, this work offers valuable insights into selecting optimal dimensionality reduction techniques, balancing computational performance, scalability, and the quality of downstream analyses.
Mohamed Abdelnaby, Marmar R. Moussa
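To make the Gaussian random projection idea from this abstract concrete, here is a minimal sketch using only the Python standard library. It is not the authors' benchmark code: the function name `gaussian_random_projection`, the toy 20 x 500 matrix, and the target dimension of 100 are all assumptions chosen for illustration.

```python
import math
import random

def gaussian_random_projection(rows, k, seed=0):
    """Project each d-dimensional row to k dimensions with a Gaussian matrix.

    Entries are drawn from N(0, 1/k), so squared distances are preserved
    in expectation (Johnson-Lindenstrauss style scaling).
    """
    rng = random.Random(seed)
    d = len(rows[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [
        [sum(x[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        for x in rows
    ]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical toy "expression matrix": 20 cells x 500 genes of random values.
data_rng = random.Random(1)
cells = [[data_rng.random() for _ in range(500)] for _ in range(20)]
projected = gaussian_random_projection(cells, k=100)

# Ratio of projected to original distance for one pair of cells;
# values near 1 indicate that the pairwise distance was roughly preserved.
ratio = euclidean(projected[0], projected[1]) / euclidean(cells[0], cells[1])
```

Unlike PCA, the projection matrix here does not depend on the data, which is one reason random projections can be cheap to compute.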
Resistance Genes are Distinct in Protein-Protein Interaction Networks According to Drug Class and Gene Mobility
Abstract
With growing calls for increased surveillance of antibiotic resistance as an escalating global health threat, improved bioinformatic tools are needed to track antibiotic resistance genes (ARGs) across One Health domains. Most studies to date profile ARGs using sequence homology, but such approaches provide limited information about the broader context or function of the ARG in bacterial genomes. Here we introduce a new pipeline, PPI-ARG-finder, for identifying ARGs in genomic data that employs machine learning analysis of Protein-Protein Interaction Networks (PPINs) as a means to improve predictions of ARGs while also providing vital information about the genetic context, such as gene mobility. A random forest model was trained to effectively differentiate between ARGs and nonARGs and was validated using the PPINs of ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter cloacae), which represent urgent threats to human health because they tend to be multi-antibiotic resistant. The pipeline exhibited robustness in discriminating ARGs from nonARGs, achieving an average area under the precision-recall curve of 88%. We further identified that the neighbors of ARGs, i.e., genes connected to ARGs by only one edge, were disproportionately associated with mobile genetic elements, which is consistent with the understanding that ARGs tend to be more mobile compared to randomly sampled genes in the PPINs. PPI-ARG-finder showcases the utility of PPINs in discerning distinctive characteristics of ARGs within a broader genomic context and in differentiating ARGs from nonARGs through network-based attributes and interaction patterns.
Nazifa Ahmed Moumi, Connor L. Brown, Shafayat Ahmed, Peter J. Vikesland, Amy Pruden, Liqing Zhang
DuoHash: Fast Hashing of Spaced Seeds with Application to Spaced K-mers Counting
Abstract
Alignment-free genomic sequence analysis has facilitated high-throughput processing within numerous bioinformatics workflows. A central task in alignment-free applications is hashing k-mers, commonly used for indexing, querying, and fast similarity searches. Recently, spaced seeds—a specialized pattern designed to accommodate errors or mutations—have increasingly replaced k-mers, enhancing sensitivity in various applications. However, spaced seed hashing is computationally intensive, introducing significant delays. This paper addresses the challenge of efficient spaced seed hashing and presents DuoHash, a framework that enables the efficient computation of several hash functions. Our experimental results demonstrate that the proposed method substantially outperforms existing algorithms, achieving speedups of up to 11x. To illustrate practical utility, we further applied DuoHash to the problem of spaced k-mers counting. The code of DuoHash is available at https://​github.​com/​CominLab/​DuoHash/​.
Leonardo Gemin, Cinzia Pizzi, Matteo Comin
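For readers unfamiliar with spaced seeds, the following naive sketch shows what a spaced k-mer is and how counting works. It is a baseline illustration only and does not reproduce DuoHash's hashing scheme, which the abstract does not detail; the mask `1101` and the function names are hypothetical.

```python
from collections import Counter

def spaced_kmers(sequence, mask):
    """Yield the spaced k-mer at each position.

    Only characters under a '1' in the mask are kept; '0' positions are
    "don't care" positions that tolerate mismatches.
    """
    span = len(mask)
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    for start in range(len(sequence) - span + 1):
        yield "".join(sequence[start + i] for i in keep)

def count_spaced_kmers(sequence, mask):
    """Count occurrences of each spaced k-mer in the sequence."""
    return Counter(spaced_kmers(sequence, mask))

# Toy example: span-4 seed that ignores the third position.
counts = count_spaced_kmers("ACGTACGT", "1101")
```

This version re-reads every kept position at every offset, which is the cost that faster spaced seed hashing methods aim to reduce.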
Unsupervised Learning for Tertiary Structure Prediction of Protein Molecules: Systematic Review
Abstract
Tertiary structures of molecules represent high-dimensional data containing spatial information of hundreds (even thousands) of atoms. Unsupervised learning techniques can be applied to such spatial data to uncover hidden organizations that can be subjected to further evaluation. Such techniques have already been employed in a number of relevant applications e.g., tracking the conformational changes in a set of biomolecular structures, detecting biologically active tertiary structures from computed structures of proteins, analyzing molecular dynamics simulation of peptides, and so on. This paper presents a comprehensive review of clustering techniques for tertiary (3D) molecular structure data focusing on protein molecules. In fact, the article systematically organizes and analyzes the existing approaches in terms of data representation, methodology, proximity measure, and evaluation metric. Besides, it highlights key open challenges and proposes future research directions to advance this domain.
Kazi Lutful Kabir
Fast and Succinct Compression of k-mer Sets with Plain Text Representation of Colored de Bruijn Graphs
Abstract
A fundamental operation in computational genomics is the reduction of input sequences into their constituent k-mers. Designing space-efficient ways to represent a k-mer collection is essential to improve the scalability of bioinformatics analyses. A widely used approach involves converting the k-mer set into a de Bruijn graph and then producing a compact plain text representation by identifying the minimum path cover. In this article, we present USTAR-CR, a novel algorithm for compressing multiple k-mer sets. USTAR-CR leverages node connectivity principles in the colored de Bruijn graph for a more compact plain text representation, combined with an efficient encoding of k-mer colors. We tested USTAR-CR on real read datasets and compared it with the state-of-the-art GGCAT. USTAR-CR demonstrated superior performance in terms of compression, requiring less memory and being significantly faster (up to 51x). USTAR-CR is available at https://github.com/enricorox/USTAR-CR.
Enrico Rossignolo, Matteo Comin
Enhancing Protein Side Chain Packing Using Rotamer Clustering and Machine Learning
Abstract
Side chain prediction, or packing, is a challenging and significant part of predicting a protein's structure in three-dimensional space. This area of research is important because of its many applications in protein design. In recent years, many methods have been developed for side chain prediction, such as DLPacker, FASPR, SCWRL4, and OPUS-Rota4. In this research, we address the problem from a different perspective. We employed a machine learning model to predict the side chain packing of protein molecules given only the Cα trace. We analyzed 32,000 protein molecules to extract important geometrical features that can distinguish between different orientations of side chain rotamers. We designed and implemented a Random Forest model to tackle this problem. Relative to the accuracy of existing state-of-the-art approaches, our model represents an improvement. The results of our experiment show that Random Forest is highly effective, achieving a total average accuracy of 73.7% for proteins and 73.3% for individual amino acids.
Mohammed Alamri, Mohammad Al Sallal, Kamal Al Nasr, Muhammad Akbar, Ahmad Jad Allah
Can Language Models Reason About ICD Codes to Guide the Generation of Clinical Notes?
Abstract
The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients’ assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering them. Manually writing clinical notes may increase the patient’s waiting time and could delay diagnoses. Large language models (LLMs), such as GPT-3, can generate news articles that closely resemble human-written ones. We investigate the use of Chain-of-Thought (CoT) prompt engineering to improve the LLM’s response in clinical note generation. In our prompts, we incorporate International Classification of Diseases (ICD) codes and basic patient information along with similar clinical case examples to investigate how LLMs can effectively formulate clinical notes. We tested our CoT prompt technique on six clinical cases from the CodiEsp test dataset using GPT-4 as our LLM, and our results show that it outperformed the standard zero-shot prompt.
Ivan Makohon, Jian Wu, Bintao Feng, Yaohang Li
Link Prediction in Disease-Disease Interactions Network Using a Hybrid Deep Learning Model
Abstract
Discovering disease-disease association based on the underlying biological mechanisms is an essential biomedical task in modern biology as understanding these relationships will assist biologists in discovering the pathogenesis, diagnosis, and intervention of human diseases. Recently, deep learning on graph and graph neural networks have achieved promising performance in modeling complex biological structures and learning compact representations of interconnected data. Inspired by the success of graph neural networks in learning subgraph representations, we propose a novel framework, SNN-VGA, designed to predict potential disease comorbid pairs. We first model disease-associated genes as subgraphs in the protein-protein interactions network and learn disentangled disease module representations using a subgraph neural network model. The learned embeddings are leveraged by the variational graph auto-encoder to predict disease comorbidity in the disease-disease interactions network. Empirical results from a benchmark dataset demonstrate that our method performs competitively compared with the state-of-the-art model, with an AUROC of 0.96.
Ashwag Altayyar, Li Liao
Model Selection for Sparse Microbial Network Inference Using Variational Approximation
Abstract
Microbial communities are often composed of taxa from different taxonomic groups. The associations among the constituent members in a microbial community play an important role in determining the functional characteristics of the community, and these associations can be modeled using an edge-weighted graph (microbial network). A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa abundance in each sample. Motivated by microbiome studies that involve a large number of samples collected across a range of study parameters, here we consider the computational problem of identifying the number of microbial networks underlying the observed sample-taxa abundance matrix. Specifically, we consider the problem of determining the number of sparse microbial networks in this setting. We use a mixture model framework to address this problem, and present formulations to model both count data and proportion data. We propose several variational approximation based algorithms that allow the incorporation of the sparsity constraint while estimating the number of components in the mixture model. We evaluate these algorithms on a large number of simulated datasets generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
Shibu Yooseph
Haplotype-Based Parallel PBWT for Biobank Scale Data
Abstract
Durbin’s positional Burrows-Wheeler transform (PBWT) enables algorithms with the optimal time complexity of O(MN) for reporting all versus all haplotype matches in a population panel with M haplotypes and N variant sites. However, even this efficiency may still be too slow when the number of haplotypes reaches millions. To further reduce the run time, this paper introduces a parallel version of the PBWT algorithms for all versus all haplotype matching, called HP-PBWT (haplotype-based parallel PBWT). HP-PBWT executes the PBWT in parallel by splitting a haplotype panel into blocks of haplotypes. HP-PBWT algorithms achieve parallelization for PBWT construction, reporting all versus all L-long matches, and reporting all versus all set-maximal matches while maintaining memory efficiency. HP-PBWT has an \(O((\frac{M}{T}+T)N)\) time complexity in PBWT construction, and \(O((\frac{M}{T}+T +c^*)N)\) time complexity for reporting all versus all L-long matches and reporting all versus all set-maximal matches, where T is the number of threads and \(c^*\) is the maximum number of matches (of length L or maximum divergence value for L-long matches and set-maximal matches respectively) per haplotype per site. HP-PBWT achieves a 4-fold speed-up on UK Biobank genotyping array data with 30 threads in the IO-included benchmarks. When applied to a dataset of 8 million randomized haplotypes (random binary strings of equal length) in the IO-excluded benchmarks, HP-PBWT achieves a 22-fold speed-up with 60 cores on an Amazon EC2 server. With further hardware optimization, HP-PBWT is expected to handle billions of haplotypes efficiently.
Kecong Tang, Ahsan Sanaullah, Degui Zhi, Shaojie Zhang
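As background for the O(MN) bound mentioned in this abstract, the sketch below builds the positional prefix arrays of the sequential PBWT with one stable partition per site. It illustrates only the sequential construction; the block-wise parallel scheme of HP-PBWT is not shown, and the function name and toy panel are assumptions.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return the positional prefix array after each site.

    haplotypes: list of M equal-length lists of 0/1 alleles.
    After site k, haplotypes are ordered by their reversed prefixes
    ending at site k. Each site costs O(M), giving O(MN) overall.
    """
    m = len(haplotypes)
    n = len(haplotypes[0]) if m else 0
    order = list(range(m))
    orders = []
    for site in range(n):
        # Stable partition: haplotypes with allele 0 first, then allele 1,
        # each group keeping the relative order from the previous site.
        zeros = [h for h in order if haplotypes[h][site] == 0]
        ones = [h for h in order if haplotypes[h][site] == 1]
        order = zeros + ones
        orders.append(order)
    return orders

# Hypothetical toy panel: 4 haplotypes, 3 biallelic sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
orders = pbwt_prefix_arrays(panel)
```

Because haplotypes that share a long suffix ending at the current site end up adjacent in this ordering, matches can be reported by scanning neighbours rather than comparing all pairs.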
Mammo-Bench: A Large-Scale Benchmark Dataset of Mammography Images
Abstract
Breast cancer remains a significant global health concern, and machine learning algorithms and computer-aided detection systems have shown great promise in enhancing the accuracy and efficiency of mammography image analysis. However, there is a critical need for large benchmark datasets for training deep learning models for breast cancer detection. In this work we developed Mammo-Bench, a large-scale benchmark dataset of mammography images, by collating data from seven well-curated resources, viz., DDSM, INbreast, KAU-BCMD, CMMD, CDD-CESM, DMID, and RSNA Screening Dataset. To ensure consistency across images from diverse sources while preserving clinically relevant features, a preprocessing pipeline that includes breast segmentation, pectoral muscle removal, and intelligent cropping is proposed. The dataset consists of 74,436 high-quality mammographic images from 26,500 patients across 7 countries and is, to the best of our knowledge, one of the largest open-source mammography databases. To show the efficacy of training on the large dataset, the performance of the ResNet101 architecture was evaluated on Mammo-Bench, and the results were compared with those obtained by training independently on a few member datasets and an external dataset, VinDr-Mammo. An accuracy of 78.8% (with data augmentation of the minority classes) and 77.8% (without data augmentation) was achieved on the proposed benchmark dataset, compared to the other datasets, for which accuracy varied from 25% to 69%. Notably, improved prediction of the minority classes is observed with the Mammo-Bench dataset. These results establish baseline performance and demonstrate Mammo-Bench's utility as a comprehensive resource for developing and evaluating mammography analysis systems.
Gaurav Bhole, S. Suba, Nita Parekh
MetaEdit: Computational Identification of RNA Editing in Microbiomes
Abstract
RNA editing is a pivotal post-transcriptional mechanism that plays a critical role in the regulation of some genes by altering their mRNA sequences, thereby influencing the resulting protein sequence, structure, and the functional and cellular responses. While extensively studied in eukaryotes, its significance and prevalence in prokaryotic microbiomes remain underexplored. Given the crucial role of microbiomes in various biological processes and their potential impact on human health and disease, understanding RNA editing within these communities could reveal new insights into microbial gene regulation and adaptation. The lack of studies to detect RNA editing in microbiomes motivates the need for developing bioinformatic strategies to bridge this research gap. This study introduces MetaEdit, a computational tool designed to detect RNA editing in bacterial microbiomes. We apply MetaEdit to metatranscriptomic and metagenomic datasets to identify and characterize RNA editing events in the human gut microbiome. Our results demonstrate the presence of RNA editing in Escherichia coli and provide a foundation for future investigations into the functional implications of RNA editing in microbiomes. Our findings are supported by previously reported research but need validation with laboratory experiments. The developed pipeline is generic and can be applied to find RNA editing in any sequencing datasets containing both metagenomic and metatranscriptomic data.
Availability: The pipeline is available from https://biorg.cs.fiu.edu/metaedit/.
Supplementary information: None.
Arpit Mehta, Vitalii Stebliankin, Kalai Mathee, Giri Narasimhan
Drug-Centric Prior Improves Drug Response Modeling in Partially Overlapping Pharmacogenomic Screens
Abstract
With the accumulation of large-scale genomic data such as whole-genome RNA sequencing, copy number, and mutation profiles for tens of thousands of samples, associated with screening thousands of small molecules and other perturbagens, arises the question of how to best leverage partially overlapping datasets generated at different facilities. As research groups across the world continue to generate drug screens of variable size and quality, the need for approaches that can learn from such partially overlapping experiments and improve the signal to noise ratio emerges with increasing importance. We present an application of a Bayesian group factor analysis model, where we employ a drug-centric prior to transfer information about drugs screened in the same samples in multiple datasets. We show that joint models leveraging partially overlapping pharmacogenomic datasets from the Broad and Sanger institutes can overall improve drug signature identification.
Dharani Thirumalaisamy, Sunil K. Joshi, Stephen E. Kurtz, Tania Q. Vu, Jeffrey W. Tyner, Mehmet Gönen, Olga Nikolova
Improving Inter-helical Residue Contact Prediction in α-Helical Transmembrane Proteins Using Structural Neighborhood Crowdedness Information
Abstract
Residue contact maps are a useful compressed representation that can be used as constraints for structural modeling, but can also help identify inter-helical binding sites and are hence effective on their own. In this work, we hypothesize that crowdedness around a target residue pair influences whether it is a contact point. We developed two measures of crowdedness in a residue’s 3D neighborhood: bin counts, defined in terms of relative residue distance; and residue contact number for inter-helical TM proteins, defined as the number of residues within a specified relative distance. Since unsupervised language models such as MSA transformer, trained on millions of sequences, are very accurate but also complementary to our approach, we combined the MSA transformer score with our proposed features to assess the impact of crowdedness on residue contact prediction. We found that crowdedness measures can in fact increase the upper bound performance by at least 7.65% average precision in cross-validation experiments and by at least 11.59% average precision in held-out experiments. Further, we developed a method to “transfer” this information when ground truth crowdedness measures are unavailable. Our approach outperformed MSA transformer by at least 1.15% average precision in cross-validation experiments and 1.85% average precision in held-out experiments.
Aman Sawhney, Li Liao
Explaining Protein Folding Networks Using Integrated Gradients and Attention Mechanisms
Abstract
Protein folding prediction models like AlphaFold and ColabFold have revolutionized structural biology by providing accurate protein structures. However, these models present challenges when it comes to understanding how they arrive at their decisions. In this paper, we propose the application of Explainable AI (XAI) techniques, specifically Integrated Gradients and Attention Mechanisms, to elucidate the decision-making process of these complex networks. We conduct computational experiments to evaluate the effectiveness of these methods and discuss potential implications for the field.
Rukmangadh Sai Myana, Sumit Kumar Jha
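The Integrated Gradients technique named in this abstract can be sketched on a toy function. This is a generic illustration, not the authors' setup: the "model" `f`, its hand-written gradient `grad_f`, the zero baseline, and the step count are all assumptions, and a real folding network would supply gradients via automatic differentiation.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model output at `point`.
    The attribution for feature i is (x_i - baseline_i) times the average
    gradient along the straight path from the baseline to x.
    """
    n = len(x)
    totals = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            totals[i] += grad[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

# Hypothetical toy "model": f(x) = x0^2 + 3 * x1, with its exact gradient.
def f(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad_f(x):
    return [2.0 * x[0], 3.0]

attributions = integrated_gradients(grad_f, x=[2.0, 1.0], baseline=[0.0, 0.0])
```

A useful sanity check is the completeness property: the attributions should sum to the difference between the model output at the input and at the baseline.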
Computationally Reconstructing the Evolution of Cancer Progression Risk
Abstract
Understanding the evolution of cancer in its early stages is critical to identifying key drivers of cancer progression and developing better early diagnostics or prophylactic treatments. Early cancer is difficult to observe, though, since it is generally asymptomatic until extensive genetic damage has accumulated. In this study, we develop a computational approach to infer how once-healthy cells enter into and become committed to a pathway of aggressive cancer. We accomplish this through a strategy of using tumor phylogenetics to look backwards in time to earlier stages of tumor development combined with machine learning to infer how progression risk changes over those stages. We apply this paradigm to point mutation data from a set of cohorts from the Cancer Genome Atlas (TCGA) to formulate models of how progression risk evolves from the earliest stages of tumor growth, as well as how this evolution varies within and between cohorts. The results suggest general mechanisms by which risk develops as a cell population commits to aggressive cancer, but with significant variability between cohorts and individuals. These results imply limits to the potential for earlier diagnosis and intervention while also providing grounds for hope in extending these beyond current practice.
Kefan Cao, Russell Schwartz
Cancer Diseases Classification with Sparse Neural Networks: An Information-Theoretic Approach
Abstract
Machine learning is indispensable for biomedical data modeling and classification. Tasks involving large, high-dimensional datasets are nevertheless computationally intensive and approximation methods are often sought to scale down the volume of raw data or model size without compromising substantial information embedded within the data. However, previous approximation methods have yielded mixed results and have yet to establish a clear framework linking feature selection and model sparsification. In this paper, we present an information-theoretic approach for cancer classification by addressing two prominent questions in data model approximation: how to identify a minimal set of critical features in cancer microarray data and how to design sparse neural networks that are effective and efficient for cancer classification. Our study highlights a key connection between these two challenges. In particular, we introduce a mutual information (MI)-based method to select a highly informative subset of genes from extensive microarray gene expression data. Each selected subset of genes, up to two orders of magnitude smaller than the original gene set, demonstrates superior performance in cancer classification compared to the full dataset. Additionally, the MI-based method enables the design of sparsified neural networks that consistently maintain or even improve classification performance compared to fully connected networks. Our test results reveal that sparsified networks selectively retain connections to the critical genes identified by the MI-based filtering method, effectively ignoring contributions from irrelevant genes.
Zahra Jandaghi, Sixiang Zhang, Xiuzhen Huang, Liming Cai
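A mutual-information filter of the kind described in this abstract might look like the sketch below. It assumes expression values have already been discretized (for example into 0/1 for low/high), which the abstract does not specify; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def top_genes_by_mi(expression, labels, k):
    """Rank genes (columns of a discretized matrix) by MI with the class label."""
    n_genes = len(expression[0])
    scores = [
        (mutual_information([row[g] for row in expression], labels), g)
        for g in range(n_genes)
    ]
    # Highest MI first; ties broken by gene index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:k]]

# Hypothetical toy data: 4 samples x 3 genes, binary class labels.
expression = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
labels = [1, 1, 0, 0]
selected = top_genes_by_mi(expression, labels, k=1)
```

In this toy case gene 0 matches the labels exactly and carries one full bit of information, while the other two genes are independent of the labels.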
Epistatic Density of Viral Variants in Acute and Chronic HCV Patients
Abstract
RNA viruses exhibit high mutation rates due to the lack of proofreading mechanisms during replication, leading to diverse intra-host viral populations. Variants with higher fitness tend to dominate the population due to enhanced transmissibility and immune escape. Fitness of viral variants depends on individual SNVs and epistatic links between pairs of SNVs as well as competition with other viral variants within the population. Recent machine learning methods have successfully predicted emerging COVID-19 variants based on epistatic SNV links, implying that SNV links contribute to fitness of viral variants.
We define the epistatic density of a viral variant as the number of positively linked SNV pairs between mutated positions in its genome. We computed epistatic density of intra-host Hepatitis C Virus (HCV) populations sampled from 85 chronic and 28 acute patients with HCV 1a genotypes. On average, epistatic density was higher in chronic patients than in acute cases. Additionally, the epistatic density distributions are more irregular and choppy in acute populations. Finally, we applied the epistatic density properties to distinguish between intra-host populations of chronic and acute HCV patients.
Alina Nemira, Akshay Juyal, Pavel Skums, Alexander Zelikovsky
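The definition of epistatic density given in this abstract translates directly into a pair count. The sketch below assumes the set of positively linked SNV pairs is already known and passed in; how the links are inferred is not covered here, and the positions used are made up.

```python
from itertools import combinations

def epistatic_density(mutated_positions, linked_pairs):
    """Count positively linked SNV pairs whose positions are both mutated.

    mutated_positions: genome positions mutated in one viral variant.
    linked_pairs: iterable of (position, position) pairs considered
                  positively linked (assumed to be given).
    """
    linked = {frozenset(pair) for pair in linked_pairs}
    return sum(
        1
        for a, b in combinations(sorted(set(mutated_positions)), 2)
        if frozenset((a, b)) in linked
    )

# Hypothetical example: three linked pairs, one variant with three mutations.
links = [(3, 17), (17, 42), (5, 99)]
density = epistatic_density([3, 17, 42], links)
```

Here the pairs (3, 17) and (17, 42) are both present in the variant, while (3, 42) is not linked and (5, 99) is not mutated.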
Applying Genetic Algorithm with Saltations to MAX-3SAT
Abstract
Punctuated equilibrium, the pattern of rapid, significant mutational change, had not been observed in real time until the SARS-CoV-2 viral variants emerged with multiple mutations occurring together. Using epistasis (the circumstance in which the effect of one gene is influenced by the presence of one or more other genes) as a framework to understand this phenomenon, we can capture the relationships between different combinations of mutations in a network, where each node is an individual mutation and each edge represents the interaction between two mutations, allowing us to effectively model the fitness landscape of viral variants. In exploring these relationships, it has been found that dense subgraphs within the network correspond to emerging saltations. We refer to such a saltation as an evolutionary jump and incorporate it into a genetic algorithm (GA + EJ), which can uncover high-fitness regions seemingly distant from the variant(s) from which they originally derived. We applied it to the MAX-3SAT problem and found improvement for satisfiable problem instances with 600 variables and 2550 clauses, as well as 100 variables and 429 clauses.
Ryan Alomair, Hafsa Farooq, Daniel Novikov, Akshay Juyal, Alexander Zelikovsky
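Any genetic algorithm for MAX-3SAT needs a fitness function, and the natural choice is the number of satisfied clauses. The sketch below shows only that objective, using a DIMACS-style signed-integer encoding as an assumption; the evolutionary jump operator itself is not reproduced.

```python
def satisfied_clauses(assignment, clauses):
    """Count clauses with at least one true literal.

    assignment: dict mapping variable index (1-based) to bool.
    clauses: list of 3-tuples of nonzero ints; literal v means variable v,
             and -v means its negation (DIMACS-style convention).
    """
    def literal_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value

    return sum(1 for clause in clauses if any(literal_true(l) for l in clause))

# Hypothetical toy instance with 3 variables and 3 clauses.
clauses = [(1, -2, 3), (-1, 2, -3), (1, 2, 3)]
fitness_a = satisfied_clauses({1: True, 2: False, 3: False}, clauses)
fitness_b = satisfied_clauses({1: False, 2: False, 3: False}, clauses)
```

The first assignment satisfies every clause, while the all-false assignment fails the last clause, so a search procedure would prefer the first.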
Computing Gram Matrix for SMILES Strings Using RDKFingerprint and Sinkhorn-Knopp Algorithm
Abstract
SMILES (Simplified Molecular Input Line Entry System) strings are widely used to represent molecular structures in cheminformatics and drug discovery. However, effectively transforming these string-based representations into meaningful numerical features for machine learning remains a significant challenge due to the complex, non-Euclidean nature of molecular structures. Traditional fingerprint-based and deep learning approaches often struggle with scalability, interpretability, or computational efficiency. Our approach leverages the Morgan Fingerprint to generate molecular feature representations, followed by a pairwise kernel function to compute a structured similarity matrix. We then refine this matrix using the Sinkhorn-Knopp algorithm, ensuring it satisfies probabilistic constraints. To reduce dimensionality, we apply Kernel Principal Component Analysis (PCA), producing compact embeddings suitable for downstream machine learning tasks. We conduct a comprehensive empirical evaluation of the proposed method which is assessed for drug subcategory prediction (classification task) and solubility AlogPS “aqueous solubility and octanol/water partition coefficient” (regression task) using the benchmark SMILES string dataset. The outcomes show the proposed method outperforms baseline methods in supervised analysis and has potential uses in molecular design and drug discovery. By integrating kernel-based learning with probabilistic refinement, our method offers a promising alternative to existing cheminformatics techniques.
Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson
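The Sinkhorn-Knopp refinement step mentioned in this abstract can be sketched as alternating row and column normalization. This is a generic version of the algorithm, not the authors' pipeline: the 3 x 3 similarity matrix stands in for a kernel matrix computed from fingerprints, and the iteration count is an arbitrary assumption.

```python
def sinkhorn_knopp(matrix, iterations=200):
    """Alternately normalize rows and columns of a positive square matrix.

    For a matrix with strictly positive entries, this converges toward a
    doubly stochastic matrix (all row and column sums equal to 1).
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Scale each row to sum to 1.
        for row in m:
            s = sum(row)
            for j in range(n):
                row[j] /= s
        # Scale each column to sum to 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

# Hypothetical pairwise similarity (kernel) matrix for three molecules.
kernel = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
balanced = sinkhorn_knopp(kernel)
```

After balancing, each row can be read as a probability distribution over the other molecules, which is the kind of probabilistic constraint the abstract refers to.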
Enhancing Privacy Preservation and Reducing Analysis Time with Federated Transfer Learning in Digital Twins-Based Computed Tomography Scan Analysis
Abstract
The application of Digital Twin (DT) technology and Federated Learning (FL) has great potential to change the field of biomedical image analysis, particularly for Computed Tomography (CT) scans. This paper presents Federated Transfer Learning (FTL) as a new Digital Twin-based CT scan analysis paradigm. FTL uses pre-trained models and knowledge transfer between peer nodes to solve problems such as data privacy, limited computing resources, and data heterogeneity. The proposed framework allows real-time collaboration between cloud servers and Digital Twin-enabled CT scanners while protecting patient identity.
We apply the FTL method to a heterogeneous CT scan dataset and assess model performance using convergence time, model accuracy, precision, recall, F1 score, and confusion matrix. It has been shown to perform better than conventional FL and Clustered Federated Learning (CFL) methods with better precision, accuracy, recall, and F1-score. The technique is beneficial in settings where the data is not independently and identically distributed (non-IID), and it offers reliable, efficient, and secure solutions for medical diagnosis. These findings highlight the possibility of using FTL to improve decision-making in digital twin-based CT scan analysis, secure and efficient medical image analysis, promote privacy, and open new possibilities for applying precision medicine and smart healthcare systems.
Avais Jan, Qasim Zia, Murray Patterson
Improved Graph-Based Antibody-Aware Epitope Prediction with Protein Language Model-Based Embeddings
Abstract
The accurate identification of B-cell epitopes is critical in antibody design, diagnostics, and immunotherapies. Many in silico approaches have recently been proposed to predict epitopes, but these approaches struggle primarily because of the variational and conformational nature of epitopes. However, deep learning-based approaches have recently shown great promise in achieving better performance at the epitope prediction task. In this paper, we employ a graph convolutional network (GCN) coupled with pre-trained protein language model (PLM)-based embeddings for epitope prediction on a benchmark antibody-specific epitope prediction (AsEP) dataset. We explore the use of different PLM-embedding methods on the epitope prediction task and show that the choice of PLM embeddings impacts the performance. Specifically, we find that antibody-specific PLMs such as AntiBERTy and general PLMs such as ProtTrans and ESM-2 for antigens provide improved epitope prediction performance with an AUCROC of 0.65, precision of 0.28, and recall of 0.46. The source code is available at: https://​github.​com/​mansoor181/​walle-pp.​git
Mansoor Ahmed, Sarwan Ali, Avais Jan, Imdad Ullah Khan, Murray Patterson
Leveraging RNA LLMs for 3D Structure Prediction via Data Augmentation
Abstract
Ribonucleic acid (RNA) is a complex macromolecule essential for living organisms to function in cells. Understanding its three-dimensional (3D) structure is critical for elucidating its cellular roles. However, computational prediction of RNA 3D structures remains a significant challenge due to the vast conformational space that RNA molecules can adopt. Although machine learning, particularly deep learning-based methods, has recently gained traction, the lack of a large dataset of native RNA structures for training has kept these methods from achieving the desired performance. In this study, we leverage pre-trained RNA large language models to predict RNA 3D conformations directly from input RNA sequences. Specifically, we introduce data augmentation techniques to address the scarcity of RNA 3D structure data. The present paper focuses on predicting backbone conformations to evaluate the effectiveness of our method. Preliminary results demonstrate promising accuracy, with predicted structures achieving an average RMSD of 3.85 Å against native 3D structures in the PDB, a 50% reduction in error compared to predictions made without the data augmentation method.
Sixiang Zhang, Harish Anand, Liming Cai
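The RMSD figure quoted in the abstract above is a standard distance between two sets of atomic coordinates. As a minimal sketch only, assuming the two structures have the same number of atoms and are already superimposed (the paper's actual evaluation pipeline, including any alignment step, is not described here), it can be computed as follows. The function name `rmsd` and the coordinate format are illustrative assumptions.

```python
import math


def rmsd(pred, native):
    """Root-mean-square deviation between two coordinate lists.

    Assumption: both inputs are equally long sequences of (x, y, z)
    tuples that are already optimally superimposed; no alignment
    (e.g. Kabsch) is performed in this sketch.
    """
    if len(pred) != len(native) or not pred:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared Euclidean distances over corresponding atoms.
    squared = sum(
        (p - q) ** 2
        for atom_pred, atom_native in zip(pred, native)
        for p, q in zip(atom_pred, atom_native)
    )
    return math.sqrt(squared / len(pred))
```

Coordinates given in ångströms yield an RMSD in ångströms; lower values mean the predicted backbone lies closer to the native one.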
EfficientNet in Digital Twin-Based Cardiac Arrest Prediction and Analysis
Abstract
Cardiac arrest is one of the biggest global health problems, and early identification and management are key to improving patient prognosis. In this paper, we propose a novel framework that combines an EfficientNet-based deep learning model with a digital twin system to improve the early detection and analysis of cardiac arrest. We use EfficientNet with compound scaling to learn the features of cardiovascular images. In parallel, the digital twin creates a realistic, individualized model of the patient's cardiovascular system based on data received from Internet of Things (IoT) devices attached to the patient, which supports continuous assessment of the patient and of the impact of possible treatment plans. Our experiments show that the proposed system is both highly accurate in its predictions and efficient. Combining deep learning with digital twin (DT) technology opens the possibility of an active, individualized approach to predicting cardiac disease.
Qasim Zia, Avais Jan, Zafar Iqbal, Muhammad Mumtaz Ali, Mukarram Ali, Murray Patterson
AmpliconHunter: A Scalable Tool for PCR Amplicon Prediction from Microbiome Samples
Abstract
Sequencing of PCR amplicons generated using degenerate primers (typically targeting a region of the 16S ribosomal gene) is widely used in metagenomics to profile the taxonomic composition of complex microbial samples. To reduce taxonomic biases in primer selection, it is important to conduct in silico PCR analyses of the primers against large collections of up to millions of bacterial genomes. However, existing in silico PCR tools have impractical running times for analyses of this scale. In this paper, we introduce AmpliconHunter, a highly scalable in silico PCR package distributed as an open-source command-line tool and publicly available through a user-friendly web interface at https://ah1.engr.uconn.edu/. AmpliconHunter implements an accurate nearest-neighbor model for melting temperature calculations, allowing for primer-template hybridization with mismatches, along with three complementary methods for estimating off-target amplification. By taking advantage of multi-core parallelism and SIMD operations available on modern CPUs, the AmpliconHunter web server can complete in silico PCR analyses of commonly used degenerate primer pairs against the 2.4M genomes in the latest AllTheBacteria collection in as few as 6–7 h.
Rye Howard-Stone, Ion I. Măndoiu
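To illustrate what in silico PCR with degenerate primers means in the abstract above, here is a deliberately naive sketch. It is not AmpliconHunter's algorithm: it scans a single strand by brute force, counts mismatches instead of modeling melting temperature, and uses no parallelism or SIMD. All function names and the `max_len` cutoff are assumptions made for illustration; only the IUPAC degenerate-base codes are standard.

```python
# IUPAC nucleotide codes: each degenerate symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also maps degenerate codes to their complements.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_sites(template, primer, max_mismatches=0):
    """Start positions where the primer matches the template strand."""
    sites = []
    for i in range(len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        mismatches = sum(
            base not in IUPAC[code] for base, code in zip(window, primer)
        )
        if mismatches <= max_mismatches:
            sites.append(i)
    return sites


def find_amplicons(template, fwd, rev, max_mismatches=0, max_len=2000):
    """Predicted amplicons on the given strand, primers included."""
    # The reverse primer binds the opposite strand, so on this strand
    # we look for its reverse complement downstream of the forward site.
    rev_rc = reverse_complement(rev)
    starts = primer_sites(template, fwd, max_mismatches)
    ends = [j + len(rev_rc)
            for j in primer_sites(template, rev_rc, max_mismatches)]
    min_len = len(fwd) + len(rev)
    return [template[s:e] for s in starts for e in ends
            if min_len <= e - s <= max_len]
```

A call such as `find_amplicons(genome, "ACGTR", "CATGG")` returns every stretch of `genome` that starts with a match to the forward primer and ends with the reverse complement of the reverse primer. The quadratic scan shown here is exactly the kind of cost that becomes impractical at the scale of millions of genomes.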
Neuromorphic Spiking Neural Network Based Classification of COVID-19 Spike Sequences
Abstract
The volume of available SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) data has grown exponentially since the COVID-19 pandemic, opening the door to research on the virus's behavior. Researchers have conducted various studies, such as genomic surveillance, to gain a deeper understanding of the virus so that efficient prevention mechanisms can be developed. However, the unstable nature of the virus (rapid mutations, multiple hosts, etc.) creates challenges in designing analytical systems for it. Therefore, we propose a neural network (NN)-based mechanism to perform an efficient analysis of SARS-CoV-2 data, as NNs generalize well once trained. Moreover, rather than using the full-length genome of the virus, we apply our method to its spike region, as this region is known to carry most of the mutations and is used to attach to the host cell membrane. In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed-length numerical representation and then uses a neuromorphic spiking neural network to classify those sequences. We compare the performance of our method with various baselines using real-world SARS-CoV-2 spike sequence data and show that our method achieves higher predictive accuracy than recent baselines.
Taslim Murad, Prakash Chourasia, Sarwan Ali, Avais Jan, Murray Patterson
Backmatter
Title
Computational Advances in Bio and Medical Sciences
Edited by
Mohammed Alser
Mukul S. Bansal
Yury Khudyakov
Serghei Mangul
Ion I. Mandoiu
Marmar R. Moussa
Murray Patterson
Sanguthevar Rajasekaran
Pavel Skums
Shibu Yooseph
Alexander Zelikovsky
Copyright Year
2026
Electronic ISBN
978-3-032-02489-3
Print ISBN
978-3-032-02488-6
DOI
https://doi.org/10.1007/978-3-032-02489-3
