
2025 | Book

Foundation Models for General Medical AI

Second International Workshop, MedAGI 2024, Held in Conjunction with MICCAI 2024, Marrakesh, Morocco, October 6, 2024, Proceedings

Editors: Zhongying Deng, Yiqing Shen, Hyunwoo J. Kim, Won-Ki Jeong, Angelica I. Aviles-Rivero, Junjun He, Shaoting Zhang

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the Second International Workshop on Foundation Models for General Medical AI, MedAGI 2024, held in conjunction with the 27th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2024, in Marrakesh, Morocco, in October 2024.

The 17 papers included in this book were carefully reviewed and selected from 26 submissions. They provide insights into the current landscape of medical AI and foundation models, which can pave the way for the evolution of task-specific medical AI systems into more generalized frameworks capable of tackling a diverse range of tasks, datasets, and domains.

Table of Contents

Frontmatter
FastSAM-3DSlicer: A 3D-Slicer Extension for 3D Volumetric Segment Anything Model with Uncertainty Quantification
Abstract
Accurate segmentation of anatomical structures and pathological regions in medical images is crucial for diagnosis, treatment planning, and disease monitoring. While the Segment Anything Model (SAM) and its variants have demonstrated impressive interactive segmentation capabilities on image types not seen during training without the need for domain adaptation or retraining, their practical application in volumetric 3D medical imaging workflows has been hindered by the lack of a user-friendly interface. To address this challenge, we introduce FastSAM-3DSlicer, a 3D Slicer extension that integrates both 2D and 3D SAM models, including SAM-Med2D, MedSAM, SAM-Med3D, and FastSAM-3D. Building on the well-established open-source 3D Slicer platform, our extension enables efficient, real-time segmentation of 3D volumetric medical images, with seamless interaction and visualization. By automating the handling of raw image data, user prompts, and segmented masks, FastSAM-3DSlicer provides a streamlined, user-friendly interface that can be easily incorporated into medical image analysis workflows. Performance evaluations reveal that the FastSAM-3DSlicer extension running FastSAM-3D achieves low inference times of only 1.09 s per volume on CPU and 0.73 s per volume on GPU, making it well-suited for real-time interactive segmentation. Moreover, we introduce an uncertainty quantification scheme that leverages the rapid inference capabilities of FastSAM-3D for practical implementation, further enhancing its reliability and applicability in medical settings. FastSAM-3DSlicer thus offers an interactive platform and user interface for 2D and 3D volumetric medical image segmentation, combining efficiency, precision, and ease of use with SAMs. The source code and a video demonstration are publicly available at https://github.com/arcadelab/FastSAM3D_slicer.
Yiqing Shen, Xinyuan Shao, Blanca Inigo Romillo, David Dreizin, Mathias Unberath
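
The abstract above does not detail the uncertainty quantification scheme; one generic pattern that fast inference enables is to rerun the segmenter under slightly perturbed prompts and map voxel-wise disagreement. The Python/NumPy sketch below illustrates only that generic idea; segment_fn, the prompt jitter, and the entropy map are hypothetical stand-ins, not the paper's actual scheme.

    import numpy as np

    def voxelwise_uncertainty(volume, point, segment_fn, n_runs=8, jitter=2, seed=0):
        """Estimate voxel-wise uncertainty by rerunning a fast promptable
        segmenter with perturbed point prompts and taking the entropy of the
        mean foreground frequency. segment_fn(volume, point) -> binary mask
        (same shape as volume); it stands in for a FastSAM-3D-style model."""
        rng = np.random.default_rng(seed)
        masks = []
        for _ in range(n_runs):
            jittered = np.asarray(point) + rng.integers(-jitter, jitter + 1, size=3)
            masks.append(segment_fn(volume, jittered).astype(np.float32))
        p = np.clip(np.mean(masks, axis=0), 1e-6, 1 - 1e-6)   # foreground frequency
        entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # binary entropy map
        return p, entropy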
The Importance of Downstream Networks in Digital Pathology Foundation Models
Abstract
Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSI). In this process, WSIs are first divided into patches, for which a feature extractor model is applied to obtain feature vectors, which are subsequently processed by an aggregation model to predict the respective WSI label. With the rapid evolution of representation learning, numerous new feature extractor models, often termed foundational models, have emerged. Traditional evaluation methods rely on a static downstream aggregation model setup, encompassing a fixed architecture and hyperparameters, a practice we identify as potentially biasing the results. Our study uncovers a sensitivity of feature extractor models towards aggregation model configurations, indicating that performance comparability can be skewed based on the chosen configurations. By accounting for this sensitivity, we find that the performance of many current feature extractor models is notably similar. We support this insight by evaluating seven feature extractor models across three different datasets with 162 different aggregation model configurations. This comprehensive approach provides a more nuanced understanding of the feature extractors’ sensitivity to various aggregation model configurations, leading to a fairer and more accurate assessment of new foundation models in digital pathology.
Gustav Bredell, Marcel Fischer, Przemyslaw Szostak, Samaneh Abbasi-Sureshjani, Alvaro Gomariz
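
For context on the aggregation stage discussed above, here is a minimal attention-based multiple-instance-learning aggregator in PyTorch, one of the many possible downstream configurations such a study sweeps over; the feature dimension, hidden size, and attention design are illustrative assumptions, not the paper's specific setup.

    import torch
    import torch.nn as nn

    class AttentionMILAggregator(nn.Module):
        """Aggregate per-patch feature vectors from a frozen foundation-model
        feature extractor into a single slide-level prediction."""

        def __init__(self, feat_dim=768, hidden_dim=128, n_classes=2):
            super().__init__()
            self.attn = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
                nn.Linear(hidden_dim, 1),
            )
            self.classifier = nn.Linear(feat_dim, n_classes)

        def forward(self, patch_feats):                              # (n_patches, feat_dim)
            weights = torch.softmax(self.attn(patch_feats), dim=0)   # (n_patches, 1)
            slide_feat = (weights * patch_feats).sum(dim=0)          # (feat_dim,)
            return self.classifier(slide_feat)                       # (n_classes,)

    # Usage: logits = AttentionMILAggregator()(torch.randn(1200, 768))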
Temporal-Spatial Adaptation of Promptable SAM Enhance Accuracy and Generalizability of Cine CMR Segmentation
Abstract
Accurate myocardium segmentation across all phases of one cardiac cycle in cine cardiac magnetic resonance (CMR) scans is crucial for comprehensive cardiac function analysis. Despite advancements in deep learning (DL) for automatic cine CMR segmentation, generalizability to unseen data remains a significant challenge. Recently, the Segment Anything Model (SAM) has been introduced as a segmentation foundation model, known for its accurate segmentation and, more importantly, zero-shot generalization. SAM was trained on two-dimensional (2D) natural images; to adapt it for comprehensive cine CMR segmentation, we propose cineCMR-SAM, which incorporates both temporal and spatial information through a modified model architecture. Compared to other state-of-the-art (SOTA) methods, our model achieved superior data-specific segmentation accuracy on the STACOM2011 dataset when fine-tuned on it and demonstrated superior zero-shot generalization on two other large public datasets (ACDC and M&Ms) unseen during fine-tuning. Additionally, we introduced a text prompt feature in cineCMR-SAM to specify the view type of input slices (short-axis or long-axis), enhancing performance across all view types. The GitHub repository is https://github.com/zhennongchen/cineCMR-SAM.git.
Zhennong Chen, Sekeun Kim, Hui Ren, Quanzheng Li, Xiang Li
Navigating Data Scarcity Using Foundation Models: A Benchmark of Few-Shot and Zero-Shot Learning Approaches in Medical Imaging
Abstract
Data scarcity is a major limiting factor for applying modern machine learning techniques to clinical tasks. Although sufficient data exists for some well-studied medical tasks, there remains a long tail of clinically relevant tasks with poor data availability. Recently, numerous foundation models have demonstrated high suitability for few-shot learning (FSL) and zero-shot learning (ZSL), potentially making them more accessible to practitioners. However, it remains unclear which foundation model performs best on FSL medical image analysis tasks and what the optimal methods are for learning from limited data. We conducted a comprehensive benchmark study of ZSL and FSL using 16 pretrained foundation models on 19 diverse medical imaging datasets. Our results indicate that BiomedCLIP, a model pretrained exclusively on medical data, performs best on average for very small training set sizes, while very large CLIP models pretrained on LAION-2B perform best with slightly more training samples. However, simply fine-tuning a ResNet-18 pretrained on ImageNet performs similarly with more than five training examples per class. Our findings also highlight the need for further research on foundation models specifically tailored for medical applications and the collection of more datasets to train these models.
Stefano Woerner, Christian F. Baumgartner
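
To make the zero-shot setting concrete, the sketch below runs CLIP-style zero-shot classification with the open_clip library. The ViT-B-32/LAION-2B tag, the prompt wording, the class names, and the image path are placeholders and assumptions, not the benchmark's exact configuration (which also covers BiomedCLIP and many other models).

    import torch
    import open_clip
    from PIL import Image

    # Assumed model/tag; the study benchmarks many more pretrained models.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")
    model.eval()

    prompts = ["an X-ray image of a healthy chest",
               "an X-ray image of a chest with pneumonia"]   # illustrative classes
    text = tokenizer(prompts)
    image = preprocess(Image.open("example.png")).unsqueeze(0)  # placeholder image

    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)
    print(dict(zip(prompts, probs[0].tolist())))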
AutoEncoder-Based Feature Transformation with Multiple Foundation Models in Computational Pathology
Abstract
The performance of deep learning models is highly dataset-dependent. Models pretrained on large-scale datasets have significant advantages in capturing general patterns by leveraging large volumes of data. Some of these models, which are adaptable to a wide range of downstream tasks, are referred to as foundation models. Recently, several foundation models have been published in the field of computational pathology, recognized for their potential to advance deep learning applications in several downstream tasks. To make use of multiple foundation models, each with its own strengths, it is crucial to summarize or ensemble their complementary advantages effectively. In this paper, we propose a feature transformation method for the effective utilization of features from multiple foundation models using an autoencoder-based architecture. This method facilitates the extraction of integrated features from multiple foundation models, enabling more generalized training. We demonstrated that the proposed approach resulted in more robust representations for out-of-distribution datasets in our patch-level classification tasks.
Woojin Chung, Yujun Park, Yonnho Nam
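
A minimal PyTorch sketch of the general idea, assuming patch features have already been extracted with several foundation models: concatenate them and train an autoencoder whose bottleneck acts as the integrated representation. The dimensions and layer sizes are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class FeatureFusionAE(nn.Module):
        """Compress concatenated features from multiple foundation models
        into a shared latent code (the fused representation)."""

        def __init__(self, in_dims=(768, 1024, 1536), latent_dim=256):
            super().__init__()
            total = sum(in_dims)
            self.encoder = nn.Sequential(nn.Linear(total, 512), nn.ReLU(),
                                         nn.Linear(512, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                         nn.Linear(512, total))

        def forward(self, feats):                  # list of (batch, dim_i) tensors
            x = torch.cat(feats, dim=-1)
            z = self.encoder(x)                    # fused representation
            return z, self.decoder(z)              # latent code and reconstruction

    # Training sketch: minimize nn.MSELoss()(recon, torch.cat(feats, -1)),
    # then feed z to the downstream patch-level classifier.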
OSATTA: One-Shot Automatic Test Time Augmentation for Domain Adaptation
Abstract
Foundation models (FMs) are reshaping the research paradigm by providing ready-to-use solutions to many challenging tasks, such as image classification, registration, or segmentation. Yet, their performance on new dataset cohorts drops significantly, particularly due to domain gaps between the training (source) and testing (target) data. Recently, test-time augmentation strategies have aimed at finding target-to-source mappings (t2sm), which improve the performance of the FM on the target dataset by leveraging the FM weights, thus assuming access to them. While this assumption holds for open research models, it does not for commercial ones (e.g., ChatGPT). These are provided as black boxes; thus, the training data and the model weights are unavailable. In our work, we propose a new generic few-shot method that enables the computation of a target-to-source mapping by using only the black-box model's outputs. We start by defining a parametric family of functions for the t2sm. Using a simple loss function, we optimize the t2sm parameters based on a single labeled image volume. This effectively provides a mapping between the source domain and the target domain. In our experiments, we investigate how to improve the segmentation performance of a given FM (a UNet), and we surpass state-of-the-art accuracy in the 1-shot setting, with further improvement in a few-shot setting. Our approach is agnostic to the model architecture, as the FM is treated as a black box, which significantly increases our method's practical utility in real-world scenarios. The code is available for reproducibility purposes at https://osatta.gitlabpages.inria.fr/MedAGI.
Felix Küper, Sergi Pujades
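
A minimal sketch of the black-box setting described above: parameterize a simple intensity target-to-source mapping and fit its parameters on a single labeled target volume with a derivative-free optimizer, using only the model's output masks. The scale/shift/gamma mapping, the predict_fn interface, and the Dice objective are assumptions for illustration, not the authors' t2sm family or loss.

    import numpy as np
    from scipy.optimize import minimize

    def dice(a, b, eps=1e-6):
        a, b = a.astype(bool), b.astype(bool)
        return (2 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

    def fit_t2sm(target_vol, target_label, predict_fn):
        """predict_fn(volume) -> binary mask; stands in for a black-box
        foundation model whose weights are not accessible."""

        def apply_t2sm(vol, params):
            scale, shift, gamma = params
            v = np.clip(scale * vol + shift, 0, None)
            return v ** gamma                       # simple parametric intensity map

        def loss(params):
            pred = predict_fn(apply_t2sm(target_vol, params))
            return 1.0 - dice(pred, target_label)   # 1-shot Dice objective

        # Derivative-free optimization, since no gradients flow through the black box.
        res = minimize(loss, x0=np.array([1.0, 0.0, 1.0]), method="Nelder-Mead")
        return res.x                                # fitted mapping parameters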
Automating MedSAM by Learning Prompts with Weak Few-Shot Supervision
Abstract
Foundation models such as the recently introduced Segment Anything Model (SAM) have achieved remarkable results in image segmentation tasks. However, these models typically require user interaction through handcrafted prompts such as bounding boxes, which limits their deployment in downstream tasks. Adapting these models to a specific task with fully labeled data also demands expensive prior user interaction to obtain ground-truth annotations. This work proposes to replace conditioning on input prompts with a lightweight module that directly learns a prompt embedding from the image embedding, both of which are subsequently used by the foundation model to output a segmentation mask. Our foundation models with learnable prompts can automatically segment any specific region by 1) modifying the input through a prompt embedding predicted by a simple module, and 2) using weak labels (tight bounding boxes) and few-shot supervision (10 samples). Our approach is validated on MedSAM, a version of SAM fine-tuned for medical images, with results on three medical datasets in MR and ultrasound imaging. Our code is available at https://github.com/Minimel/MedSAMWeakFewShotPromptAutomation.
Mélanie Gaillochet, Christian Desrosiers, Hervé Lombaert
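
A minimal PyTorch sketch of the core idea: a lightweight module predicts a prompt embedding directly from the image embedding, and this embedding replaces the handcrafted prompt fed to the frozen mask decoder. Tensor shapes follow SAM-style embeddings (256-d over a 64x64 grid), but the module design is an illustrative assumption, not the paper's architecture.

    import torch
    import torch.nn as nn

    class PromptPredictor(nn.Module):
        """Predict a small set of sparse prompt tokens from the image embedding,
        replacing user-supplied points or boxes."""

        def __init__(self, embed_dim=256, n_tokens=2):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                nn.Linear(embed_dim, n_tokens * embed_dim),
            )
            self.n_tokens, self.embed_dim = n_tokens, embed_dim

        def forward(self, image_embedding):      # (B, 256, 64, 64) from the image encoder
            pooled = self.pool(image_embedding).flatten(1)            # (B, 256)
            tokens = self.mlp(pooled).view(-1, self.n_tokens, self.embed_dim)
            return tokens  # fed to the frozen mask decoder as sparse prompt embeddings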
SAT-Morph: Unsupervised Deformable Medical Image Registration Using Vision Foundation Models with Anatomically Aware Text Prompt
Abstract
Current unsupervised deformable medical image registration methods rely on image similarity measures. However, these methods are inherently limited by the difficulty of integrating important anatomical knowledge into registration. The development of vision foundation models (e.g., the Segment Anything Model (SAM)) has attracted attention for their excellent image segmentation capabilities. Medical SAM variants align medical text knowledge with visual knowledge, enabling precise segmentation of organs. In this study, we propose a novel approach that leverages a vision foundation model to enhance medical image registration by integrating its anatomical understanding into the registration model. Specifically, we propose a novel unsupervised deformable medical image registration framework, called SAT-Morph, which includes a Segment Anything with Text prompt (SAT) module and a mask registration module. In the SAT module, the medical vision foundation model is used to segment anatomical regions within both moving and fixed images according to our designed text prompts. In the mask registration module, we take these segmentation results, instead of the traditionally used image pairs, as the input of the registration model. Compared with using image pairs as input, using segmentation mask pairs incorporates anatomical knowledge and improves registration performance. Experiments demonstrate that SAT-Morph significantly outperforms existing state-of-the-art methods on both the Abdomen CT and ACDC cardiac MRI datasets. These results illustrate the effectiveness of integrating vision foundation models into medical image registration, pointing the way toward more accurate and anatomically aware registration. Our code is available at https://github.com/HaoXu0507/SAT-Morph/.
Hao Xu, Tengfei Xue, Dongnan Liu, Fan Zhang, Carl-Fredrik Westin, Ron Kikinis, Lauren J. O’Donnell, Weidong Cai
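
A toy 2D sketch of the mask-registration idea: warp the moving segmentation mask with a dense displacement field and supervise with a soft Dice loss against the fixed mask, rather than an image-intensity similarity. SAT-Morph itself operates on 3D volumes with SAT-generated masks; the warp and loss below are generic illustrations, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def warp2d(mask, flow):
        """Warp a (B, C, H, W) mask with a displacement field (B, 2, H, W)
        in pixels (channel 0 = dx, channel 1 = dy) using bilinear sampling."""
        B, _, H, W = mask.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        grid = torch.stack((xs, ys), dim=0).float().to(mask.device)   # (2, H, W)
        new = grid.unsqueeze(0) + flow                                # absolute coords
        new_x = 2 * new[:, 0] / (W - 1) - 1                           # normalize to [-1, 1]
        new_y = 2 * new[:, 1] / (H - 1) - 1
        sample_grid = torch.stack((new_x, new_y), dim=-1)             # (B, H, W, 2)
        return F.grid_sample(mask, sample_grid, align_corners=True)

    def soft_dice_loss(warped_moving_mask, fixed_mask, eps=1e-6):
        inter = (warped_moving_mask * fixed_mask).sum()
        return 1 - (2 * inter + eps) / (warped_moving_mask.sum() + fixed_mask.sum() + eps)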
Promptable Counterfactual Diffusion Model for Unified Brain Tumor Segmentation and Generation with MRIs
Abstract
Brain tumor analysis in Magnetic Resonance Imaging (MRI) is crucial for accurate diagnosis and treatment planning. However, the task remains challenging due to the complexity and variability of tumor appearances, as well as the scarcity of labeled data. Traditional approaches often address tumor segmentation and image generation separately, limiting their effectiveness in capturing the intricate relationships between healthy and pathological tissue structures. We introduce a novel promptable counterfactual diffusion model as a unified solution for brain tumor segmentation and generation in MRI. The key innovation lies in our mask-level prompting mechanism at the sampling stage, which enables guided generation and manipulation of specific healthy or unhealthy regions in MRI images. Specifically, the model's architecture allows for bidirectional inference, which can segment tumors in existing images and generate realistic tumor structures in healthy brain scans. Furthermore, we present a two-step approach for tumor generation and position transfer, showcasing the model's versatility in synthesizing realistic tumor structures. Experiments on the BRATS2021 dataset demonstrate that our method outperforms traditional counterfactual diffusion approaches [17], achieving a mean IoU of 0.653 and a mean Dice score of 0.785 for tumor segmentation, compared to 0.344 and 0.475 for the conventional counterfactual diffusion model. Our work contributes to improving brain tumor detection and segmentation accuracy, with potential implications for data augmentation and clinical decision support in neuro-oncology. The code is available at https://github.com/arcadelab/counterfactual_diffusion.
Yiqing Shen, Guannan He, Mathias Unberath
D-Rax: Domain-Specific Radiologic Assistant Leveraging Multi-modal Data and eXpert Model Predictions
Abstract
Large vision-language models (VLMs) have progressed remarkably from research to applicability in general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis, which currently hinders the clinical adaptability of VLMs. To create precise, user-friendly models for healthcare, we propose D-Rax, a domain-specific, conversational radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnoses. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated, enhanced instruction-following data, comprising images, instructions, and disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answering (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvements in responses when evaluated for both open- and closed-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.
Hareem Nisar, Syed Muhammad Anwar, Zhifan Jiang, Abhijeet Parida, Ramon Sanchez-Jacob, Vishwesh Nath, Holger R. Roth, Marius George Linguraru
Optimal Prompting in SAM for Few-Shot and Weakly Supervised Medical Image Segmentation
Abstract
Recent advancements in medical image segmentation have been driven by deep learning's capability to extract rich features from extensive datasets. However, these improvements rely heavily on large annotated datasets, which pose significant challenges in the resource-intensive medical field. Foundation models, such as Meta's Segment Anything Model (SAM), have been developed to address these challenges. SAM has demonstrated exceptional zero-shot performance, often rivaling or surpassing fully supervised models across various tasks. Nonetheless, SAM cannot be directly applied to medical image segmentation due to domain shift, making it necessary to fine-tune the model using prompts. Reducing the annotation workload is crucial to alleviate the burden and constraints associated with extensive data annotation in the medical field. This study investigates prompt-guided strategies in SAM for medical image segmentation under few-shot and weakly supervised scenarios. We assess various strategies (bounding boxes, positive points, negative points, and their combinations) using two publicly available datasets. Optimal results are achieved using positive-negative points, demonstrating that the SAM model can perform comparably to established methods in hepatic vascular and prostate cancer segmentation, even with minimal examples. This research aims to advance medical image segmentation by decreasing reliance on extensive annotated data, providing insights into effective prompt utilization, and showcasing SAM's adaptability in specialized medical contexts.
Lara Siblini, Gustavo Andrade-Miranda, Kamilia Taguelmimt, Dimitris Visvikis, Julien Bert
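
For reference, the point and box prompting strategies compared above look like this with Meta's segment_anything package; the checkpoint path, image, and prompt coordinates are placeholders, and the few-shot adaptation studied in the paper is omitted.

    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # placeholder path
    predictor = SamPredictor(sam)

    image_rgb = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder: a 2D slice as RGB
    predictor.set_image(image_rgb)

    # Positive-negative point prompting (labels: 1 = foreground, 0 = background)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[120, 150], [40, 60]]),
        point_labels=np.array([1, 0]),
        multimask_output=False,
    )

    # Bounding-box prompting (XYXY format)
    masks_box, _, _ = predictor.predict(box=np.array([80, 100, 200, 220]))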
UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation
Abstract
Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping the base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing the state of the art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at https://github.com/chauncey-tow/MRG-CLIP.
Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
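
A minimal sketch of the adapter pattern referenced above: small bottleneck modules with residual connections added to a frozen backbone, so only the adapter (and task-head) parameters are trained. This is the generic recipe; UniCrossAdapter's exact placement across modalities and their interaction is specific to the paper.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

        def __init__(self, dim=512, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()
            nn.init.zeros_(self.up.weight)   # start close to an identity mapping
            nn.init.zeros_(self.up.bias)

        def forward(self, x):
            return x + self.up(self.act(self.down(x)))

    # Freeze the pre-trained backbone; train only adapters and the task head:
    # for p in clip_model.parameters(): p.requires_grad = False
    # for p in adapter.parameters():    p.requires_grad = True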
TUMSyn: A Text-Guided Generalist Model for Customized Multimodal MR Image Synthesis
Abstract
Multimodal magnetic resonance (MR) imaging has revolutionized our understanding of the human brain. However, various limitations in clinical scanning hinder the data acquisition process. Current medical image synthesis techniques, often designed for specific tasks or modalities, exhibit diminished performance when confronted with heterogeneous-source MRI data. Here we introduce a Text-guided Universal MR image Synthesis (TUMSyn) generalist model to generate text-specified multimodal brain MRI sequences from any real-acquired sequences. By leveraging demographic data and imaging parameters as text prompts, TUMSyn achieves diverse cross-sequence synthesis tasks using a unified model. To enhance the efficacy of text features in steering synthesis, we pre-train a text encoder using a contrastive learning strategy to align and fuse image and text semantic information. Developed and evaluated on a multi-center dataset of over 20K brain MR image-text pairs with 7 structural MR contrasts, spanning almost the entire age spectrum and various physical conditions, TUMSyn demonstrates performance comparable to or exceeding that of task-specific methods in both supervised and zero-shot settings, and the synthesized images exhibit accurate anatomical morphology suitable for various downstream clinical tasks. In summary, by incorporating text metadata into image synthesis, TUMSyn's accuracy, versatility, and generalizability position it as a powerful augmentative tool for conventional MRI systems, offering rapid and cost-effective acquisition of multi-sequence MR images for clinical and research applications.
Yulin Wang, Honglin Xiong, Yi Xie, Jiameng Liu, Qian Wang, Qian Liu, Dinggang Shen
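
The text-encoder pre-training mentioned above relies on contrastive image-text alignment; below is a standard CLIP-style symmetric InfoNCE loss as one plausible formulation (an assumption for illustration, not necessarily TUMSyn's exact objective).

    import torch
    import torch.nn.functional as F

    def contrastive_image_text_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired image/text embeddings.
        Matching pairs sit on the diagonal of the similarity matrix."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature       # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))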
SAMU: An Efficient and Promptable Foundation Model for Medical Image Segmentation
Abstract
Segmentation of 3D medical images is a labor-intensive task with important clinical applications. Recently, foundation models for image segmentation have received significant interest. Specifically, many works have proposed methods for the adaptation of promptable natural image foundation models to medical image segmentation. However, the shift to 3D volumes from 2D natural images has proven difficult, and many approaches have limited real-world clinical applicability due to large model sizes and corresponding heavy computational requirements. Here, we present an original model for generalized, promptable 3D medical image segmentation. Our approach leverages a lightweight convolutional backbone while simultaneously integrating information from single-point prompts at multiple spatial resolutions. Our approach dramatically reduces the computational burden for promptable segmentation while also outperforming similar recent works on a diverse dataset of 98,699 image-mask pairs from CT and MRI datasets.
Joseph Bae, Xueqi Guo, Halid Yerebakan, Yoshihisa Shinagawa, Sepehr Farhand
Anatomical Embedding-Based Training Method for Medical Image Segmentation Foundation Models
Abstract
Existing training methods for medical image foundation models primarily focus on tasks such as image restoration, overlooking the potential of harnessing the inherent anatomical knowledge of the human body. The discrepancy between the training tasks of foundation models and downstream tasks often necessitates model fine-tuning for each specific application. An insufficiently sized downstream training set can lead to catastrophic forgetting in the foundation model. To address these issues, we propose a novel unsupervised training method for medical image foundation models. Our approach incorporates an anatomical embedding task, enabling the model to generate anatomically related embeddings for each voxel. To expedite training and accommodate large-scale models, we employ momentum contrast learning, which we further enhance to suit the anatomical embedding task. To improve the model's performance on specific targets, we introduce a region contrastive loss that uses a small set of segmentation labels (e.g., five samples) to identify the focused regions during training. In our experiments, we pre-train the foundation model on a dataset of 4000 unlabeled abdominal CT scans, with few-shot segmentation of 13 abdominal organs as the downstream task. The results show significant improvements in the downstream segmentation task, particularly in scenarios with limited segmentation annotations, compared to methods without pre-training and to similar foundation models. The trained models and the downstream training code have been open-sourced at https://github.com/DlutMedimgGroup/Anatomy-Embedding-Foundation-Model.
Mingrui Zhuang, Rui Xu, Qinhe Zhang, Ailian Liu, Xin Fan, Hongkai Wang
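
A minimal sketch of the momentum-contrast ingredient described above: the key encoder is updated as an exponential moving average of the query encoder, and matching voxel embeddings are pulled together with an InfoNCE loss against a set of negatives. This follows the generic MoCo recipe; the anatomical embedding task and the region contrastive loss are specific to the paper and not reproduced here.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def momentum_update(query_encoder, key_encoder, m=0.999):
        """EMA update of the key (momentum) encoder from the query encoder."""
        for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
            k.data.mul_(m).add_(q.data, alpha=1 - m)

    def info_nce(q, k_pos, k_neg, temperature=0.1):
        """q: (N, D) query voxel embeddings; k_pos: (N, D) matching embeddings
        from the momentum encoder; k_neg: (K, D) negatives (e.g., a queue)."""
        q, k_pos, k_neg = (F.normalize(x, dim=-1) for x in (q, k_pos, k_neg))
        l_pos = (q * k_pos).sum(-1, keepdim=True)              # (N, 1)
        l_neg = q @ k_neg.t()                                   # (N, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / temperature
        targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, targets)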
Boosting Vision-Language Models for Histopathology Classification: Predict All at Once
Abstract
The development of vision-language models (VLMs) for histopathology has shown promising new usages and zero-shot performances. However, current approaches, which decompose large slides into smaller patches, focus solely on inductive classification, i.e., prediction for each patch is made independently of the other patches in the target test data. We extend the capability of these large models by introducing a transductive approach. By using text-based predictions and affinity relationships among patches, our approach leverages the strong zero-shot capabilities of these new VLMs without any additional labels. Our experiments cover four histopathology datasets and five different VLMs. Operating solely in the embedding space (i.e., in a black-box setting), our approach is highly efficient, processing 10^5 patches in just a few seconds, and shows significant accuracy improvements over inductive zero-shot classification. Code available at https://github.com/FereshteShakeri/Histo-TransCLIP.
Maxime Zanella, Fereshteh Shakeri, Yunshi Huang, Houda Bahig, Ismail Ben Ayed
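
A minimal sketch of the transductive step described above, working purely in the embedding space: start from text-based zero-shot probabilities and smooth them over a k-nearest-neighbour affinity graph of patch embeddings. This is generic label propagation with assumed hyperparameters, not the exact Histo-TransCLIP objective.

    import torch
    import torch.nn.functional as F

    def transductive_refine(patch_emb, text_emb, k=10, alpha=0.5, n_iter=10, tau=100.0):
        """patch_emb: (N, D) patch embeddings; text_emb: (C, D) class text embeddings.
        Returns refined (N, C) class probabilities."""
        patch_emb = F.normalize(patch_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        p0 = (tau * patch_emb @ text_emb.t()).softmax(dim=-1)   # zero-shot predictions

        sim = patch_emb @ patch_emb.t()                          # (N, N) patch affinities
        topk = sim.topk(k + 1, dim=-1)                           # keep k neighbours (+ self)
        w = torch.zeros_like(sim).scatter_(1, topk.indices, topk.values.clamp(min=0))
        w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-8)      # row-normalized graph

        p = p0.clone()
        for _ in range(n_iter):                                  # smooth over the graph
            p = alpha * (w @ p) + (1 - alpha) * p0
        return p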
MAGDA: Multi-agent Guideline-Driven Diagnostic Assistance
Abstract
In emergency departments, rural hospitals, or clinics in less developed regions, clinicians often lack fast image analysis by trained radiologists, which can have a detrimental effect on patients' healthcare. Large Language Models (LLMs) have the potential to alleviate some of the pressure on these clinicians by providing insights that can help them in their decision-making. While these LLMs achieve high scores on medical exams, showcasing their great theoretical medical knowledge, they tend not to follow medical guidelines. In this work, we introduce a new approach for zero-shot guideline-driven decision support. We model a system of multiple LLM agents, augmented with a contrastive vision-language model, that collaborate to reach a patient diagnosis. After being provided with simple diagnostic guidelines, the agents synthesize prompts and screen the image for findings that follow these guidelines. Finally, they provide understandable chain-of-thought reasoning for their diagnosis, which is then self-refined to consider inter-dependencies between diseases. As our method is zero-shot, it is adaptable to settings with rare diseases, where training data is limited but expert-crafted disease descriptions are available. We evaluate our method on two chest X-ray datasets, CheXpert and ChestX-ray 14 Longtail, showcasing performance improvements over existing zero-shot methods and generalizability to rare diseases.
David Bani-Harouni, Nassir Navab, Matthias Keicher
Backmatter
Metadata
Title
Foundation Models for General Medical AI
Editors
Zhongying Deng
Yiqing Shen
Hyunwoo J. Kim
Won-Ki Jeong
Angelica I. Aviles-Rivero
Junjun He
Shaoting Zhang
Copyright Year
2025
Electronic ISBN
978-3-031-73471-7
Print ISBN
978-3-031-73470-0
DOI
https://doi.org/10.1007/978-3-031-73471-7
