2024 | Book

Machine Learning in Medical Imaging

14th International Workshop, MLMI 2023, Held in Conjunction with MICCAI 2023, Vancouver, BC, Canada, October 8, 2023, Proceedings, Part I


About this book

The two-volume set LNCS 14348 and 14349 constitutes the proceedings of the 14th International Workshop on Machine Learning in Medical Imaging, MLMI 2023, held in conjunction with MICCAI 2023, in Vancouver, Canada, in October 2023.
The 93 full papers presented in the proceedings were carefully reviewed and selected from 139 submissions. They focus on major trends and challenges in artificial intelligence and machine learning in the medical imaging field, translating medical imaging research into clinical practice. Topics of interest included deep learning, generative adversarial learning, ensemble learning, transfer learning, multi-task learning, manifold learning, and reinforcement learning, along with their applications to medical image analysis, computer-aided diagnosis, multi-modality fusion, image reconstruction, image retrieval, cellular image analysis, molecular imaging, digital pathology, etc.

Table of Contents

Structural MRI Harmonization via Disentangled Latent Energy-Based Style Translation

Multi-site brain magnetic resonance imaging (MRI) has been widely used in clinical and research domains, but usually is sensitive to non-biological variations caused by site effects (e.g., field strengths and scanning protocols). Several retrospective data harmonization methods have shown promising results in removing these non-biological variations at feature or whole-image level. Most existing image-level harmonization methods are implemented through generative adversarial networks, which are generally computationally expensive and generalize poorly on independent data. To this end, this paper proposes a disentangled latent energy-based style translation (DLEST) framework for image-level structural MRI harmonization. Specifically, DLEST disentangles site-invariant image generation and site-specific style translation via a latent autoencoder and an energy-based model. The autoencoder learns to encode images into low-dimensional latent space, and generates faithful images from latent codes. The energy-based model is placed in between the encoding and generation steps, facilitating style translation from a source domain to a target domain implicitly. This allows highly generalizable image generation and efficient style translation through the latent space. We train our model on 4,092 T1-weighted MRIs in 3 tasks: histogram comparison, acquisition site classification, and brain tissue segmentation. Qualitative and quantitative results demonstrate the superiority of our approach, which generally outperforms several state-of-the-art methods.

Mengqi Wu, Lintao Zhang, Pew-Thian Yap, Weili Lin, Hongtu Zhu, Mingxia Liu
Cross-Domain Iterative Network for Simultaneous Denoising, Limited-Angle Reconstruction, and Attenuation Correction of Cardiac SPECT

Single-Photon Emission Computed Tomography (SPECT) is widely applied for the diagnosis of ischemic heart diseases. Low-dose (LD) SPECT aims to minimize radiation exposure but leads to increased image noise. Limited-angle (LA) SPECT enables faster scanning and reduced hardware costs but results in lower reconstruction accuracy. Additionally, computed tomography (CT)-derived attenuation maps (μ-maps) are commonly used for SPECT attenuation correction (AC), but this causes extra radiation exposure and SPECT-CT misalignments. Although various deep learning methods have been introduced to address these limitations separately, a solution that addresses all of them simultaneously remains highly under-explored and challenging. To this end, we propose a Cross-domain Iterative Network (CDI-Net) for simultaneous denoising, LA reconstruction, and CT-free AC in cardiac SPECT. In CDI-Net, paired projection- and image-domain networks are end-to-end connected to fuse the cross-domain emission and anatomical information in multiple iterations. Adaptive Weight Recalibrators (AWR) adjust the multi-channel input features to further enhance prediction accuracy. Our experiments using clinical data showed that CDI-Net produced more accurate μ-maps, projections, and AC reconstructions compared to existing approaches that addressed each task separately. Ablation studies demonstrated the significance of cross-domain and cross-iteration connections, as well as AWR, in improving the reconstruction performance. The source code of this work is released at .

Xiongchao Chen, Bo Zhou, Huidong Xie, Xueqi Guo, Qiong Liu, Albert J. Sinusas, Chi Liu
Arbitrary Reduction of MRI Inter-slice Spacing Using Hierarchical Feature Conditional Diffusion

Magnetic resonance (MR) images collected in 2D scanning protocols typically have large inter-slice spacing, resulting in high in-plane resolution but reduced through-plane resolution. Super-resolution techniques can reduce the inter-slice spacing of 2D scanned MR images, facilitating the downstream visual experience and computer-aided diagnosis. However, most existing super-resolution methods are trained at a fixed scaling ratio, which is inconvenient in clinical settings where MR scanning may have varying inter-slice spacings. To solve this issue, we propose Hierarchical Feature Conditional Diffusion (HiFi-Diff) for arbitrary reduction of MR inter-slice spacing. Given two adjacent MR slices and the relative positional offset, HiFi-Diff can iteratively convert a Gaussian noise map into any desired in-between MR slice. Furthermore, to enable fine-grained conditioning, the Hierarchical Feature Extraction (HiFE) module is proposed to hierarchically extract conditional features and conduct element-wise modulation. Our experimental results on the publicly available HCP-1200 dataset demonstrate the high-fidelity super-resolution capability of HiFi-Diff and its efficacy in enhancing downstream segmentation performance.

Xin Wang, Zhenrong Shen, Zhiyun Song, Sheng Wang, Mengjun Liu, Lichi Zhang, Kai Xuan, Qian Wang
Reconstruction of 3D Fetal Brain MRI from 2D Cross-Sectional Acquisitions Using Unsupervised Learning Network

Fetal brain magnetic resonance imaging (MRI) is becoming more important for early brain assessment in prenatal examination. Fast acquisition of three cross-sectional series/views is often used to eliminate motion effects using single-shot fast spin-echo sequences. Although stacked in 3D volumes, these slices are essentially 2D images with large slice thickness and distances (4 to 6 mm), resulting in blurry multiplanar views. To better visualize and quantify fetal brains, it is desirable to reconstruct 3D images from different 2D cross-sectional series. In this paper, we present a super-resolution CNN-based network for 3D image reconstruction using unsupervised learning, referred to as cross-sectional image reconstruction (C-SIR). The key idea is that different cross-sectional images can help each other in training the C-SIR model. Additionally, existing high-resolution data can also be used for pre-training the network in a supervised manner. In experiments, we show that such a network can be trained to reconstruct 3D images using simulated down-sampled adult images with much better image quality and image segmentation accuracy. Then, we illustrate that the proposed C-SIR approach generates clearer 3D fetal images than other algorithms.

Yimeng Yang, Dongdong Gu, Xukun Zhang, Zhongxiang Ding, Fei Gao, Zhong Xue, Dinggang Shen
Robust Unsupervised Super-Resolution of Infant MRI via Dual-Modal Deep Image Prior

Magnetic resonance imaging (MRI) is commonly used for studying infant brain development. However, due to the lengthy image acquisition time and limited subject compliance, high-quality infant MRI can be challenging. Without imposing additional burden on image acquisition, image super-resolution (SR) can be used to enhance image quality post-acquisition. Most SR techniques are supervised and trained on multiple aligned low-resolution (LR) and high-resolution (HR) image pairs, which in practice are not usually available. Unlike supervised approaches, Deep Image Prior (DIP) can be employed for unsupervised single-image SR, utilizing solely the input LR image for de novo optimization to produce an HR image. However, determining when to stop early in DIP training is non-trivial and presents a challenge to fully automating the SR process. To address this issue, we constrain the low-frequency k-space of the SR image to be similar to that of the LR image. We further improve performance by designing a dual-modal framework that leverages shared anatomical information between T1-weighted and T2-weighted images. We evaluated our model, dual-modal DIP (dmDIP), on infant MRI data acquired from birth to one year of age, demonstrating that enhanced image quality can be obtained with substantially reduced sensitivity to early stopping.

Cheng Che Tsai, Xiaoyang Chen, Sahar Ahmad, Pew-Thian Yap
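
The low-frequency k-space constraint mentioned in the dmDIP abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the square `radius` window, and the assumption that the LR image has already been resampled onto the SR grid (`lr_up`) are all hypothetical choices made for illustration, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath

def dft2(img):
    """Naive 2D discrete Fourier transform of a square list-of-lists image."""
    n = len(img)
    out = []
    for u in range(n):
        row = []
        for v in range(n):
            s = 0j
            for y in range(n):
                for x in range(n):
                    s += img[y][x] * cmath.exp(-2j * cmath.pi * (u * y + v * x) / n)
            row.append(s)
        out.append(row)
    return out

def low_freq_kspace_loss(sr, lr_up, radius=1):
    """Mean squared k-space difference over the low-frequency window only.

    Assumption (hypothetical): `lr_up` is the LR image already resampled
    to the same grid as the SR estimate `sr`.
    """
    n = len(sr)
    k_sr, k_lr = dft2(sr), dft2(lr_up)
    total, count = 0.0, 0
    for u in range(n):
        for v in range(n):
            # Frequency index distance from DC, accounting for wrap-around.
            if min(u, n - u) <= radius and min(v, n - v) <= radius:
                total += abs(k_sr[u][v] - k_lr[u][v]) ** 2
                count += 1
    return total / count

n = 8
base = [[float((x + y) % 3) for x in range(n)] for y in range(n)]
# Adding a checkerboard changes only the highest spatial frequency.
checker = [[base[y][x] + 0.5 * (-1) ** (x + y) for x in range(n)] for y in range(n)]
# Adding a constant changes the DC term, which is a low frequency.
shifted = [[base[y][x] + 1.0 for x in range(n)] for y in range(n)]

print(low_freq_kspace_loss(checker, base))  # close to zero
print(low_freq_kspace_loss(shifted, base))  # clearly positive
```

In this toy setting the penalty ignores added high-frequency detail but reacts to a change in overall intensity, which is the behaviour one would want from a term that ties the coarse content of the SR output to the LR input.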
SR4ZCT: Self-supervised Through-Plane Resolution Enhancement for CT Images with Arbitrary Resolution and Overlap

Computed tomography (CT) is a widely used non-invasive medical imaging technique for disease diagnosis. The diagnostic accuracy is often affected by image resolution, which can be insufficient in practice. For medical CT images, the through-plane resolution is often worse than the in-plane resolution and there can be overlap between slices, causing difficulties in diagnoses. Self-supervised methods for through-plane resolution enhancement, which train on in-plane images and infer on through-plane images, have shown promise for both CT and MRI imaging. However, existing self-supervised methods either neglect overlap or can only handle specific cases with fixed combinations of resolution and overlap. To address these limitations, we propose a self-supervised method called SR4ZCT. It employs the same off-axis training approach while being capable of handling arbitrary combinations of resolution and overlap. Our method explicitly models the relationship between resolutions and voxel spacings of different planes to accurately simulate training images that match the original through-plane images. We highlight the significance of accurate modeling in self-supervised off-axis training and demonstrate the effectiveness of SR4ZCT using a real-world dataset.

Jiayang Shi, Daniël M. Pelt, K. Joost Batenburg
unORANIC: Unsupervised Orthogonalization of Anatomy and Image-Characteristic Features

We introduce unORANIC, an unsupervised approach that uses an adapted loss function to drive the orthogonalization of anatomy and image-characteristic features. The method is versatile for diverse modalities and tasks, as it does not require domain knowledge, paired data samples, or labels. During test time unORANIC is applied to potentially corrupted images, orthogonalizing their anatomy and characteristic components, to subsequently reconstruct corruption-free images, showing their domain-invariant anatomy only. This feature orthogonalization further improves generalization and robustness against corruptions. We confirm this qualitatively and quantitatively on 5 distinct datasets by assessing unORANIC’s classification accuracy, corruption detection and revision capabilities. Our approach shows promise for enhancing the generalizability and robustness of practical applications in medical image analysis. The source code is available at .

Sebastian Doerrich, Francesco Di Salvo, Christian Ledig
An Investigation of Different Deep Learning Pipelines for GABA-Edited MRS Reconstruction

Edited magnetic resonance spectroscopy (MRS) can provide localized information on gamma-aminobutyric acid (GABA) concentration in vivo. However, edited-MRS scans are long because many acquisitions, known as transients, need to be collected and averaged to obtain a high-quality spectrum for reliable GABA quantification. In this work, we investigate Deep Learning (DL) pipelines for the reconstruction of GABA-edited MRS spectra using only a quarter of the transients typically acquired. We compared two neural network architectures: a 1D U-NET and a proposed dimension-reducing 2D U-NET (Rdc-UNET2D). We also compared the impact of training the DL pipelines using solely in vivo data or pre-training the models on simulated data followed by fine-tuning on in vivo data. Results for this study showed that the proposed Rdc-UNET2D model pre-trained on simulated data and fine-tuned on in vivo data had the best performance among the different DL pipelines compared. This model obtained a higher SNR and a lower fit error than a conventional reconstruction pipeline using the full amount of transients typically acquired. This indicates that through DL it is possible to reduce GABA-edited MRS scan times by a factor of four while maintaining or improving data quality. In the spirit of open science, the code and data to reproduce our pipeline are publicly available.

Rodrigo Berto, Hanna Bugler, Roberto Souza, Ashley Harris
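
To see why using a quarter of the transients is demanding, recall that for plain averaging with uncorrelated noise of equal variance, SNR grows with the square root of the number of averaged transients. The sketch below works through that arithmetic; the transient count of 320 is a hypothetical number chosen for illustration and does not come from the abstract.

```python
import math

def relative_snr(n_used, n_full):
    """SNR of an average of n_used transients relative to n_full transients.

    Assumes uncorrelated noise with equal variance in every transient,
    so that averaging N transients improves SNR by sqrt(N).
    """
    return math.sqrt(n_used / n_full)

n_full = 320            # hypothetical full acquisition
n_used = n_full // 4    # a quarter of the transients
ratio = relative_snr(n_used, n_full)
print(ratio)  # 0.5: plain averaging of a quarter of the data halves the SNR
```

Under these assumptions a reconstruction pipeline would need to recover roughly a factor of two in SNR to match conventional averaging of the full acquisition.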
Towards Abdominal 3-D Scene Rendering from Laparoscopy Surgical Videos Using NeRFs

Given that a conventional laparoscope only provides a two-dimensional (2-D) view, the detection and diagnosis of medical ailments can be challenging. To overcome the visual constraints associated with laparoscopy, the use of laparoscopic images and videos to reconstruct the three-dimensional (3-D) anatomical structure of the abdomen has proven to be a promising approach. Neural Radiance Fields (NeRFs) have recently gained attention thanks to their ability to generate photorealistic images from a 3-D static scene, thus facilitating a more comprehensive exploration of the abdomen through the synthesis of new views. This distinguishes NeRFs from alternative methods such as Simultaneous Localization and Mapping (SLAM) and depth estimation. In this paper, we present a comprehensive examination of NeRFs in the context of laparoscopy surgical videos, with the goal of rendering abdominal scenes in 3-D. Although our experimental results are promising, the proposed approach encounters substantial challenges, which require further exploration in future research.

Khoa Tuan Nguyen, Francesca Tozzi, Nikdokht Rashidian, Wouter Willaert, Joris Vankerschaver, Wesley De Neve
Brain MRI to PET Synthesis and Amyloid Estimation in Alzheimer’s Disease via 3D Multimodal Contrastive GAN

Positron emission tomography (PET) can detect brain amyloid-β (Aβ) deposits, a diagnostic hallmark of Alzheimer’s disease and a target for disease modifying treatment. However, PET-Aβ is expensive, not widely available, and, unlike magnetic resonance imaging (MRI), exposes the patient to ionizing radiation. Here we propose a novel 3D multimodal generative adversarial network with contrastive learning to synthesize PET-Aβ images from cheaper, more accessible, and less invasive MRI scans (T1-weighted and fluid attenuated inversion recovery [FLAIR] images). In tests on independent samples of paired MRI/PET-Aβ data, our synthetic PET-Aβ images were of high quality with a structural similarity index measure of 0.94, which outperformed previously published methods. We also evaluated synthetic PET-Aβ images by extracting standardized uptake value ratio measurements. The synthetic images could identify amyloid positive patients with a balanced accuracy of 79%, holding promise for potential future use in a diagnostic clinical setting.

Yan Jin, Jonathan DuBois, Chongyue Zhao, Liang Zhan, Audrey Gabelle, Neda Jahanshad, Paul M. Thompson, Arie Gafson, Shibeshih Belachew
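
Balanced accuracy, the metric reported above for identifying amyloid-positive patients, is the mean of sensitivity and specificity. The following sketch shows the computation on made-up labels; the data are purely illustrative and unrelated to the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # fraction of positives found
    specificity = tn / (tn + fp)  # fraction of negatives found
    return (sensitivity + specificity) / 2

# Illustrative labels only: 4 positives, 2 negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
print(score)  # sensitivity 0.75, specificity 0.5, balanced accuracy 0.625
```

Unlike plain accuracy, this measure is not inflated when one class dominates the sample, which is why it is often preferred for imbalanced diagnostic cohorts.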
Accelerated MRI Reconstruction via Dynamic Deformable Alignment Based Transformer

Magnetic resonance imaging (MRI) is a slow diagnostic technique due to its time-consuming acquisition. To address this, parallel imaging and compressed sensing methods were developed. Parallel imaging acquires multiple anatomy views simultaneously, while compressed sensing acquires fewer samples than traditional methods. However, reconstructing images from undersampled multi-coil data remains challenging. Existing methods concatenate input slices and adjacent slices along the channel dimension to gather more information for MRI reconstruction. Implicit feature alignment within adjacent slices is crucial for optimal reconstruction performance. Hence, we propose MFormer: an accelerated MRI reconstruction transformer with cascading MFormer blocks containing multi-scale Dynamic Deformable Swin Transformer (DST) modules. Unlike other methods, our DST modules implicitly align adjacent slice features using dynamic deformable convolution and extract local and non-local features before merging information. We adapt to input variations by aggregating deformable convolution kernel weights and biases through a dynamic weight predictor. Extensive experiments on Stanford2D, Stanford3D, and large-scale FastMRI datasets show the merits of our contributions, achieving state-of-the-art MRI reconstruction performance. Our code and models are available at .

Wafa Alghallabi, Akshay Dudhane, Waqas Zamir, Salman Khan, Fahad Shahbaz Khan
Deformable Cross-Attention Transformer for Medical Image Registration

Transformers have recently shown promise for medical image applications, leading to an increasing interest in developing such models for medical image registration. Recent advancements in designing registration Transformers have focused on using cross-attention (CA) to enable a more precise understanding of spatial correspondences between moving and fixed images. Here, we propose a novel CA mechanism that computes windowed attention using deformable windows. In contrast to existing CA mechanisms that require intensive computational complexity by either computing CA globally or locally with a fixed and expanded search window, the proposed deformable CA can selectively sample a diverse set of features over a large search window while maintaining low computational complexity. The proposed model was extensively evaluated on multi-modal, mono-modal, and atlas-to-patient registration tasks, demonstrating promising performance against state-of-the-art methods and indicating its effectiveness for medical image registration. The source code for this work is available at .

Junyu Chen, Yihao Liu, Yufan He, Yong Du
Deformable Medical Image Registration Under Distribution Shifts with Neural Instance Optimization

Deep-learning deformable image registration methods often struggle if test-image characteristics shift away from the training domain, for example through large variations in anatomy or contrast changes with different imaging protocols. Gradient descent-based instance optimization is often introduced to refine the solution of deep-learning methods, but the performance gain is minimal due to the high degree of freedom in the solution and the absence of robust initial deformation. In this paper, we propose a new instance optimization method, Neural Instance Optimization (NIO), to correct the bias in the deformation field caused by the distribution shifts for deep-learning methods. Our method naturally leverages the inductive bias of the convolutional neural network, the prior knowledge learned from the training domain and the multi-resolution optimization strategy to fully adapt a learning-based method to individual image pairs, avoiding registration failure during the inference phase. We evaluate our method with gold standard, human cortical and subcortical segmentation, and manually identified anatomical landmarks to contrast NIO’s performance with conventional and deep-learning approaches. Our method compares favourably with both approaches and significantly improves the performance of deep-learning methods under distribution shifts with 1.5% to 3.0% and 2.3% to 6.2% gains in registration accuracy and robustness, respectively.

Tony C. W. Mok, Zi Li, Yingda Xia, Jiawen Yao, Ling Zhang, Jingren Zhou, Le Lu
Implicitly Solved Regularization for Learning-Based Image Registration

Deformable image registration is a fundamental step in many medical image analysis tasks and has attracted a large amount of research to develop efficient and accurate unsupervised machine learning approaches. Although much attention has been paid to finding suitable similarity measures, network architectures and training methods, much less research has been devoted to suitable regularization techniques to ensure the plausibility of the learned deformations. In this paper, we propose implicitly solved regularizers for unsupervised and weakly supervised learning of deformable image registration. In place of pure gradient descent with automatic differentiation, we combine efficient implicit solvers for the regularization term with the established gradient-based optimization regarding the network parameters. As a result, our approach is broadly applicable and can be combined with a range of similarity measures and network architectures. Our experiments with state-of-the-art network architectures show that the proposed approach has the potential to increase the smoothness, i.e. the plausibility, of the learned deformations and the registration accuracy measured as Dice overlaps. Furthermore, we show that due to efficient GPU implementations of the implicit solvers, this increase in plausibility and accuracy comes at almost no additional cost in terms of computational time.

Jan Ehrhardt, Heinz Handels
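
Registration accuracy in the abstract above is measured as Dice overlap between segmentations. As a reminder of what that metric computes, here is a minimal sketch on two small binary masks; the masks are made up for illustration, and returning 1.0 for two empty masks is a convention chosen here, not something stated in the source.

```python
def dice(mask_a, mask_b):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) of two binary masks (lists of rows)."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    size = sum(a) + sum(b)
    # Convention used in this sketch: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2 * intersection / size

# Illustrative masks: 3 foreground pixels each, 2 of them shared.
warped_label = [[1, 1, 0],
                [0, 1, 0]]
fixed_label = [[1, 0, 0],
               [0, 1, 1]]
overlap = dice(warped_label, fixed_label)
print(overlap)  # 2 * 2 / (3 + 3), about 0.667
```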
BHSD: A 3D Multi-class Brain Hemorrhage Segmentation Dataset

Intracranial hemorrhage (ICH) is a pathological condition characterized by bleeding inside the skull or brain, which can be attributed to various factors. Identifying, localizing and quantifying ICH has important clinical implications, in a bleed-dependent manner. While deep learning techniques are widely used in medical image segmentation and have been applied to the ICH segmentation task, existing public ICH datasets do not support the multi-class segmentation problem. To address this, we develop the Brain Hemorrhage Segmentation Dataset (BHSD), which provides a 3D multi-class ICH dataset containing 192 volumes with pixel-level annotations and 2200 volumes with slice-level annotations across five categories of ICH. To demonstrate the utility of the dataset, we formulate a series of supervised and semi-supervised ICH segmentation tasks. We provide experimental results with state-of-the-art models as reference benchmarks for further model developments and evaluations on this dataset. The dataset and checkpoint are available at .

Biao Wu, Yutong Xie, Zeyu Zhang, Jinchao Ge, Kaspar Yaxley, Suzan Bahadir, Qi Wu, Yifan Liu, Minh-Son To
Contrastive Learning-Based Breast Tumor Segmentation in DCE-MRI

Precise and automated segmentation of tumors from breast dynamic contrast-enhanced magnetic resonance images (DCE-MRI) is crucial for obtaining quantitative morphological and functional information, thereby assisting subsequent diagnosis and treatment. However, many existing methods mainly focus on features within tumor regions and neglect enhanced background tissues, leading to the potential over-segmentation problem. To better distinguish tumor tissues from complex background structures (e.g., enhanced vessels), we propose a novel approach based on contrastive feature learning. Our method involves pre-training a highly sensitive encoder using contrastive learning, where tumor and background patches are utilized as paired positive-negative samples, to emphasize tumor tissues and to enhance their discriminative features. Furthermore, the well-trained encoder is employed for accurate tumor segmentation by using a feature fusion module in a global-to-local manner. Through extensive validations using a large dataset of breast DCE-MRI scans, our proposed model demonstrates superior segmentation performance, effectively reducing over-segmentation on enhanced tissue regions as expected.

Shanshan Guo, Jiadong Zhang, Dongdong Gu, Fei Gao, Yiqiang Zhan, Zhong Xue, Dinggang Shen
FFPN: Fourier Feature Pyramid Network for Ultrasound Image Segmentation

Ultrasound (US) image segmentation is an active research area that requires real-time and highly accurate analysis in many scenarios. The detect-to-segment (DTS) frameworks have been recently proposed to balance accuracy and efficiency. However, existing approaches may suffer from inadequate contour encoding or fail to effectively leverage the encoded results. In this paper, we introduce a novel Fourier-anchor-based DTS framework called Fourier Feature Pyramid Network (FFPN) to address the aforementioned issues. The contributions of this paper are twofold. First, the FFPN utilizes Fourier Descriptors to adequately encode contours. Specifically, it maps Fourier series with similar amplitudes and frequencies into the same layer of the feature map, thereby effectively utilizing the encoded Fourier information. Second, we propose a Contour Sampling Refinement (CSR) module based on the contour proposals and refined features produced by the FFPN. This module extracts rich features around the predicted contours to further capture detailed information and refine the contours. Extensive experimental results on three large and challenging datasets demonstrate that our method outperforms other DTS methods in terms of accuracy and efficiency. Furthermore, our framework can generalize well to other detection or segmentation tasks.

Chaoyu Chen, Xin Yang, Rusi Chen, Junxuan Yu, Liwei Du, Jian Wang, Xindi Hu, Yan Cao, Yingying Liu, Dong Ni
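
The FFPN abstract above relies on Fourier Descriptors to encode contours. A generic sketch of that encoding follows: contour points are treated as complex numbers, transformed with a discrete Fourier transform, and rebuilt from only the lowest frequencies. This illustrates the standard descriptor idea, not the FFPN layer assignment or its refinement module; the function names and the `keep` parameter are hypothetical.

```python
import cmath

def fourier_descriptors(points):
    """DFT coefficients of a closed contour given as (x, y) points."""
    n = len(points)
    z = [complex(x, y) for x, y in points]
    return [sum(z[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
            for k in range(n)]

def reconstruct(coeffs, keep):
    """Rebuild the contour from coefficients with |frequency| <= keep."""
    n = len(coeffs)
    points = []
    for t in range(n):
        z = 0j
        for k in range(n):
            freq = k if k <= n // 2 else k - n  # signed frequency of bin k
            if abs(freq) <= keep:
                z += coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n)
        points.append((z.real, z.imag))
    return points

# A circle of radius 2 centred at (3, 1), sampled at 16 points.
n = 16
circle = [(3 + 2 * cmath.cos(2 * cmath.pi * t / n).real,
           1 + 2 * cmath.sin(2 * cmath.pi * t / n).real) for t in range(n)]
coeffs = fourier_descriptors(circle)
rebuilt = reconstruct(coeffs, keep=1)
max_error = max(abs(complex(*p) - complex(*q)) for p, q in zip(circle, rebuilt))
print(coeffs[0], coeffs[1], max_error)
```

For this circle, coefficient 0 holds the centre and coefficient 1 holds the radius, so two complex numbers describe the whole contour. More irregular shapes need more coefficients, with higher frequencies carrying finer boundary detail.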
Mammo-SAM: Adapting Foundation Segment Anything Model for Automatic Breast Mass Segmentation in Whole Mammograms

Automated breast mass segmentation from mammograms is crucial for assisting radiologists in timely and accurate breast cancer diagnosis. Segment Anything Model (SAM) has recently demonstrated remarkable success in natural image segmentation, suggesting its potential for enhancing artificial intelligence-based automated diagnostic systems. Unfortunately, we observe that the zero-shot performance of SAM in mass segmentation falls short of usability. Therefore, fine-tuning SAM for transfer learning is necessary. However, full-tuning is cost-intensive for foundation models, making it unacceptable in clinical practice. To tackle this problem, in this paper, we propose a parameter-efficient fine-tuning framework named Mammo-SAM, which significantly improves the performance of SAM on the challenging task of mass segmentation. Our key insight includes a tailored adapter to explore multi-scale features and a re-designed CNN-style decoder for precise segmentation. Extensive experiments on the public datasets CBIS-DDSM and INbreast demonstrate that our proposed Mammo-SAM surpasses existing mass segmentation methods and other tuning paradigms designed for SAM, achieving new state-of-the-art performance.

Xinyu Xiong, Churan Wang, Wenxue Li, Guanbin Li
Consistent and Accurate Segmentation for Serial Infant Brain MR Images with Registration Assistance

The infant brain develops dramatically during the first two years of life. Accurate segmentation of brain tissues is essential to understand the early development of both normal and disease changes. However, the segmentation results of the same subject could demonstrate unexpectedly large variations across different time points, which may even lead to inaccurate and inconsistent results in charting infant brain development. In this paper, we propose a deep learning framework that simultaneously exploits registration and segmentation to guarantee the longitudinal consistency among the segmentation results. Firstly, a manual label-guided registration model is designed to quickly and accurately obtain the warped images from other time points. Secondly, a segmentation network with a longitudinal consistency constraint is developed to effectively obtain the temporal segmentation results. Thus, our proposed segmentation network could exploit the tissue information of warped intensity images from other time points to aid in segmenting the isointense phase (approximately 6–8 months) data, which is the most difficult case due to the low intensity contrast of tissues. Extensive experiments on infant brain images have shown improved performance achieved by our proposed method, compared with the existing state-of-the-art methods.

Yuhang Sun, Jiameng Liu, Feihong Liu, Kaicong Sun, Han Zhang, Feng Shi, Qianjin Feng, Dinggang Shen
Unifying and Personalizing Weakly-Supervised Federated Medical Image Segmentation via Adaptive Representation and Aggregation

Federated learning (FL) enables multiple sites to collaboratively train powerful deep models without compromising data privacy and security. The statistical heterogeneity (e.g., non-IID data and domain shifts) is a primary obstacle in FL, impairing the generalization performance of the global model. Weakly supervised segmentation, which uses sparsely-grained (i.e., point-, bounding box-, scribble-, block-wise) supervision, is attracting increasing attention due to its great potential for reducing annotation costs. However, there may exist label heterogeneity, i.e., different annotation forms across sites. In this paper, we propose a novel personalized FL framework for medical image segmentation, named FedICRA, which uniformly leverages heterogeneous weak supervision via adaptIve Contrastive Representation and Aggregation. Concretely, to facilitate personalized modeling and to avoid confusion, a channel selection based site contrastive representation module is employed to adaptively cluster intra-site embeddings and separate inter-site ones. To effectively integrate the common knowledge from the global model with the unique knowledge from each local model, an adaptive aggregation module is applied for updating and initializing local models at the element level. Additionally, a weakly supervised objective function that leverages a multiscale tree energy loss and a gated CRF loss is employed to generate more precise pseudo-labels and further boost the segmentation performance. Through extensive experiments on two distinct medical image segmentation tasks of different modalities, the proposed FedICRA demonstrates overwhelming performance over other state-of-the-art personalized FL methods. Its performance even approaches that of fully supervised training on centralized data. Our code and data are available at .

Li Lin, Jiewei Wu, Yixiang Liu, Kenneth K. Y. Wong, Xiaoying Tang
Unlocking Fine-Grained Details with Wavelet-Based High-Frequency Enhancement in Transformers

Medical image segmentation is a critical task that plays a vital role in diagnosis, treatment planning, and disease monitoring. Accurate segmentation of anatomical structures and abnormalities from medical images can aid in the early detection and treatment of various diseases. In this paper, we address the local feature deficiency of the Transformer model by carefully re-designing the self-attention map to produce accurate dense prediction in medical images. To this end, we first apply the wavelet transformation to decompose the input feature map into low-frequency (LF) and high-frequency (HF) subbands. The LF segment is associated with coarse-grained features, while the HF components preserve fine-grained features such as texture and edge information. Next, we reformulate the self-attention operation using the efficient Transformer to perform both spatial and context attention on top of the frequency representation. Furthermore, to intensify the importance of the boundary information, we impose an additional attention map by creating a Gaussian pyramid on top of the HF components. Moreover, we propose a multi-scale context enhancement block within skip connections to adaptively model inter-scale dependencies to overcome the semantic gap among stages of the encoder and decoder modules. Through comprehensive experiments, we demonstrate the effectiveness of our strategy on multi-organ and skin lesion segmentation benchmarks. The implementation code will be available upon acceptance. GitHub .

Reza Azad, Amirhossein Kazerouni, Alaa Sulaiman, Afshin Bozorgpour, Ehsan Khodapanah Aghdam, Abin Jose, Dorit Merhof
Prostate Segmentation Using Multiparametric and Multiplanar Magnetic Resonance Images

Diseases related to the prostate and distal urethra, such as prostate cancer, benign prostatic hyperplasia and urinary incontinence, may be detected and diagnosed through noninvasive medical imaging. T2-weighted (T2W) magnetic resonance imaging (MRI) is the most commonly used modality for prostate and urethral segmentation due to its distinguishable features of anatomical texture. In addition to T2W multiplanar images, which capture information in the axial, sagittal and coronal planes, multiparametric MRI modalities such as dynamic contrast enhanced (DCE) and diffusion-weighted imaging (DWI) are usually also acquired in the scanning process to measure functional features. Feature fusion by combining multiparametric and multiplanar images is challenging due to the movement of the patient during image acquisition, the need for accurate image registration and the sheer volume of available scans. Here we propose a multi-encoder deep neural network named 3DDOSPyResidualUSENet to learn anatomical and functional features from multiparametric and multiplanar MRI images. Our extensive experiments on a public dataset show that combining T2W axial, sagittal and coronal images along with DCE information and apparent diffusion coefficient (ADC) maps computed from DWI images results in increased segmentation performance.

Kuruparan Shanmugalingam, Arcot Sowmya, Daniel Moses, Erik Meijering
SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation

Image segmentation plays an essential role in nuclei image analysis. Recently, the segment anything model has made a significant breakthrough in such tasks. However, the current model has two major issues for cell segmentation: (1) The image encoder of the segment anything model involves a large number of parameters. Retraining or even fine-tuning the model still requires expensive computational resources. (2) In point prompt mode, points are sampled from the center of the ground truth and more than one set of points is expected to achieve reliable performance, which is not efficient for practical applications. In this paper, a single-point prompt network is proposed for nuclei image segmentation, called SPPNet. We replace the original image encoder with a lightweight vision transformer. Also, an effective convolutional block is added in parallel to extract the low-level semantic information from the image and compensate for the performance degradation due to the small image encoder. We propose a new point-sampling method based on the Gaussian kernel. The proposed model is evaluated on the MoNuSeg-2018 dataset. The results demonstrate that SPPNet outperforms existing U-shape architectures and shows faster convergence in training. Compared to the segment anything model, SPPNet shows roughly 20 times faster inference, with 1/70 of the parameters and computational cost. Particularly, only one set of points is required in both the training and inference phases, which is more reasonable for clinical applications. The code for our work and more technical details can be found at .

Qing Xu, Wenwei Kuang, Zeyu Zhang, Xueyao Bao, Haoran Chen, Wenting Duan
Automated Coarse-to-Fine Segmentation of Thoracic Duct Using Anatomy Priors and Topology-Guided Curved Planar Reformation

Recent studies have emphasized the importance of protecting the thoracic duct during radiation therapy (RT), as dose distributions in the thoracic duct may be associated with the development of radiation-induced lymphopenia. Because of the thoracic duct's thin size, curved geometry and extremely poor intensity contrast, manual delineation of the thoracic duct in RT planning CT is time-consuming and subject to large inter-observer variations. In this work, we aim to automatically and accurately segment the thoracic duct in RT planning CT, as the first attempt to tackle this clinically critical yet under-studied task. A two-stage coarse-to-fine segmentation approach is proposed. At the first stage, we automatically segment six chest organs and combine these organ predictions with the input planning CT to better infer and localize the thoracic duct. Given the coarse initial segmentation from the first stage, we subsequently extract the topology-corrected centerline of the initial thoracic duct segmentation at stage two, where curved planar reformation (CPR) is applied to transform the planning CT into a new 3D volume representation that provides a spatially smoother reformation of the thoracic duct along its elongated medial axis direction. The CPR-transformed CT is then employed as input to the second-stage deep segmentation network, and the output segmentation mask is transformed back to the original image space as the final segmentation. We evaluate our approach on 117 lung cancer patients with RT planning CT scans. Our approach significantly outperforms a strong baseline model based on nnUNet, reducing the Hausdorff distance error by 57% in relative terms (from 49.9 mm to 21.2 mm) and improving the Jaccard Index by 1.8% in absolute terms.
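
The two kinds of improvement quoted in the abstract (a relative reduction in Hausdorff distance and an absolute gain in Jaccard Index) can be made concrete with a short sketch. The helper names `relative_reduction` and `jaccard_index` are hypothetical, and the voxel sets below are toy data, not results from the paper; only the 49.9 mm and 21.2 mm figures come from the abstract.

```python
def relative_reduction(before, after):
    """Fraction by which an error shrinks, relative to its starting value."""
    return (before - after) / before


def jaccard_index(pred, target):
    """Intersection over union of two collections of voxel indices."""
    pred, target = set(pred), set(target)
    union = pred | target
    # Two empty masks are treated as a perfect match by convention here.
    return len(pred & target) / len(union) if union else 1.0


# Hausdorff distance figures reported in the abstract (in mm).
hd_reduction = relative_reduction(49.9, 21.2)

# Toy example: two masks that share 2 of 4 distinct voxels.
toy_jaccard = jaccard_index({1, 2, 3}, {2, 3, 4})
```

Here `hd_reduction` evaluates to about 0.575, consistent with the roughly 57% relative reduction stated above. An absolute gain, by contrast, is a plain difference between two index values.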

Puyang Wang, Panwen Hu, Jiali Liu, Hang Yu, Xianghua Ye, Jinliang Zhang, Hui Li, Li Yang, Le Lu, Dakai Jin, Feng-Ming (Spring) Kong
Leveraging Self-attention Mechanism in Vision Transformers for Unsupervised Segmentation of Optical Coherence Microscopy White Matter Images

A new microscope has been created to capture detailed images of the brain using a technology called optical coherence microscopy (OCM). However, there is still much to discover and understand about this valuable data. In this paper, we focus on the important task of segmenting the white matter in these high-resolution OCM images. A close-up, accurate segmentation of white matter tracts has the potential to enhance our knowledge of brain connections. We propose an unsupervised segmentation approach that leverages the self-attention mechanism of Vision Transformers (ViT). Our approach uses the output attention weights from a ViT pre-trained with Masked Image Modeling (MIM) to generate binary segmentations that we use as Pseudo-Ground-Truth (PGT) to train an additional segmentation model. Our method achieved superior performance when compared with classical unsupervised computer vision methods and common unsupervised deep learning architectures designed for natural images. Additionally, we compared our results with those of a supervised U-Net model trained on different numbers of labels and a semi-supervised approach where we selected the best-performing model based on labeled data. Our model achieved comparable results to the U-Net model trained on 30% of the labeled data. Furthermore, through fine-tuning, our model demonstrated an improvement of 3% over the supervised U-Nets. The code and data are available in the GitHub repository .

Mohamad Hawchar, Joël Lefebvre
PE-MED: Prompt Enhancement for Interactive Medical Image Segmentation

Interactive medical image segmentation refers to the accurate segmentation of the target of interest through interaction (e.g., click) between the user and the image. It has been widely studied in recent years as it is less dependent on abundant annotated data and more flexible than fully automated segmentation. However, current studies have not fully explored user-provided prompt information (e.g., points), including the knowledge mined in one interaction, and the relationship between multiple interactions. Thus, in this paper, we introduce a novel framework equipped with prompt enhancement, called PE-MED, for interactive medical image segmentation. First, we introduce a Self-Loop strategy to generate warm initial segmentation results based on the first prompt. It can prevent the highly unfavorable scenarios, such as encountering a blank mask as the initial input after the first interaction. Second, we propose a novel Prompt Attention Learning Module (PALM) to mine useful prompt information in one interaction, enhancing the responsiveness of the network to user clicks. Last, we build a Time Series Information Propagation (TSIP) mechanism to extract the temporal relationships between multiple interactions and increase the model stability. Comparative experiments with other state-of-the-art (SOTA) medical image segmentation algorithms show that our method exhibits better segmentation accuracy and stability.

Ao Chang, Xing Tao, Xin Yang, Yuhao Huang, Xinrui Zhou, Jiajun Zeng, Ruobing Huang, Dong Ni
A Super Token Vision Transformer and CNN Parallel Branch Network for mCNV Lesion Segmentation in OCT Images

Myopic choroidal neovascularization (mCNV) is a vision-threatening complication of high myopia characterized by the growth of abnormal blood vessels in the choroid layer of the eye. In OCT images, mCNV typically presents as a highly reflective area within the subretinal layer. Therefore, accurate segmentation of mCNV in OCT images can better assist clinicians in assessing the disease status and guiding treatment decisions. However, accurate segmentation in OCT images is highly challenging due to the presence of noise interference, complex lesion areas, and low contrast. Consequently, we propose a parallel-branch network architecture that combines super token vision transformer (STViT) and CNN to more efficiently capture global dependency and low-level feature details. The super token attention mechanism (STA) in STViT reduces the number of tokens in self-attention and preserves global modeling. Additionally, we create a novel feature fusion module that utilizes depth-wise separable convolutions to efficiently fuse multi-level features from two pathways. We conduct extensive experiments on an in-house OCT dataset and a public OCT dataset, and the results demonstrate that our proposed method achieves state-of-the-art segmentation performance.

Xiang Dong, Hai Xie, Yunlong Sun, Zhenquan Wu, Bao Yang, Junlong Qu, Guoming Zhang, Baiying Lei
Boundary-RL: Reinforcement Learning for Weakly-Supervised Prostate Segmentation in TRUS Images

We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We envision segmentation as a boundary detection problem, rather than a pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios such as where noise artefacts may be present within the region-of-interest (ROI) boundaries, where traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Of particular interest, ultrasound images, where intensity values represent acoustic impedance differences between boundaries, may also benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, serving as weak supervision, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, the only labels used during training of the entire boundary delineation framework. The use of a controller ensures that a sliding window over the entire image is not necessary and reduces possible false positives or negatives by minimising the number of patches passed to the boundary-presence classifier. We evaluate our approach for a clinically relevant task of prostate gland segmentation on trans-rectal ultrasound images. We show improved performance compared to other tested weakly supervised methods that use the same labels, e.g., multiple instance learning.

Weixi Yi, Vasilis Stavrinides, Zachary M. C. Baum, Qianye Yang, Dean C. Barratt, Matthew J. Clarkson, Yipeng Hu, Shaheer U. Saeed
A Domain-Free Semi-supervised Method for Myocardium Segmentation in 2D Echocardiography Sequences

Many deep learning methods have been applied in myocardium segmentation; however, the robustness of these algorithms is relatively low, especially when dealing with datasets from different domains, such as different machines. In this paper, we propose a domain-free semi-supervised deep learning algorithm to improve the model robustness between different machines. Two domain-free factors (the shape of the myocardium and the motion tendency between adjacent frames) are adopted. Specifically, an optical flow field-based segmentation network is proposed for enhancing the performance by combining the motion tendency of myocardium between adjacent frames. Moreover, a shape-based semi-supervised adversarial network is presented to utilize the shape of the myocardium to improve the segmentation robustness. Experiments on our private and public datasets show that the proposed method not only improves the segmentation performance, but also decreases the performance gap when applied to different machines, thus demonstrating the effectiveness of the proposed method.

Wenming Song, Xing An, Ting Liu, Yanbo Liu, Lei Yu, Jian Wang, Yuxiao Zhang, Lei Li, Longfei Cong, Lei Zhu
Self-training with Domain-Mixed Data for Few-Shot Domain Adaptation in Medical Image Segmentation Tasks

Deep learning has shown significant progress in medical image analysis tasks such as semantic segmentation. However, deep learning models typically require large amounts of annotated data to achieve high accuracy, which is often a limiting factor in medical applications where labeled data is scarce. Few-shot domain adaptation (FSDA) is one approach to address this problem. It adapts a model trained on a source domain to a target domain that contains only a few labeled samples. In this paper, we present an FSDA method adapting pre-trained models to the target domain via domain-mixed data in the self-training framework. Our network follows the traditional encoder-decoder structure, which consists of a Transformer encoder, a DeeplabV3+ decoder for segmentation tasks and an auxiliary decoder for boundary-supervised learning. Our approach fine-tunes the source-domain pre-trained model with a few labeled examples from the target domain together with unlabeled target domain data. We evaluate our method on two commonly used publicly available datasets for optic disc/cup and polyp segmentation, and show that it outperforms other state-of-the-art FSDA methods with only 5 labeled examples in the target domain. Overall, our FSDA method shows promising results and has potential to be applied to other medical imaging tasks with limited labeled data in the target domain.

Yongze Wang, Maurice Pagnucco, Yang Song
Bridging the Task Barriers: Online Knowledge Distillation Across Tasks for Semi-supervised Mediastinal Segmentation in CT

Segmentation of the mediastinal vasculature in computed tomography (CT) enables automated extraction of important biomarkers for cardiopulmonary disease characterization and outcome prediction. However, the limited contrast between blood and surrounding soft tissue makes manual segmentation of mediastinal structures challenging in non-contrast CT (NCCT) images, resulting in limited annotations for training deep learning models. To overcome this challenge, we propose a semi-supervised mediastinal vasculature segmentation method that utilizes knowledge distillation from unlabeled training data of contrast-enhanced dual-energy CT to achieve segmentation of the main pulmonary artery, main pulmonary veins, and aorta in NCCT. Our framework incorporates multitask learning with attention feature fusion bridges for online knowledge transfer from a related image-to-image translation task to the target segmentation task. Experimental evaluations demonstrate superior segmentation accuracy of our approach compared to fully supervised methods as well as two sequential approaches that do not leverage distillation between tasks. The proposed approach achieves a Dice similarity coefficient of 0.871 for the main pulmonary artery, 0.920 for the aorta, and 0.824 for the main pulmonary veins. By leveraging a large dataset without annotations through multitask learning and knowledge distillation, our approach improves performance in the target task of mediastinal segmentation with limited annotated training data.
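
The Dice similarity coefficient reported above measures overlap between a predicted mask and a reference mask. A minimal sketch follows, assuming binary masks flattened into equal-length lists; the function name `dice_coefficient` and the toy masks are illustrative and not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.

    Both masks are flat sequences of 0/1 values of the same length.
    """
    assert len(pred) == len(target), "masks must have the same length"
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match by convention here.
    return 2.0 * intersection / size if size else 1.0


# Toy masks: 3 foreground voxels each, 2 of them shared.
toy_dice = dice_coefficient([1, 1, 1, 0], [0, 1, 1, 1])
```

For the toy masks the score is 2 * 2 / (3 + 3), about 0.667; a value such as the 0.920 reported for the aorta therefore indicates a much closer overlap.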

Muhammad F. A. Chaudhary, Seyed Soheil Hosseini, R. Graham Barr, Joseph M. Reinhardt, Eric A. Hoffman, Sarah E. Gerard
RelationalUNet for Image Segmentation

Medical image segmentation is one of the most classic applications of machine learning in healthcare. A variety of Deep Learning approaches, mostly based on Convolutional Neural Networks (CNNs), have been proposed to this end. In particular, U-shaped networks (UNets) have emerged to exhibit superior performance for medical image segmentation. However, some properties of CNNs, such as the stationary kernels, may limit them from capturing more in-depth visual and spatial relations. The recent success of transformers in both language and vision has motivated dynamic feature transforms. We propose RelationalUNet, which introduces relational feature transformation to the UNet architecture. RelationalUNet models the dynamics between visual and depth dimensions of a 3D medical image by introducing Relational Self-Attention blocks in skip connections. As the architecture is mainly intended for the semantic segmentation of 3D medical images, we aim to learn their long-range depth relations. Our method was validated on the Multi-Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation. Robustness to distribution shifts is a particular challenge in safety-critical applications such as medical imaging. We further test our model performance on realistic distributional shifts on the Shifts 2.0 White Matter Multiple Sclerosis Lesion Segmentation dataset. Experiments show that our architecture leads to competitive performance. The code is available at .

Ivaxi Sheth, Pedro H. M. Braga, Shivakanth Sujit, Sahar Dastani, Samira Ebrahimi Kahou
Interpretability-Guided Data Augmentation for Robust Segmentation in Multi-centre Colonoscopy Data

Multi-centre colonoscopy images from various medical centres exhibit distinct complicating factors and overlays that impact the image content, contingent on the specific acquisition centre. Existing Deep Segmentation networks struggle to achieve adequate generalizability in such data sets, and the currently available data augmentation methods do not effectively address these sources of data variability. As a solution, we introduce an innovative data augmentation approach centred on interpretability saliency maps, aimed at enhancing the generalizability of Deep Learning models within the realm of multi-centre colonoscopy image segmentation. The proposed augmentation technique demonstrates increased robustness across different segmentation models and domains. Thorough testing on a publicly available multi-centre dataset for polyp detection demonstrates the effectiveness and versatility of our approach, which is observed both in quantitative and qualitative results. The code is publicly available at: .

Valentina Corbetta, Regina Beets-Tan, Wilson Silva
Improving Automated Prostate Cancer Detection and Classification Accuracy with Multi-scale Cancer Information

Automated detection of prostate cancer via multi-parametric Magnetic Resonance Imaging (mp-MRI) could help radiologists in the detection and localization of cancer. Several existing deep learning-based prostate cancer detection methods have high cancer detection sensitivity but suffer from high rates of false positives and misclassification between indolent (Gleason Pattern = 3) and aggressive (Gleason Pattern ≥ 4) cancer. In this work, we propose a multi-scale Decision Prediction Module (DPM), a novel lightweight false-positive reduction module that can be added to cancer detection models to reduce false positives, while maintaining high sensitivity. The module guides pixel-level predictions with local context information inferred from multi-resolution coarse labels, which are derived from ground truth pixel-level labels with patch-wise calculation. The coarse label resolution varies from a quarter size and 16 times smaller, to a single label for the whole slice, indicating that the slice is normal, indolent, or aggressive. We also propose a novel multi-scale decision loss that supervises cancer prediction at each resolution. Evaluated on an internal test set of 56 studies, our proposed model, DecNet, which adds the DPM and multi-scale loss to the baseline model SPCNet, significantly increases precision from 0.49 to 0.63 (p ≤ 0.005 in paired t-test) while keeping the same level of sensitivity (0.90) for clinically significant cancer predictions. Our model also significantly outperforms U-Net in sensitivity and Dice coefficient (p ≤ 0.05 and p ≤ 0.005, respectively). As shown in the appendix, a similar trend was found when validating with an external dataset containing multi-vendor MRI exams. An ablation study on different label resolutions of the DPM shows that decision loss at all three scales achieves the best performance.
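
One way to picture coarse labels derived patch-wise from a pixel-level label map is sketched below. The abstract does not specify the patch-wise calculation, so the rule used here (each patch takes the most severe class it contains) is an assumption made for illustration, as are the class encoding (0 = normal, 1 = indolent, 2 = aggressive) and the function name `coarse_labels`.

```python
def coarse_labels(label_map, patch):
    """Downsample a 2D label map by taking the maximum label in each patch.

    Assumes classes are ordered by severity (0 normal, 1 indolent,
    2 aggressive) and that both dimensions are divisible by `patch`.
    This max rule is a hypothetical stand-in, not the paper's definition.
    """
    h, w = len(label_map), len(label_map[0])
    assert h % patch == 0 and w % patch == 0, "size must be divisible by patch"
    return [
        [
            max(
                label_map[i + di][j + dj]
                for di in range(patch)
                for dj in range(patch)
            )
            for j in range(0, w, patch)
        ]
        for i in range(0, h, patch)
    ]


# Toy 4x4 slice with one indolent pixel and a small aggressive region.
pixel_labels = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 0],
]
half_resolution = coarse_labels(pixel_labels, 2)  # one label per 2x2 patch
slice_label = coarse_labels(pixel_labels, 4)      # one label for the slice
```

Varying `patch` gives the range of resolutions described in the abstract, from finer grids down to a single label for the whole slice.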

Cynthia Xinran Li, Indrani Bhattacharya, Sulaiman Vesal, Sara Saunders, Simon John Christoph Soerensen, Richard E. Fan, Geoffrey A. Sonn, Mirabela Rusu
Skin Lesion Segmentation Improved by Transformer-Based Networks with Inter-scale Dependency Modeling

Melanoma, a dangerous type of skin cancer resulting from abnormal skin cell growth, can be treated if detected early. Various approaches using Fully Convolutional Networks (FCNs), with the U-Net architecture being prominent, have been proposed to aid in its diagnosis through automatic skin lesion segmentation. However, the symmetrical U-Net model’s reliance on convolutional operations hinders its ability to capture long-range dependencies crucial for accurate medical image segmentation. Several Transformer-based U-Net topologies have recently been created to overcome this limitation by replacing CNN blocks with different Transformer modules to capture local and global representations. Furthermore, the U-shaped structure is hampered by semantic gaps between the encoder and decoder. This study intends to increase the network’s feature re-usability by carefully building the skip connection path. Integrating an already calculated attention affinity within the skip connection path improves the typical concatenation process utilized in the conventional skip connection path. As a result, we propose a U-shaped hierarchical Transformer-based structure for skin lesion segmentation and an Inter-scale Context Fusion (ISCF) method that uses attention correlations in each stage of the encoder to adaptively combine the contexts from each stage to mitigate semantic gaps. The findings from two skin lesion segmentation benchmarks support the ISCF module’s applicability and effectiveness. The code is publicly available at .

Sania Eskandari, Janet Lumpp, Luis Sanchez Giraldo
MagNET: Modality-Agnostic Network for Brain Tumor Segmentation and Characterization with Missing Modalities

Multiple modalities provide complementary information in medical image segmentation tasks. However, in practice, not all modalities are available during inference. Missing modalities may affect the performance of segmentation and other downstream tasks like genomic biomarker prediction. Previous approaches either attempt a naive fusion of multi-modal features or synthesize missing modalities in the image or feature space. We propose an end-to-end modality-agnostic segmentation network (MagNET) to handle heterogeneous modality combinations, which is also utilized for radiogenomics classification. An attention-based fusion module is designed to generate a modality-agnostic tumor-aware representation. We design an adversarial training strategy to improve the quality of the representation. A missing-modality detector is used as a discriminator to push the encoded feature representation to mimic a full-modality setting. In addition, we introduce a loss function to maximize inter-modal correlations; this helps generate the modality-agnostic representation. MagNET significantly outperforms state-of-the-art segmentation and methylation status prediction methods under missing modality scenarios, as demonstrated on brain tumor datasets.

Aishik Konwer, Chao Chen, Prateek Prasanna
Unsupervised Anomaly Detection in Medical Images Using Masked Diffusion Model

It can be challenging to identify brain MRI anomalies using supervised deep-learning techniques due to anatomical heterogeneity and the requirement for pixel-level labeling. Unsupervised anomaly detection approaches provide an alternative solution by relying only on sample-level labels of healthy brains to generate a desired representation to identify abnormalities at the pixel level. Although generative models are crucial for generating such anatomically consistent representations of healthy brains, accurately generating the intricate anatomy of the human brain remains a challenge. In this study, we present a method called the masked-denoising diffusion probabilistic model (mDDPM), which introduces masking-based regularization to reframe the generation task of diffusion models. Specifically, we introduce Masked Image Modeling (MIM) and Masked Frequency Modeling (MFM) in our self-supervised approach that enables models to learn visual representations from unlabeled data. To the best of our knowledge, this is the first attempt to apply MFM in denoising diffusion probabilistic models (DDPMs) for medical applications. We evaluate our approach on datasets containing tumors and multiple sclerosis lesions and demonstrate the superior performance of our unsupervised method as compared to the existing fully/weakly supervised baselines. Project website: .

Hasan Iqbal, Umar Khalid, Chen Chen, Jing Hua
IA-GCN: Interpretable Attention Based Graph Convolutional Network for Disease Prediction

Interpretability in Graph Convolutional Networks (GCNs) has been explored to some extent in general in computer vision; yet, in the medical domain, it requires further examination. Most of the interpretability approaches for GCNs, especially in the medical domain, focus on interpreting the output of the model in a post-hoc fashion. In this paper, we propose an interpretable attention module (IAM) that explains the relevance of the input features to the classification task on a GNN model. The model uses these interpretations to improve its performance. In a clinical scenario, such a model can assist the clinical experts in better decision-making for diagnosis and treatment planning. The main novelty lies in the IAM, which directly operates on input features. IAM learns the attention for each feature based on the unique interpretability-specific losses. We show the application of our model on two publicly available datasets, Tadpole and the UK Biobank (UKBB). For Tadpole we choose the task of disease classification, and for UKBB, age and sex prediction. Compared to the state-of-the-art, the proposed model achieves an increase in average accuracy of 3.2% for Tadpole, 1.6% for UKBB sex prediction, and 2% for UKBB age prediction. Further, we show exhaustive validation and clinical interpretation of our results.

Anees Kazi, Soroush Farghadani, Iman Aganj, Nassir Navab
Multi-modal Adapter for Medical Vision-and-Language Learning

Recently, medical vision-and-language learning has attracted great attention from biomedical communities. Thanks to the development of large pre-trained models, the performances on these medical multi-modal learning benchmarks have been greatly improved. However, due to the rapid growth of the model size, fully fine-tuning these large pre-trained models has become costly, both in training and in storing such a large number of parameters for each downstream task. Thus, we propose a parameter-efficient transfer learning method named Medical Multi-Modal Adapter (M³AD) to mitigate this problem. We select the state-of-the-art M³AE model as our baseline, which is pre-trained on 30k medical image-text pairs with multiple proxy tasks and has about 340M parameters. To be specific, we first insert general adapters after multi-head attention layers and feed-forward layers in all transformer blocks of M³AE. Then, we specifically design a modality-fusion adapter that adopts multi-head attention mechanisms and we insert them in the cross-modal encoder to enhance the multi-modal interactions. In contrast to full fine-tuning, we freeze most parameters in M³AE and only train these inserted adapters, which are much smaller. Extensive experimental results on three medical visual question answering datasets and one medical multi-modal classification dataset demonstrate the effectiveness of our proposed method, where M³AD achieves competitive performances compared to full fine-tuning with far fewer trainable parameters and lower memory consumption.

Zheng Yu, Yanyuan Qiao, Yutong Xie, Qi Wu
Vector Quantized Multi-modal Guidance for Alzheimer’s Disease Diagnosis Based on Feature Imputation

Magnetic Resonance Imaging (MRI) and positron emission tomography (PET) are the most used imaging modalities for Alzheimer’s disease (AD) diagnosis in clinics. Although PET can better capture AD-specific pathologies than MRI, it is less used compared with MRI due to high cost and radiation exposure. Imputing PET images from MRI is one way to bypass the issue of unavailable PET, but is challenging due to severe ill-posedness. Instead, we propose to directly impute classification-oriented PET features and combine them with real MRI to improve the overall performance of AD diagnosis. In order to more effectively impute PET features, we discretize the feature space by vector quantization and employ a transformer to perform feature transition between MRI and PET. Our model is composed of three stages including codebook generation, mapping construction, and classifier enhancement based on combined features. We employ paired MRI-PET data during training to enhance the performance of MRI data during inference. Experimental results on the ADNI dataset including 1346 subjects show a boost in classification performance of MRI without requiring PET. Our proposed method also outperforms other state-of-the-art data imputation methods.
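
Discretizing a feature space by vector quantization amounts to replacing each continuous feature vector with its nearest entry in a codebook. The sketch below shows only that lookup step, under stated assumptions: the function name `quantize`, the squared Euclidean distance, and the tiny hand-written codebook are illustrative choices, and how the authors learn their codebook is not described in the abstract.

```python
def quantize(feature, codebook):
    """Return (index, code vector) of the codebook entry nearest to `feature`.

    Nearness is measured with squared Euclidean distance; ties go to the
    entry with the lowest index.
    """
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(
        range(len(codebook)),
        key=lambda k: squared_distance(feature, codebook[k]),
    )
    return index, codebook[index]


# Hypothetical 2D codebook with three entries.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
code_index, code_vector = quantize([0.9, 1.2], toy_codebook)
```

After this step each feature is represented by a discrete index, which is the kind of token sequence a transformer can map from one modality to another.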

Yuanwang Zhang, Kaicong Sun, Yuxiao Liu, Zaixin Ou, Dinggang Shen
Finding-Aware Anatomical Tokens for Chest X-Ray Automated Reporting

The task of radiology reporting comprises describing and interpreting the medical findings in radiographic images, including description of their location and appearance. Automated approaches to radiology reporting require the image to be encoded into a suitable token representation for input to the language model. Previous methods commonly use convolutional neural networks to encode an image into a series of image-level feature map representations. However, the generated reports often exhibit realistic style but imperfect accuracy. Inspired by recent works for image captioning in the general domain in which each visual token corresponds to an object detected in an image, we investigate whether using local tokens corresponding to anatomical structures can improve the quality of the generated reports. We introduce a novel adaptation of Faster R-CNN in which finding detection is performed for the candidate bounding boxes extracted during anatomical structure localisation. We use the resulting bounding box feature representations as our set of finding-aware anatomical tokens. This encourages the extracted anatomical tokens to be informative about the findings they contain (required for the final task of radiology reporting). Evaluating on the MIMIC-CXR dataset [12, 16, 17] of chest X-Ray images, we show that task-aware anatomical tokens give state-of-the-art performance when integrated into an automated reporting pipeline, yielding generated reports with improved clinical accuracy.

Francesco Dalla Serra, Chaoyang Wang, Fani Deligianni, Jeffrey Dalton, Alison Q. O’Neil
Dual-Stream Model with Brain Metrics and Images for MRI-Based Fetal Brain Age Estimation

The disparity between chronological age and estimated brain age from images is a significant indicator of abnormalities in brain development. However, MRI-based brain age estimation still encounters considerable challenges due to the unpredictable movement of the fetus and maternal abdominal motions, leading to fetal brain MRI scans of extremely low quality. In this work, we propose a novel deep learning-based dual-stream fetal brain age estimation framework, involving brain metrics and images. Given a stack of MRI data, we first locate and segment out the brain regions of every slice. Since brain metrics are highly correlated with age, we introduce four brain metrics into the model. To enhance the representational capacity of these metrics in space, we design them as vector-based discrete spatial metrics (DSM). Then we design the 3D-FetalNet and DSM-Encoder to extract visual and metric features, respectively. Additionally, we apply global and local regression to enable the model to learn various patterns across different age ranges. We evaluate our model on a fetal brain MRI dataset with 238 subjects and reach an age estimation error of 0.75 weeks. Our proposed method achieves state-of-the-art results compared with other models.

Shengxian Chen, Xin Zhang, Ruiyan Fang, Wenhao Zhang, He Zhang, Chaoxiang Yang, Gang Li
PECon: Contrastive Pretraining to Enhance Feature Alignment Between CT and EHR Data for Improved Pulmonary Embolism Diagnosis

Previous deep learning efforts have focused on improving the performance of Pulmonary Embolism (PE) diagnosis from Computed Tomography (CT) scans using Convolutional Neural Networks (CNN). However, the features from CT scans alone are not always sufficient for the diagnosis of PE. CT scans along with electronic health records (EHR) can provide a better insight into the patient’s condition and can lead to more accurate PE diagnosis. In this paper, we propose Pulmonary Embolism Detection using Contrastive Learning (PECon), a supervised contrastive pretraining strategy that employs both the patient’s CT scans and the EHR data, aiming to enhance the alignment of feature representations between the two modalities and leverage information to improve the PE diagnosis. In order to achieve this, we make use of the class labels and pull the sample features of the same class together, while pushing away those of the other class. Results show that the proposed work outperforms the existing techniques and achieves state-of-the-art performance on the RadFusion dataset with an F1-score of 0.913, accuracy of 0.90, and an AUROC of 0.943. Furthermore, we also explore the explainability of our approach in comparison to other methods. Our code is publicly available at .

Santosh Sanjeev, Salwa K. Al Khatib, Mai A. Shaaban, Ibrahim Almakky, Vijay Ram Papineni, Mohammad Yaqub
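
The idea of pulling same-class features together while pushing other-class features away, described in the abstract above, can be sketched with a generic supervised contrastive loss. This is not the PECon code: the function names, the temperature value of 0.1, the cosine-similarity formulation, and the toy embeddings are assumptions chosen for illustration. In a multimodal setting the embeddings could come from either the CT or the EHR encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss averaged over anchors with positives.

    For each anchor, every other sample with the same label is a positive;
    all remaining samples act as negatives in the softmax denominator.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical toy embeddings: the first two and the last two form clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

loss_aligned = supervised_contrastive_loss(embeddings, [1, 1, 0, 0])
loss_misaligned = supervised_contrastive_loss(embeddings, [1, 0, 1, 0])
print(loss_aligned < loss_misaligned)
```

The loss is small when samples that share a label are already close in embedding space and large when they are not, so minimizing it encourages the class-wise clustering the abstract describes.
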
Exploring the Transfer Learning Capabilities of CLIP in Domain Generalization for Diabetic Retinopathy

Diabetic Retinopathy (DR), a leading cause of vision impairment, requires early detection and treatment. Developing robust AI models for DR classification holds substantial potential, but a key challenge is ensuring their generalization in unfamiliar domains with varying data distributions. To address this, our paper investigates cross-domain generalization, also known as domain generalization (DG), within the context of DR classification. DG, a challenging problem in the medical domain, is complicated by the difficulty of gathering labeled data across different domains, such as patient demographics and disease stages. Some recent studies have shown the effectiveness of using CLIP to handle the DG problem in natural images. In this study, we investigate CLIP’s transfer learning capabilities and its potential for cross-domain generalization in diabetic retinopathy (DR) classification. We carry out comprehensive experiments to assess the efficacy and potential of CLIP in addressing DG for DR classification. Further, we introduce a multi-modal fine-tuning strategy named Context Optimization with Learnable Visual Tokens (CoOpLVT), which enhances context optimization by conditioning on visual features. Our findings demonstrate that the proposed method increases the F1-score by 1.8% over the baseline, thus underlining its promise for effective DG in DR classification. Our code is publicly available at .

Sanoojan Baliah, Fadillah A. Maani, Santosh Sanjeev, Muhammad Haris Khan
More from Less: Self-supervised Knowledge Distillation for Routine Histopathology Data

Medical imaging technologies are generating increasingly large amounts of high-quality, information-dense data. Despite the progress, practical use of advanced imaging technologies for research and diagnosis remains limited by cost and availability, so more information-sparse data such as H&E stains are relied on in practice. The study of diseased tissue would greatly benefit from methods which can leverage these information-dense data to extract more value from routine, information-sparse data. Using self-supervised learning (SSL), we demonstrate that it is possible to distil knowledge during training from information-dense data into models which only require information-sparse data for inference. This improves downstream classification accuracy on information-sparse data, making it comparable with the fully-supervised baseline. We find substantial effects on the learned representations, and pairing with relevant data can be used to extract desirable features without the arduous process of manual labelling. This approach enables the design of models which require only routine images, but contain insights from state-of-the-art data, allowing better use of the available resources.

Lucas Farndale, Robert Insall, Ke Yuan
Tailoring Large Language Models to Radiology: A Preliminary Approach to LLM Adaptation for a Highly Specialized Domain

In this preliminary work, we present a domain fine-tuned LLM for radiology, an experimental large language model adapted to this field. This model, created through an exploratory application of instruction tuning on a comprehensive dataset of radiological information, demonstrates promising performance when compared with broader language models such as StableLM, Dolly, and LLaMA. This model exhibits initial versatility in applications related to radiological diagnosis, research, and communication. Our work contributes an early but encouraging step towards the evolution of clinical NLP by implementing a large language model that is local and domain-specific, conforming to stringent privacy norms like HIPAA. The idea of creating customized, large-scale language models that cater to the distinct requirements of various medical specialties presents a thought-provoking direction. The blending of conversational prowess and specific domain knowledge in these models kindles hope for future enhancements in healthcare AI. While it is still in its early stages, the potential of generative large language models is intriguing and worthy of further exploration. The demonstration code of our domain fine-tuned LLM for radiology can be accessed at .

Zhengliang Liu, Aoxiao Zhong, Yiwei Li, Longtao Yang, Chao Ju, Zihao Wu, Chong Ma, Peng Shu, Cheng Chen, Sekeun Kim, Haixing Dai, Lin Zhao, Dajiang Zhu, Jun Liu, Wei Liu, Dinggang Shen, Quanzheng Li, Tianming Liu, Xiang Li
Machine Learning in Medical Imaging
Xiaohuan Cao
Xuanang Xu
Islem Rekik
Zhiming Cui
Xi Ouyang