
Open Access 04.03.2024 | Original Article

Cytopathology image analysis method based on high-resolution medical representation learning in medical decision-making system

Authors: Baotian Li, Feng Liu, Baolong Lv, Yongjun Zhang, Fangfang Gou, Jia Wu

Published in: Complex & Intelligent Systems


Abstract

Artificial intelligence has made substantial progress in many medical application scenarios. The quantity and complexity of pathology images are enormous, and conventional visual screening is labor-intensive, time-consuming, and subject to a degree of subjectivity. Artificial intelligence image analysis can convert complex pathological data into mineable image features, enabling medical professionals to quickly and quantitatively identify regions of interest and extract information about cellular tissue. In this study, we designed a medical information assistance system for segmenting pathology images and quantifying the results, comprising data augmentation, cell nucleus segmentation, tumor modeling, and quantitative analysis. For cell nucleus segmentation, to address the problem of unevenly distributed healthcare resources, we designed a high-precision teacher model (HRMED_T) and a lightweight student model (HRMED_S). The HRMED_T model is based on the vision Transformer and high-resolution representation learning. It achieves accurate segmentation by iteratively fusing parallel low-resolution convolution streams with high-resolution features while maintaining a high-resolution representation throughout. The HRMED_S model uses Channel-wise Knowledge Distillation to simplify the structure and converge faster, and refines the segmentation results with a conditional random field in place of the fully connected output layer. Experimental results show that our system outperforms other methods: the Intersection over Union (IoU) of the HRMED_T model reaches 0.756, while the HRMED_S model reaches an IoU of 0.710 with only 3.99 M parameters.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

With the continuous integration of computer science and medical informatics, artificial intelligence has achieved substantial success in many medical scenarios, such as assisting in diagnosis, informing treatment decisions, predicting risks, and reducing medical errors [1–3]. AI can complete medical tasks with speed and accuracy, including the analysis of cardiology images [4], the diagnosis of eye diseases from optical coherence tomography [5], and the determination of bone age from X-rays [6]. Artificial intelligence is also being applied directly to difficult diagnostic problems such as left ventricular hypertrophy [7] and prostate cancer [8]. For the medical issues that artificial intelligence cannot yet resolve on its own, numerous auxiliary diagnostic techniques can still make medical professionals' work easier [9, 10]. Artificial intelligence can help medical workers save time and focus on more complex medical problems. Compared with traditional manual diagnosis, artificial intelligence-assisted diagnosis offers higher repeatability, objectivity, and real-time performance. In addition, artificial intelligence does not tire and can be deployed in many places at once; even remote areas can benefit from medical information support systems developed by top hospitals, helping to alleviate the problem of limited medical resources in developing countries [11].
Artificial intelligence-assisted segmentation of pathological images of tumors can significantly enhance the efficiency of medical practitioners in quantifying various cellular indicators, visualizing organelle morphologies, locating regions of interest (ROI), and supporting the customization of surgical plans [12–14]. To a certain extent, it can also alleviate the issue of untimely patient diagnosis in developing countries caused by a shortage of pathologists.
Investigations into pathological image segmentation methods fall primarily into two divisions: conventional machine learning techniques and deep learning approaches. The former relies heavily on extensive manual feature selection grounded in specialized knowledge, while the latter's end-to-end learning reduces dependence on such knowledge. Deep learning models such as CNNs [15], encoder-decoders [16], and Generative Adversarial Networks (GANs) [17] have developed rapidly in recent years. However, as models strive to learn increasingly complex features, they become larger and more computationally demanding. Model compression techniques such as knowledge distillation (KD) have emerged to address these challenges and bring cutting-edge models to regions or devices with limited computational resources [18]. In addition, in medical decision-making systems, appropriate event-triggering mechanisms not only improve data transmission performance but also minimize computational and storage resource consumption [19]. For example, Markov-based control systems and reinforcement learning have been applied to robotic systems, biological systems, and so on [20, 21].
We have summarized the potential problems that we perceive to exist in both traditional manual discrimination of pathological images and intelligent segmentation of pathological images as follows:
(1)
The cost of pixel-level labeling in pathological images is extremely high. For example, the TCGA Kumar dataset comprises merely thirty 1000 × 1000 pathological images at 40 × magnification [22], yet the number of annotated nuclei surpasses 20,000; such specialist costs are unfeasible in economically disadvantaged developing countries;
 
(2)
Given the large number of pathology sections and the ultra-high resolution of each image, manually screening and processing them is labor-intensive and time-consuming. The vast majority of each image is just background; the average cell area in our dataset is only 8.29% of the background area;
 
(3)
Cell nucleus segmentation models with high accuracy and strong generalization properties are often cumbersome. This means that the deployment of the models requires significant computational resources. Developing countries are not well equipped medically to enable the use of such equipment;
 
(4)
The number of pathologists in developing countries is low and the distribution of healthcare resources is often highly uneven [9, 14]. In China, for example, more than 80% of medical resources are concentrated in cities with only 10% of the population. This results in the majority of patients in regions with scarce resources and outdated equipment facing difficulties in receiving timely diagnoses and treatments in the early stages of disease development;
 
(5)
Because of the variability and complexity of pathology images, tumor diagnosis carries a degree of subjectivity that depends on the doctor's experience, for example in judging the geometric, texture, and shape features of tumor cells and their complexity. This may increase the misdiagnosis rate for inexperienced doctors and reduce the reproducibility of analysis results. How to quickly and effectively help doctors extract features objectively is therefore an urgent problem.
In summary, we have designed a medical information aid system for the segmentation and quantitative analysis of cell nuclei in pathology images. The system comprises data preprocessing, cell nucleus segmentation, tumor modeling, and quantitative analysis. Taking high-resolution medical representation learning as the backbone network, we designed a high-precision teacher model (HRMED_T). The HRMED_T model avoids the cost and parameter explosion of a conventional Transformer through a cross-shaped window design, which allows the Transformer to replace the convolutional structure. Meanwhile, given the limited computational resources in many regions, we compress the model by knowledge distillation and design a lightweight student model (HRMED_S). The HRMED_S model simplifies the hierarchical structure and uses conditional random fields in place of the original fully connected layer to obtain more contextual information about cells. The HRMED_S model has only 4.5% of the computation and 7.7% of the parameters of the HRMED_T model, providing better real-time performance. Our system thus offers options for regions with different computational resources. In addition, we have designed a quantitative analysis tool with which physicians can quickly analyze images: they can visually identify cell regions, obtain data such as cell eccentricity, and quickly localize regions of interest. The main contributions of this work are as follows:
(1)
To maximize the use of limited data and prevent overfitting due to the small sample size, we established a data augmentation pipeline for images of malignant tumor pathology. This pipeline included basic augmentation operations and four color treatments, namely random brightness, saturation, hue, and contrast enhancement; as a result, the model's robustness and generalization ability were enhanced;
 
(2)
Based on high-resolution representation learning and the Transformer, a high-precision teacher model (HRMED_T) was developed that allows the Transformer to replace the convolutional structure through cross-shaped windows. The model achieves accurate segmentation by repeatedly fusing parallel low-resolution (LR) convolution streams with high-resolution features while maintaining the high-resolution (HR) representation. In addition, the HRMED_T model introduces a unified focal loss function to address class imbalance.
 
(3)
A lightweight student model (HRMED_S) is built. It refines the segmentation results using a conditional random field structure and transfers knowledge from the HRMED_T model using the Channel-wise Knowledge Distillation algorithm. With faster training, the HRMED_S model achieves relatively good prediction accuracy with far fewer parameters. This is particularly effective for areas with limited healthcare resources, which can rely on large hospitals to train the model and then deploy it locally.
 
(4)
We validated the feasibility of our system on 2,164 pathological images of sarcoma that we generated, and multiple comparative experiments showed that both models within our system can perform accurate segmentation of the pathological images. Finally, we also designed a segmentation result processing module to assist doctors in quickly quantifying various cellular indices, thus enabling efficient and accurate diagnosis.
 
Related work

With the advancement of information technology and computer hardware, AI is continuously empowering various industries, and the combination of AI and medical diagnosis is one of the current hot topics [23–25].
In the field of image segmentation, a plethora of segmentation networks have emerged since the advent of neural networks. Wang et al. [26] proposed High-Resolution Net (HRNet), a novel CNN-based network that preserves high-resolution representations throughout training and leverages parallel low-resolution convolutions and multi-scale image fusion, yielding a more refined and spatially accurate segmentation outcome. Yutong Xie et al. [27] proposed CoTr, a 3D segmentation network that uses a CNN encoder-decoder to extract features; because a pure CNN structure cannot model long-range dependencies, the authors added a Transformer to solve the problem, which in turn increased the number of parameters. Hsiang-Yu Han et al. [28] proposed a deep CNN for real-time semantic segmentation that improves inference speed with a class-aware edge loss; the results show good real-time performance, but the method struggles to capture global information. Xiang Li et al. [29] proposed a convolutional neural network incorporating an attention mechanism, which replaces the traditional one-region-at-a-time focus of convolutional layers by establishing connections in the intermediate feature layers, thus improving the ability to capture global information.
In recent research on image segmentation, the Vision Transformer (ViT) has started to show unique advantages and is beginning to challenge the dominance of CNNs [30, 31]. Many CNN networks have been improved with ViT structures, such as the High-Resolution Transformer (HRFormer) proposed by Yuan et al. [32], a ViT-based improvement of HRNet. Through a multi-resolution parallel design and local window self-attention, it obtains globally aggregated features even in the shallow layers and greatly improves segmentation accuracy. Gu et al. [33] likewise proposed the High-Resolution Vision Transformer (HRViT), which fuses a high-resolution multi-branch architecture with ViT to address the poor performance of ViT on dense prediction tasks; HRViT improves accuracy and efficiency by reducing the redundancy of heterogeneous branches and enhancing the attention module. Although HRViT is more accurate than traditional CNN segmentation networks, its parameter count is still very large and the model targets natural-scene images; for complex pathology image segmentation, its performance still falls short of expectations. To minimize parameters while maintaining accurate segmentation, researchers have turned their attention to model distillation techniques.
Knowledge distillation conveys the insights of a robust but massive teacher model to a compact, lighter student model so that it can be deployed on devices with restricted computational capabilities. Yifan Liu et al. [34] designed a structured KD scheme that transfers pairwise similarity through pairwise distillation modules and holistic information through holistic distillation modules; however, this method adapts poorly to the complex background variation in pathology images, and its computational demand remains high. Dian Qin et al. [35] designed a distillation strategy that transfers information from existing models to lightweight student networks via a region affinity module, but the method still faces memory capacity problems in medical image processing. Yuannan Hou et al. [36] proposed a point-level to voxel-level knowledge distillation method (PVD) that improves the student model's feature extraction using similarity information between points and voxels; the results show lower resource consumption and faster training in the radar domain.
Due to the complexity of digital pathology images, existing general-purpose segmentation networks are difficult to apply directly to cytopathology images, so researchers have begun to focus on automatic cell nucleus segmentation techniques. Lukasz Roszkowiak et al. [37] proposed a clustering segmentation algorithm based on distance transformation; for aggregated cell nuclei, the algorithm divides the aggregated region into smaller, shape-based regions and segments them separately, reducing over-segmentation of cell nuclei. Similarly, to address overlapping cells, Qingbo Kang et al. [38] designed a two-stage nucleus segmentation framework with deep layer aggregation (DLA): the first stage generates coarse nucleus-boundary segmentation based on DLA, and the second stage refines it with a shallow U-Net, decomposing a complex task into subtasks. Jinjie Huang et al. [39] proposed a cervical cell clump segmentation method based on a multi-scale fuzzy clustering algorithm, consisting of three steps: separating the cell region from the background, extracting nodes of interest, and segmenting cell nuclei; the method achieves high accuracy on cervical sections. However, the parameter counts of such pathology image methods are still large, which makes it difficult to meet training speed requirements and to deploy them in areas with insufficient resources.
A comparative analysis of related work is shown in Table 1. The complexity of tumor cell nuclei and the limited medical resources of developing countries pose a challenge for automatic image segmentation. Existing large models sacrifice processing time and complexity to achieve high prediction accuracy, whereas lightweight models improve training speed but struggle to achieve the expected results on digital pathology images and are difficult to apply directly. We therefore design a high-precision segmentation network for malignant tumor pathology images based on high-resolution representation learning, together with a lightweight network with relatively low hardware requirements. Our method is well suited to malignant tumor diagnosis scenarios, especially in regions with insufficient medical resources.
Table 1
Summary analysis of related work

Classification | Literature | Dominance | Gaps
CNN-based | [26] | High spatial accuracy | Relatively high complexity of the network
CNN-based | [27] | Solves the long-range dependency problem of CNNs | —
CNN-based | [28] | Better performance in real-time segmentation | Difficulty in obtaining full access to global information
CNN-based | [29] | Improved ability to capture global information | Slower training
ViT-based | [30] | Good performance in remote sensing images | Limited application scenarios
ViT-based | [31] | Reduced need for labeled data | Recognition of small targets is limited
ViT-based | [32] | Precision improvement of segmentation | Model parameters increase too quickly
ViT-based | [33] | Improved performance relative to CNN-based methods | The number of parameters is still very large for realistic scenarios
KD-based | [34] | Better performance in the field of compact semantic segmentation networks | Computationally intensive and not applicable to the field of digital pathology images
KD-based | [35] | Introduction of a regional affinity module | —
KD-based | [36] | Lower resource consumption in the radar field | —
Digital pathology image segmentation model | [37] | Reduced excessive segmentation of the nucleus | Higher resource consumption
Digital pathology image segmentation model | [38] | Solved the cell overlap problem | —
Digital pathology image segmentation model | [39] | Suitable for cervical sectioning | —

The first three groups are classified by backbone network; the last group comprises classical algorithms for digital pathology images

System model design

Digital pathology image analysis is of great value for tumor diagnosis [40]. Because of the huge number and complexity of pathology images, traditional visual screening methods are not only time-consuming and labor-intensive but also subject to subjective variability. Through image segmentation, complex pathological data can be distilled into mineable image features, which help doctors quickly and quantitatively extract information about cell tissue [41–44]. Many models exist for digital pathology image segmentation. Facing the complexity of overlapping and clustering cell nuclei, many networks keep adding layers to achieve higher prediction accuracy, resulting in high complexity and slow training [45], while lightweight networks with fast training have relatively low accuracy [29]. We therefore designed an artificial intelligence medical information-assisted diagnosis system for digital pathology images, as shown in Fig. 1. It aims to assist doctors in diagnosing cytopathology images, improve the efficiency and quality of treatment while reducing medical costs, and provide practical and effective help to both doctors and patients.
Our intelligent diagnostic support system comprises four modules. First, the input image passes through a data augmentation pipeline incorporating several fundamental augmentation techniques and four color-targeted augmentations. Second, the image is fed into the teacher model HRMED_T for training, yielding a segmentation result. Third, after the teacher model has been trained, knowledge is transferred to a lightweight student network through knowledge distillation, a model compression technique. Fourth, regions with different medical resources can choose either model to output the segmentation result and perform quantitative analysis of the pathology image.
Throughout the system's models, the processed image dimensions of width, height, and number of channels are denoted \(W\times H\times C\), and a projection matrix or feature map during network training is denoted \(M\). Loss functions are denoted \(L\). The symbols used in this chapter and their meanings are listed in Table 2.
Table 2
Symbols and their interpretation

Symbol | Paraphrase
\(W\times H\times C\) | The dimensions (width and height) and number of channels of the processed pathological images
\(\theta \) | A custom parameter used to control the relative weight of positive and negative examples, \(\theta \in [0,1]\)
\(p\) | The pixel value of a point in a pathological image
\(\mu \) | Threshold for randomized contrast enhancement
\(M\) | A projection matrix or feature map during network training
\(e\) | Eccentricity of the nucleus
\({D}_{m}\) | Data matrix
\(\mathcal{l}\) | Customized block list
\(\mathcal{M}\) | The initial network architecture
\(\mathcal{N}\) | Initial student network architecture
\(\mathcal{y}\) | The label matrix
\(vl\) | Customized ViT list
\(\tau \) | A hyperparameter used to suppress the background class, \(\tau \in [0,1]\)
\(\mu \) | A hyperparameter used to determine the weight of the two loss functions, \(\mu \in [0,1]\)
\({L}_{maf}\) | Asymmetric focal loss for minority class enhancement
\({T}_{i,j}\) | The relationship between point \({x}_{i}\) and the features of the surrounding points
\({L}_{amF}\) | The combined Medical Focal Loss with unified parameters
\(\Phi \) | The difference between the feature maps of HRMED_T and HRMED_S
\(T\) | The distillation temperature
\(\widehat{M}\) | The predicted value matrix for different categories
\(y\) | The output of the model's prediction
\(\widehat{y}\) | The ground truth label of the pathological image
\(U({x}_{i})\) | The pixel points around point \({x}_{i}\)
\({L}_{maFT}\) | Modified asymmetric Focal Tversky loss
\({S}_{i}\) | The features of point \({x}_{i}\) itself
\(Z(x)\) | A normalization function

Each row lists a symbol used in this chapter and its corresponding meaning

Image pre-processing module

Pixel-level annotation of pathology images incurs an exorbitantly high cost: our dataset comprises only slightly more than 2000 pathology images at 40 × magnification and 512 × 512 pixels in size, yet the number of annotated cell nuclei reaches hundreds of thousands. To make the most of limited and expensive annotated data, we devised a data augmentation pipeline for malignant tumor pathology images that adds diversity to the input data by passing each image through the pipeline before every training iteration.
Considering the rotational invariance of cells and the basic features of random appearance at any position in the image, the input image first undergoes four fundamental data augmentation modules: random rotation, random translation, random cropping, and random flipping. Each module is executed with a fifty-percent chance, and the corresponding label image is also altered. Blank areas resulting from the base transformation are filled with black.
Finally, we use stochastic color enhancement to simulate staining differences arising from variations in the staining process, batch, scanner, and visual noise introduced while processing pathology images. This module is executed with a fifty-percent chance and contains several sub-modules; unless otherwise specified, each sub-module is also executed with a fifty-percent chance. Black areas in the original image are not enhanced, and the label image remains unchanged. Each sub-module is described in detail below.
(a) Random Brightness Augmentation: Cell pathology images often present uneven illumination due to factors such as microscope settings, sample thickness, staining depth, and leakage. To simulate lighting variations, we adopt a method of changing the image brightness for data augmentation.
$$ p_{{1i}} = p_{i} + \gamma $$
(1)
where \({p}_{i}\) denotes the pixel value at any point in the image and \(\gamma \) denotes a random brightness offset, \(\gamma \in [-32, 32]\). The change in brightness shifts the pixel values accordingly.
(b) Image Conversion Module: The image undergoes a conversion from the RGB format to the HSL format shown in Eq. (2). This is an intermediate step and will be executed whenever the color enhancement module is executed.
$$ \left\{ \begin{array}{l} \vartheta = \cos^{-1}\left[ \dfrac{\frac{1}{2}\left[ \left( R-G \right)+\left( R-B \right) \right]}{\sqrt{\left( R-G \right)^{2}+\left( R-B \right)\left( G-B \right)}} \right] \\ H = \left\{ \begin{array}{ll} \vartheta & G \ge B \\ 2\pi - \vartheta & G < B \end{array} \right. \\ L = \dfrac{1}{\sqrt{3}}\left( R+G+B \right) \\ S = 1 - \dfrac{3\min\left( R,G,B \right)}{R+G+B} \end{array} \right. $$
(2)
(c) Random Saturation Augmentation: Due to factors such as varying staining depth, inconsistent dyes, and uneven dye absorption, cell pathology images often have different background and nucleus staining colors among different pathological images. To simulate this situation, we randomly adjust the saturation of colors within a certain range.
$${S}{^\prime}=S+\eta $$
(3)
where \(S\) and \({S}{^\prime}\) are the saturation of the image before and after enhancement, and \(\eta \) is a random saturation-variation parameter, \(\eta \in [0.7, 1.3]\).
(d) Random Hue Augmentation: The reason for performing this data augmentation is the same as the previous one. We adjust the hue value of the image to achieve this.
$$H{^\prime}=\left(H+\upsilon \right)\bmod 100$$
(4)
where \(H\) and \(H{^\prime}\) are the hue values of the image before and after enhancement. The parameter \(\upsilon \) denotes a random hue offset, \(\upsilon \in [-18, 18]\).
(e) The Image Conversion Module: Converts an image from the HLS format back to the RGB format, utilizing the formula described in Eq. (2). It will always be executed whenever the Color Enhancement Module is performed.
(f) Random Contrast Enhancement: To simulate color differences, a random contrast enhancement threshold \(\mu \in [0.5, 1.5]\) is drawn, and each pixel value is scaled as shown in Eq. (5).
$${p}_{2i}={p}_{1i}\times \mu $$
(5)
In the data enhancement pipeline described above, every training image passes through the modules from random rotation to random color enhancement before being fed into the system. Despite the limited quantity of data, this approach ensures that each input image for training is nearly unique.
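To make the color-enhancement stage concrete, the sketch below strings sub-modules (a)–(f) together with OpenCV and NumPy. It is a minimal illustration rather than the authors' code: the function and variable names are ours, the saturation factor is applied multiplicatively (which matches the stated range [0.7, 1.3]), and the hue shift uses OpenCV's [0, 180) hue scale rather than the 0–100 scale of Eq. (4).

```python
import cv2
import numpy as np

def random_color_augment(img, p=0.5, rng=None):
    """Sketch of the stochastic color-enhancement stage (sub-modules (a)-(f)).

    `img` is an RGB uint8 pathology tile; black (padded) pixels are left
    untouched, as described in the pipeline above.
    """
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)
    mask = img.sum(axis=-1, keepdims=True) > 0              # ignore padded black areas

    # (a) random brightness offset, gamma in [-32, 32]
    if rng.random() < p:
        out = out + rng.uniform(-32, 32)

    # (b) convert to HLS so that hue and saturation can be modified
    hls = cv2.cvtColor(np.clip(out, 0, 255).astype(np.uint8),
                       cv2.COLOR_RGB2HLS).astype(np.float32)

    # (c) random saturation factor, eta in [0.7, 1.3]
    if rng.random() < p:
        hls[..., 2] = np.clip(hls[..., 2] * rng.uniform(0.7, 1.3), 0, 255)

    # (d) random hue shift, upsilon in [-18, 18] (OpenCV hue range is [0, 180))
    if rng.random() < p:
        hls[..., 0] = (hls[..., 0] + rng.uniform(-18, 18)) % 180

    # (e) convert back to RGB
    out = cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2RGB).astype(np.float32)

    # (f) random contrast scaling, mu in [0.5, 1.5]
    if rng.random() < p:
        out = out * rng.uniform(0.5, 1.5)

    out = np.clip(out, 0, 255).astype(np.uint8)
    return np.where(mask, out, img)                         # keep black padding unchanged
```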
Additionally, we attempted to remove common noise in cellular pathology images, such as background speckles and interference, differently stained contaminated areas, inconsistent staining depths, and microscope settings, by denoising, but we were unable to find a highly effective denoising method designed specifically for cellular pathology images. We tried denoising methods intended for natural images, but testing showed little improvement. For the variation in staining across batches, we did not adopt a complex stain normalization method, mainly because it would add computational cost, which runs counter to our goal of reducing computation. Instead, we model this noise and these color differences through data augmentation so that the deep learning network learns to distinguish their features.

Image segmentation model

The digital pathology image teacher model we designed is HRMED_T, which draws primarily on the HRViT model [33]. HRMED_T combines a high-resolution network with ViT and can maintain high-resolution feature extraction along the entire network pathway. At the same time, we use a customized attention structure that reduces computational complexity while ensuring that the global contextual information in osteosarcoma pathology images is not ignored.
The overall architecture of the model is illustrated in Fig. 2. Next, let us delve into the individual details of the modules in the network architecture diagram.

MEDAttn

Directly replacing the convolutions in HRNet with a traditional Transformer would lead to an explosion in computational cost and parameter size, so HRMED_T does not do this. Instead, we introduce a cross-shaped window design, as shown in Fig. 3.
Regarding image \(x\in {\mathbb{R}}^{H\times W\times C}\), instead of directly inputting the entire image for attention calculation, the image is first partitioned into two parts based on the number of channels. The upper part performs row attention calculation with a window size of \(s\times W\), while the lower part performs column attention calculation with a window size of \(H\times s\). Within each window, the patch is divided into \(I\) \({d}_{i}\)-dimensional subunits, following which local self-attention is employed. The overall formula can be seen in Eq. (6).
$$\left\{\begin{array}{l}\begin{array}{l}MEDAttn\left(x\right)=BN\left(\phi \left({M}^{o}\left[{y}_{1},\cdots ,{y}_{i},\cdots ,{y}_{I}\right]\right)\right)\\ {y}_{i}={r}_{i}+\mathit{DWConv}\left(\sigma \left({M}_{i}^{V}x\right)\right) \end{array}\\ \left[{r}_{i}^{1},\cdots ,{r}_{i}^{N}\right]={r}_{i}=\left\{\begin{array}{cc}\text{HAttn}_{i}\left(x\right),& 1\le i<\frac{I}{2}\\ \text{VAttn}_{i}\left(x\right),& \frac{I}{2}\le i\le I\end{array}\right. \\ {r}_{i}^{n}=Attn\left({M}_{i}^{Q}{x}^{n},{M}_{i}^{K}{x}^{n},{M}_{i}^{V}{x}^{n}\right) \\ \left[{x}^{1},\cdots ,{x}^{n},\cdots ,{x}^{N}\right]=x,\hspace{1em}{x}^{n}\in {\mathbb{R}}^{\left(\frac{H\times W}{s}\right)\times C} \end{array}\right. $$
(6)
where \({M}_{i}^{Q}\), \({M}_{i}^{K}\), and \({M}_{i}^{V}\) denote the projection matrices that generate \({Q}_{i}\) (query), \({K}_{i}\) (key), and \({V}_{i}\) (value) for the \(i\)-th self-attention head; \({M}^{o}\in {\mathbb{R}}^{C\times C}\) is the final projection matrix, and \(\phi \) is the Hard-Swish activation function. As shown in the second line of Eq. (6) and in Fig. 3, the projected \({V}_{i}\) (value) matrix not only participates in the attention computation but also passes through the Hard-Swish and depth-wise convolution (DWConv) [46] operations before being added to the original attention values. The \({r}_{i}\) in the third line of Eq. (6) is the attention score obtained after calculating the row and column attention of the input image. The attention calculation is given in Eq. (7):
$$Attn\left({M}_{i}^{Q}{x}^{n},{M}_{i}^{K}{x}^{n},{M}_{i}^{V}{x}^{n}\right)=\mathit{softmax}\left(\frac{{Q}_{i}^{n}{\left({K}_{i}^{n}\right)}^{T}}{\sqrt{{d}_{i}}}\right){V}_{i}^{n}$$
(7)
To alleviate the feature collapse that arises as networks deepen, we add an enhanced residual shortcut. Furthermore, since the MEDAttn windows do not span entire rows or columns, zero padding is applied when H or W is not a multiple of s to ensure the completeness of the final window.
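As an illustration of how the cross-shaped window splits the attention computation, the following PyTorch sketch implements only the core row/column windowing and Eq. (7). It assumes PyTorch 2.x for scaled_dot_product_attention, omits the DWConv value path, Hard-Swish projection, and BN of Eq. (6), and uses our own class and parameter names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossWindowAttention(nn.Module):
    """Minimal sketch of the cross-shaped window attention in MEDAttn.

    Half of the channels attend within horizontal strips (windows of size s x W),
    the other half within vertical strips (windows of size H x s). H and W are
    assumed to be multiples of s (the paper zero-pads otherwise).
    """

    def __init__(self, dim, s=3, heads=2):
        super().__init__()
        assert dim % 2 == 0 and (dim // 2) % heads == 0
        self.s, self.heads, self.half = s, heads, dim // 2
        self.qkv_h = nn.Linear(self.half, 3 * self.half)   # horizontal-strip branch
        self.qkv_v = nn.Linear(self.half, 3 * self.half)   # vertical-strip branch
        self.proj = nn.Linear(dim, dim)                     # final projection M^o

    def _attend(self, x, qkv):
        # x: (num_windows, tokens_per_window, half_dim)
        q, k, v = qkv(x).chunk(3, dim=-1)
        split = lambda t: t.unflatten(-1, (self.heads, -1)).transpose(1, 2)
        q, k, v = map(split, (q, k, v))                     # (win, heads, tokens, d_i)
        out = F.scaled_dot_product_attention(q, k, v)       # softmax(QK^T / sqrt(d_i)) V, Eq. (7)
        return out.transpose(1, 2).flatten(-2)              # back to (win, tokens, half_dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        xh, xv = x.split(self.half, dim=-1)
        s = self.s
        # horizontal strips: s consecutive rows form one (s x W) window
        yh = self._attend(xh.reshape(B * H // s, s * W, self.half), self.qkv_h)
        yh = yh.reshape(B, H, W, self.half)
        # vertical strips: s consecutive columns form one (H x s) window
        xv = xv.permute(0, 2, 1, 3).reshape(B * W // s, s * H, self.half)
        yv = self._attend(xv, self.qkv_v).reshape(B, W, H, self.half).permute(0, 2, 1, 3)
        return self.proj(torch.cat([yh, yv], dim=-1))


# Example: a 12 x 12 feature map with 64 channels and window width s = 3.
attn = CrossWindowAttention(dim=64, s=3)
out = attn(torch.randn(2, 12, 12, 64))   # -> (2, 12, 12, 64)
```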

Other detailed structures in HRMED_T

The Stem module is utilized to reduce the computation cost of ViT by downsizing images. The Patch Embed module is employed to integrate feature information from different branches during fusion. The structure of the Mixed-scale convolutional feedforward network (MixCFN) for further extraction of multi-scale features of images is shown in Fig. 4a.
Different-resolution fusion layer: For feature fusion between the \(i\)-th input and the \(j\)-th output (where \(i<j\)) in Fig. 4b, we first apply a DWConv with a kernel size of \({2}^{j-i}+1\) to downsample the feature map, followed by \({2}^{j-i}\times C\) 1 × 1 convolutional kernels to increase the channel dimension. In the up-scaling direction (when \(i>j\)), we use 1 × 1 convolutional kernels to adjust the channel dimension to that of the target branch and then perform nearest-neighbor upsampling to enlarge the feature map by a factor of \({2}^{i-j}\). When \(i=j\), the input and output are directly connected through a skip connection.
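The fusion rule above can be captured in a few lines; the sketch below is our reading of it in PyTorch. The DWConv stride and the explicit target channel count c_out are assumptions (the text does not state the stride, and each branch has its own width), and the function name is hypothetical.

```python
import torch.nn as nn

def fuse_branch(c_in, c_out, i, j):
    """Sketch of one input-branch-i -> output-branch-j path of the fusion layer."""
    if i < j:                                    # higher branch index = lower resolution
        k = 2 ** (j - i) + 1
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, kernel_size=k, stride=2 ** (j - i),
                      padding=k // 2, groups=c_in, bias=False),     # depth-wise downsample
            nn.Conv2d(c_in, c_out, kernel_size=1),                   # 1x1 channel expansion
        )
    if i > j:
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1),                   # 1x1 channel adjustment
            nn.Upsample(scale_factor=2 ** (i - j), mode="nearest"),  # nearest-neighbour upsampling
        )
    return nn.Identity()                          # i == j: plain skip connection
```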
Loss function: In cytopathology image segmentation, the vast majority of images contain only a few cells, while a small number contain a high density of cells (cells occupying over 20% of the image); both types of image are crucial for qualitative cancer diagnosis. To address this, we designed the Medical Focal Loss (MFL) as our loss function, which focuses on maximizing the accuracy of cell segmentation in pathological images. It uses asymmetry to selectively suppress the background and enhance the cell class while requiring only three hyperparameters.
The medical asymmetric Focal loss (\({L}_{maf}\)) is expressed in Eq. (8) [47]. In our binary classification task, compared with the original Focal loss, we simply multiply \({p}_{i}\) with the prediction value of the rare class to achieve selective amplification.
$${L}_{maf}=-\frac{\theta }{N}\sum_{i=1}^{N}{p}_{i}\,log\left({\widehat{M}}_{i,c}\right)-\frac{1-\theta }{N}\sum_{i=1}^{N}{\left(1-{\widehat{M}}_{i,b}\right)}^{\tau }log\left({\widehat{M}}_{i,b}\right)$$
(8)
\(\widehat{M}\) denotes the matrix of predicted values for the different categories. Its indices \(i\), \(c\), and \(b\) are, respectively, the row (pixel) index of the matrix, the index of the minority class (cell), and the index of the majority class (background). \({p}_{i}\) represents the pixel value of each point in the labeled image, and N represents the number of pixels in the image. \(\theta \in [0,1]\) is a custom parameter used to control the relative weight of positive and negative examples, and \(\tau \in [0,1]\) is a hyperparameter used to suppress the background class.
The medical asymmetric Focal Tversky loss (\({L}_{maFT}\)) is given by Eq. (9). In contrast to the original Focal Tversky loss, we apply the focal enhancement only to the minority class (cell) rather than to the majority class (background).
$${L}_{maFT}=\left(1-mT{I}_{b}\right)+{(1-mT{I}_{c})}^{1-\tau }$$
(9)
where \(mTI\) stands for the medical Tversky index; its representation is given by Eq. (10).
$$mTI=\frac{{\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{0i}{p}_{0i}}{{\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{0i}{p}_{0i}+\theta {\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{0i}{p}_{1i}+(1-\theta ){\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{1i}{p}_{0i}}$$
(10)
Here \({\widehat{M}}_{0i}\) denotes the predicted probability that pixel i belongs to the foreground and \({\widehat{M}}_{1i}\) the probability that it belongs to the background, while \({p}_{0i}\) and \({p}_{1i}\) encode the ground truth: \({p}_{0i}=1\) (and \({p}_{1i}=0\)) designates foreground, whereas \({p}_{0i}=0\) (and \({p}_{1i}=1\)) designates background.
The final asymmetric Medical Focal loss (MFL) can be represented by Eq. (11).
$${L}_{amF}=\mu {L}_{maf}+(1-\mu ){L}_{maFT}$$
(11)
where \(\mu \in [\mathrm{0,1}]\) is similarly a hyperparameter used to determine the weight of the two loss functions.
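A compact PyTorch sketch of our reading of Eqs. (8)–(11) for the binary cell/background case follows. The per-pixel weighting of the background term is an interpretation on our part (the published implementation may differ in detail), and all function and variable names are ours.

```python
import torch

def medical_focal_loss(prob_cell, target, theta=0.6, tau=0.7, mu=0.5, eps=1e-7):
    """Sketch of the Medical Focal Loss (Eqs. (8)-(11)) for binary segmentation.

    `prob_cell` is the predicted foreground (cell) probability per pixel and
    `target` is the binary ground-truth mask; theta, tau, mu follow the text.
    """
    prob_cell = prob_cell.clamp(eps, 1 - eps)
    prob_bg, target_bg = 1 - prob_cell, 1 - target

    # Eq. (8): plain cross-entropy on the rare (cell) class,
    # focal suppression of the background class.
    l_maf = -(theta * target * torch.log(prob_cell)
              + (1 - theta) * (1 - prob_bg) ** tau * target_bg * torch.log(prob_bg)).mean()

    # Eq. (10): Tversky index, weighting false positives/negatives by theta.
    def tversky(p, t):
        tp = (p * t).sum()
        fp = (p * (1 - t)).sum()
        fn = ((1 - p) * t).sum()
        return tp / (tp + theta * fp + (1 - theta) * fn + eps)

    # Eq. (9): focal enhancement applied only to the minority (cell) class.
    l_maft = (1 - tversky(prob_bg, target_bg)) + (1 - tversky(prob_cell, target)) ** (1 - tau)

    # Eq. (11): weighted combination of the two terms.
    return mu * l_maf + (1 - mu) * l_maft
```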

The branching structure of the HRMED_T

Figure 2 describes the final structure of the network in detail. Stage 3 of HRMED_T is executed twice to repeatedly fuse multi-scale features. Additionally, the third branch of Stage 3 contains 6 Transformer blocks to extract LR features. Since the computational complexity of ViT is proportional to the square of the image size, and the feature map of the third branch is only one-sixteenth of the original size while retaining a relatively large receptive field, adding more ViT blocks lets the network capture more global features without adding too many parameters. Furthermore, compared with the fourth branch, the receptive field of the third branch is not so large as to lose too much image detail. Therefore, in Stage 4 the third branch still has 6 Transformer blocks and the fourth branch has only 3. Finally, all feature maps are merged into the final output through a 1 × 1 Conv + BN + bilinear-interpolation upsampling head. The main steps of the HRMED_T algorithm are described in Algorithm 1.

Model compression module

Channel distribution distillation

We adopt knowledge distillation as the model compression method: a teacher model with many parameters and high computational cost teaches a student model with low memory usage and low cost. The HRMED_T model serves as the teacher and HRMED_S as the student; the two have similar architectures.
The model distillation algorithm we use maps the intermediate layer features of the HRMED_T as the learning object for the HRMED_S. The student and teacher networks' corresponding intermediate layer features are aligned and transformed into a probability distribution through an activation function. The difference between the student and teacher is estimated by the KL divergence (Fig. 5). In this representation, the HRMED_T is represented by \(A\), the HRMED_S by \(B\), and the intermediate layer features of the teacher and student after being transformed by an activation function into a probability distribution are represented by \({M}^{A}\) and \({M}^{B}\), respectively. The formula for knowledge distillation can be found in (12) [48].
$${L}_{amF}(\widehat{y},{M}^{B})+\alpha \cdot \Phi (\varphi ({M}^{A}),\varphi ({M}^{B}))$$
(12)
In this formula, \(\widehat{y}\) represents the true label of an image, \({L}_{amF}\) is the Medical Focal Loss, and \(\alpha \) is a hyperparameter that weights the distillation term relative to the supervised loss. For the intermediate feature maps, \(\alpha \) is set higher because the HRMED_T can better guide the HRMED_S with the features it has learned; decreasing \(\alpha \) at the final output layer helps the student model approach the real labels more closely. To train the student model with unlabeled data, one simply removes the \({L}_{amF}\) term and \(\alpha \).
The difference between the feature maps of the HRMED_T and HRMED_S can be seen in (13):
$$\Phi (\varphi ({M}^{A}),\varphi ({M}^{B}))=\Phi (\varphi ({M}_{c}^{A}),\varphi ({M}_{c}^{B}))$$
(13)
The channel \(c\in [1,C]\) represents the c-th channel of the feature map. The definition of \(\varphi (\cdot )\) can be seen in (14).
$$\varphi ({M}_{c})=\frac{exp(\frac{{M}_{c,i}}{T})}{{\sum }_{i=1}^{W\bullet H}exp(\frac{{M}_{c,i}}{T})}$$
(14)
\(T\) represents the distilled temperature. The probability becomes softer if a larger \(T\) is used, implying a wider spatial focus for each channel. If a mismatch occurs between the middle layer of the student and the teacher, a Conv with a kernel size of 1 × 1 is required to expand the student's channel number to match that of the teacher. The function \(\Phi (\cdot )\) is utilized to evaluate the distributional difference between two feature maps, here the KL divergence is employed [49], as seen in (15).
$$\Phi ({M}^{A},{M}^{B})=\frac{{T}^{2}}{C}\sum_{c=1}^{C}\sum_{i=1}^{W\bullet H}\varphi ({M}_{c,i}^{A})\cdot log\left[\frac{\varphi ({M}_{c,i}^{A})}{\varphi ({M}_{c,i}^{B})}\right]\\$$
(15)
Transferring knowledge from the teacher model to the student model through knowledge distillation has numerous advantages: the soft targets reduce the risk of overfitting in the student model; good results can be obtained with little data, even with few or zero labeled samples; and the teacher's guidance makes it possible to exploit an almost unlimited amount of unlabeled data to address sample scarcity. Additionally, the student model's smaller parameter count translates to easier deployment, higher real-time performance and efficiency, and reduced storage space. In our experiments, we set the temperature \(T=3\), \(\alpha =5\) for the logits map, and \(\alpha =50\) for the intermediate feature maps.
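A minimal PyTorch sketch of the channel-wise distillation term described by Eqs. (13)–(15) is given below. Function and variable names are ours, and the 1 × 1 convolution that matches the student's channel count to the teacher's is assumed to have already been applied.

```python
import torch
import torch.nn.functional as F

def channel_distillation_loss(feat_teacher, feat_student, T=3.0):
    """Sketch of the channel-wise distillation term (Eqs. (13)-(15)).

    Each channel's spatial activations are softened into a distribution with
    temperature T (Eq. (14)), and the student is pushed toward the teacher with
    a KL divergence scaled by T^2 / C (Eq. (15)).
    """
    B, C, H, W = feat_teacher.shape
    t = feat_teacher.reshape(B, C, H * W) / T
    s = feat_student.reshape(B, C, H * W) / T
    p_t = F.softmax(t, dim=-1)                                 # phi(M^A): per-channel spatial softmax
    log_p_s = F.log_softmax(s, dim=-1)                         # log phi(M^B)
    kl = (p_t * (torch.log(p_t + 1e-12) - log_p_s)).sum(-1)    # KL divergence per channel
    return (T ** 2 / C) * kl.sum(-1).mean()                    # sum over channels, mean over batch

# Overall objective of Eq. (12): supervised Medical Focal Loss on the student
# output plus the alpha-weighted distillation term on matched feature maps
# (alpha = 50 for intermediate feature maps, alpha = 5 for the output logits).
```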

HRMED_S model structure

The student model has a simple structure and does not use a Transformer. The details of each convolution are given in Fig. 6.
For the fusion module at the end of each stage, the lower-branch network adjusts the channel number through a 1 × 1 convolution kernel and a BN layer and then performs n-fold upsampling through bilinear interpolation; the specific magnification is indicated by the line colors in the figure. The upper-branch network downsamples through t 3 × 3 convolution kernels with a stride of 2, a BN layer, and a ReLU activation function; the value of t is likewise indicated by the line colors. Features at the same level are added directly through a skip connection.
At the end of the student model, we replaced the original fully connected layer with a Conditional Random Field (CRF) to obtain the final image prediction output. This operation facilitates the pure convolutional network to obtain more contextual information about the units and reduces the ambiguity of the edge segmentation of the densely stacked units, thus improving the segmentation performance to some extent [50]. Equation (16) represents the probability of each pixel of the final prediction:
$$ P\left( {y{\mid }x} \right) = \frac{1}{{Z\left( x \right)}}\exp \left( {\mathop \sum \limits_{{i \in U}} \mathop \sum \limits_{{j \in U\left( {x_{i} } \right)}} T_{{i,j}} \left( {y_{j} ,y_{i} ,x_{i} ,i} \right) + \mathop \sum \limits_{{i \in U}} S_{i} \left( {y_{i} ,x_{i} } \right)} \right) $$
(16)
\(X=\{{x}_{1},{x}_{2},\cdots {x}_{n}\}\) represents the features of each pixel in the original cytopathology image, and \(Y=\{{y}_{1},{y}_{2},\cdots {y}_{n}\}\) represents the features mapped from each pixel in the segmentation prediction map output by HRMED_S, including the image color, texture, predicted probabilities of surrounding points, and other features. \(U({x}_{i})\) denotes the pixels around point \({x}_{i}\). \({T}_{i,j}\) expresses the relationship between point \({x}_{i}\) and the features of the surrounding points, and \({S}_{i}\) expresses the features of point \({x}_{i}\) itself. \(Z(x)\) is the normalization (partition) function, whose expression is shown in (17) [51]:
$$Z(x)=\sum_{y\in Y}\exp \left(\sum_{i\in U}\sum_{j\in U({x}_{i})}{T}_{i,j}\left({y}_{j},{y}_{i},{x}_{i},i\right)+\sum_{i\in U}{S}_{i}\left({y}_{i},{x}_{i}\right)\right)$$
(17)
Similar to the teacher model, the input images in the student model first pass through a Conv-BN-ReLU module. This module reduces the image to a quarter of its original size while modestly increasing the number of channels, mainly to keep the image sizes consistent with the teacher model so that the student model can distill its information. Likewise, Stages 3–4 are executed twice, and finally the CRF produces the output.

Specific details of the model distillation process

From the above analysis, the teacher model HRMED_T (Fig. 2) and the student model HRMED_S (Fig. 6) have very similar structures; for example, the two models have the same number of branches and the same image size at each stage. HRMED_T uses the Vision Transformer structure, which captures global image information better and considers contextual relationships in semantic segmentation, whereas HRMED_S emphasizes light weight and lower computational complexity. To reduce the structural complexity of the student model while maintaining its predictive performance, we align each fusion layer of HRMED_T with the corresponding fusion layer of HRMED_S (marked with "+" in both architecture diagrams) to further enhance the student model's performance. At the same fusion position the image sizes are identical, but the teacher model usually has more channels, so HRMED_S uses a 1 × 1 convolution to increase its channel count before computing the difference with the teacher model; learning this type of soft target improves robustness and generalization.

Quantitative module for segmentation results

In this section, we have combined various indices used in the analysis of tumor cells to design a tool for quantifying the segmentation results of cells, allowing pathologists to quickly and quantitatively analyze and measure large amounts of pathological images without the need for computer vision or programming training. The tool includes:
(a) Providing the nucleus-cytoplasm ratio (the ratio of cell pixels to background pixels in the segmentation result) and cell count for each image, and allowing the pathological images to be sorted based on these two indices.
(b) For each segmented image, to better assist doctors in observing tissue constituent morphology, we have synthesized the segmentation outcome with the source image, marking the background as dark and marking the cells as bright.
(c) For a selected image, we reveal the shape and circular deviation of each cell by calculating its roundness (\(R=\frac{4\pi S}{{C}^{2}}\), where S and C are the area and perimeter of the cell, respectively), aspect ratio (\(AR=\frac{{L}_{l}}{{L}_{s}}\), where \({L}_{l}\) and \({L}_{s}\) are the lengths of the long and short axes of the cell, respectively), and eccentricity (\(e=\frac{c}{a}\), where c is the focal distance of the ellipse formed by the cell's long and short axes and a is the length of its semi-major axis). Based on thresholds set by experts, the calculated results are displayed in a graded format, highlighting recognition results with unusual shapes and thereby revealing the heterogeneity and polymorphism of the cells.
(d) To facilitate comparative analysis, ROI localization, and surgical plan customization for physicians, this tool offers the ability to select, crop, rotate, and resize cells in the segmentation result, while also enabling easy viewing of selected cell area, perimeter, and the corresponding image position in the original pathological image.
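To show how these indices can be derived from a binary segmentation mask, the sketch below uses scikit-image region properties. It is an illustration only (the tool itself is not necessarily built on scikit-image), and the function name is hypothetical.

```python
import numpy as np
from skimage import measure

def quantify_cells(mask):
    """Sketch of the per-cell indices provided by the quantification tool.

    `mask` is a binary segmentation result (1 = cell, 0 = background).
    """
    labeled = measure.label(mask > 0)
    cells = []
    for region in measure.regionprops(labeled):
        S, C = region.area, region.perimeter
        cells.append({
            "area": S,
            "perimeter": C,
            "roundness": 4 * np.pi * S / (C ** 2) if C > 0 else 0.0,     # R = 4*pi*S / C^2
            "aspect_ratio": (region.major_axis_length / region.minor_axis_length
                             if region.minor_axis_length > 0 else float("inf")),  # AR = L_l / L_s
            "eccentricity": region.eccentricity,                          # e = c / a of fitted ellipse
        })
    # image-level indices: cell count and nucleus-to-background pixel ratio
    ncr = float((mask > 0).sum()) / max(float((mask == 0).sum()), 1.0)
    return {"cell_count": len(cells), "nucleus_cytoplasm_ratio": ncr, "cells": cells}
```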

Experiments and results

Experimental environment and dataset introduction

Dataset: The data we used was sourced from the Research Center for Artificial Intelligence, Monash University, comprising a total of 1000 pathology images [52]. At 40 × magnification we cropped random areas, extracting 10 sub-images of size 512 × 512 from each pathology image, for a total of 10,000 images. After screening, 2164 images were available for training.
Contrasting Models: The models used for comparison are U-Net [53], UNet++ [54], DeepLabv3+ [55], Attention U-Net [56], SETR [57], Swin-Unet [58], and CSWin Transformer [59].
Evaluation Metrics: The confusion matrix is a commonly used basis for evaluating medical image segmentation [60]. The network's prediction accuracy for segmented regions is measured by the F1-score (F1), precision (Pre), recall (Re), Dice Similarity Coefficient (DSC), and Intersection over Union (IoU), all derived from the confusion matrix [61]. Recall gives the percentage of the true region that is correctly segmented, and precision gives the percentage of the predicted region that is actually correct. IoU divides the area of overlap between the labels and the predicted segmentation by the area of their union, and DSC computes the similarity between the true and predicted regions. The same metrics are also used to evaluate cell segmentation performance [62]. We work to improve the model's metrics in digital pathology image segmentation to achieve precise cell nucleus prediction. Furthermore, to compare the computational cost of the lightweight model obtained through knowledge distillation with that of the original model, we use floating-point operations (FLOPs) and the number of parameters (Params) to measure model size [63]. FLOPs denotes the number of floating-point operations, which measures the computational volume of a deep learning model and allows the computational complexity of the inference phase to be assessed. Params refers to the number of model parameters, a measure of the model's complexity and expressiveness [64].
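For clarity, the sketch below computes these confusion-matrix-based metrics for a binary mask; names are ours, and for binary segmentation the DSC coincides with the F1-score derived from precision and recall.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Sketch of the confusion-matrix-based metrics used in the evaluation."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # true positives
    fp = np.logical_and(pred, ~gt).sum()         # false positives
    fn = np.logical_and(~pred, gt).sum()         # false negatives
    eps = 1e-7
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    iou = tp / (tp + fp + fn + eps)              # overlap area / union area
    dice = 2 * tp / (2 * tp + fp + fn + eps)     # similarity of prediction and label
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": iou, "DSC": dice, "Re": recall, "Pr": precision, "F1": f1}
```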
Hyperparameter Settings: In all the following experiments, we trained each model for 300 epochs with a batch size of 4. During training, we used the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.9, and a decay rate of 0.0005.
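As a reference point, the stated configuration corresponds to a standard PyTorch SGD setup such as the sketch below; the placeholder module stands in for HRMED_T or HRMED_S, and reading the decay rate as weight decay is our assumption.

```python
import torch
import torch.nn as nn

# Placeholder network; in practice this would be HRMED_T or HRMED_S.
model = nn.Conv2d(3, 2, kernel_size=1)

# SGD with the hyperparameters stated above ("decay rate" read here as weight decay).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
```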

Result and discussion

In Fig. 7, we show a set of data after random color enhancement. We omit the results of the first four basic augmentations and focus on the designed color enhancement effects. As the figure shows, after multiple color enhancement operations we closely simulate the noise caused by staining differences in pathology images.
Next, we conducted comprehensive comparative experiments, under the same initial conditions, between our segmentation models HRMED_T and HRMED_S and other commonly used medical image segmentation models or highly representative semantic segmentation models. Table 3 presents the complete results. Under default parameters, HRMED_T achieves excellent segmentation with an IoU 1.6% higher than the best-performing baseline, CSWin Transformer. Under the default conditions, however, the student model HRMED_S does not perform outstandingly; its segmentation results are similar to those of U-Net. It is also worth noting that Swin-Unet, a ViT model with small parameter and computational cost that performs well on abdominal organ segmentation, does not perform satisfactorily on pathology image segmentation; its performance is even inferior to UNet++. This directly indicates that applying segmentation models from other fields to pathology images may not yield good results. In contrast, our student model HRMED_S, designed specifically for pathology images with a pure CNN architecture, achieves better results than Swin-Unet despite lower FLOPs and Params, owing to its ability to maintain high-resolution representations throughout.
Table 3
The prediction performance of each model for cell nuclei in digital pathology images of osteosarcoma

Model | IoU | DSC | Re | Pr | F1 | FLOPs | Params
U-Net | 0.639 | 0.773 | 0.789 | 0.803 | 0.773 | 110.02 G | 9.16 M
UNet++ | 0.662 | 0.791 | 0.815 | 0.800 | 0.791 | 277.26 G | 54.71 M
DeepLabv3+ | 0.655 | 0.787 | 0.844 | 0.762 | 0.751 | 667.36 G | 34.88 M
Attention U-Net | 0.656 | 0.786 | 0.786 | 0.800 | 0.815 | 533.08 G | 86.21 M
SETR | 0.595 | 0.739 | 0.790 | 0.718 | 0.739 | 397.21 G | 27.17 M
Swin-Unet | 0.605 | 0.748 | 0.901 | 0.659 | 0.748 | 11.74 G | 52.0 M
CSWin Transformer | 0.657 | 0.893 | 0.894 | 0.862 | 0.893 | 230.69 G | 7.77 M
Our (HRMED_T) | 0.673 | 0.895 | 0.930 | 0.896 | 0.895 | 232.49 G | 55.7 M
Our (HRMED_T + MFL) | 0.681 | 0.898 | 0.926 | 0.875 | 0.898 | 232.49 G | 55.7 M
Our (HRMED_T + MFL + DA) | 0.756 | 0.911 | 0.911 | 0.912 | 0.911 | 232.49 G | 55.7 M
Our (HRMED_S) | 0.599 | 0.864 | 0.905 | 0.812 | 0.864 | 10.33 G | 3.99 M
Our (HRMED_S + MFL) | 0.615 | 0.879 | 0.876 | 0.846 | 0.879 | 10.33 G | 3.99 M
Our (HRMED_S + MFL + DA) | 0.642 | 0.890 | 0.919 | 0.892 | 0.890 | 10.33 G | 3.99 M
Our (HRMED_S + MFL + DA + KD) | 0.710 | 0.910 | 0.903 | 0.916 | 0.910 | 10.33 G | 3.99 M
Our (HRMED_S + MFL + DA + KD + CRF) | 0.710 | 0.911 | 0.897 | 0.932 | 0.915 | 10.33 G | 3.99 M
Table 3 also shows that the Medical Focal Loss increases the key IoU indicator by 0.8% and 1.6% for HRMED_T and HRMED_S, respectively, while the data augmentation pipeline brings further improvements of 7.5% and 2.7%. Furthermore, model distillation provides a remarkable boost to the student model, bringing its performance nearly in line with that of the teacher model, primarily because the student model we designed is highly similar to the teacher and undergoes multi-layer channel distillation. With further parameter tuning, the final model's IoU, DSC, Pr, and F1 improve by 11.1%, 4.7%, 12.0%, and 5.1%, respectively, compared with the original HRMED_S. Although CRF, as a post-processing step, does not greatly improve the overall metrics, in practice it connects the segmentation to the image context and makes the edge segmentation of stacked cells in pathological images clearer. The FLOPs of the HRMED_T model are only 232.49 G, and those of the HRMED_S model only 10.33 G. Compared with Swin-Unet, which requires only 11.74 G FLOPs, HRMED_T needs roughly 20 times more computation, but HRMED_S has computational complexity similar to Swin-Unet while delivering higher prediction performance, with IoU and DSC both exceeding those of Swin-Unet.
In Fig. 8, we present the experimental results more intuitively using line graphs and bar charts. Panel (a) shows that data augmentation yields the most significant improvement in the performance of the HRMED_T model; we attribute this to the data pipeline substantially enriching the scarce dataset and the color augmentation amplifying the staining noise across pathological images, which lets the deep learning network better learn the dataset's features. In panel (b), the distilled model's indicators are almost all in the lead. Panel (c) compares the key indicators of several models with significant differences, and our two models lead in nearly all of them. Panel (d) highlights the large gap in parameters and computation between the distilled HRMED_S model and the other models, underlining its lightweight character: the HRMED_S model has the lowest number of parameters, only 3.99 M, indicating relatively low hardware requirements, and the lowest computational cost, only 10.33 G FLOPs, indicating faster processing speed.
Figure 9 illustrates the segmentation performance of different models on several representative osteosarcoma pathology images under default training conditions. Our HRMED_T model yields the most reliable segmentation results. Although HRMED_T does not show a clear advantage on simple segmentation tasks such as (d) and (e), it excels on images with densely packed cells, such as (a) and (b), where it delineates the boundaries of stacked cells more clearly. Moreover, when cells are unevenly distributed, vary in size, or have irregular shapes, as in (c) and (f), HRMED_T demonstrates good robustness and generalization.
Figure 10 shows the prediction results of the optimized models on four of the images from Fig. 9. For simple segmentation tasks, HRMED_T and HRMED_S perform very similarly and both effectively alleviate over-segmentation and under-segmentation. For complex boundaries, however, such as those in the first image, the student model's segmentation at locations where cells are stacked is still insufficient. Even with CRF post-processing to incorporate contextual information, the CNN-based student remains less sensitive to spatial information than the ViT-based teacher. On the other hand, the HRMED_S model has lower computational complexity and trains faster, making it easier to deploy in regions with insufficient medical resources.
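To make the CRF post-processing step concrete, the following minimal sketch refines a two-class softmax map with a fully connected CRF. It assumes the open-source pydensecrf package, and the kernel parameters are illustrative defaults rather than the exact values used in our system.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image: np.ndarray, probs: np.ndarray, n_iters: int = 5) -> np.ndarray:
    """Refine a (2, H, W) softmax map with a dense CRF over an (H, W, 3) uint8 RGB image."""
    probs = np.clip(probs, 1e-8, 1.0).astype(np.float32)     # avoid log(0) in the unary term
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))               # -log(p) unary potentials
    d.addPairwiseGaussian(sxy=3, compat=3)                    # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=10,                   # appearance kernel on the raw image
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = d.inference(n_iters)
    return np.argmax(np.array(q).reshape(n_labels, h, w), axis=0)
```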

Ablation experiments

Hyperparameters such as the number of layers and the attention window size affect the segmentation performance of the model, so we performed ablation experiments to verify the sensitivity of the results.
First, we examined how the attention window size \(s\) in the high-precision HRMED_T model affects performance. Starting from the original model, we varied the window size and observed the change in model performance, as shown in Table 4, with \(s\) increasing from 1 to 9. A larger window makes the model more computationally expensive, but performance does not simply increase with window size: when \(s\le 3\), the IoU increases as the window grows; when \(s>3\), the IoU gradually decreases. This suggests that a suitable window size is more conducive to capturing detailed features. Weighing computational cost against segmentation accuracy, we chose \(s=3\), at which the model achieved the best results.
Table 4
The impact of different window sizes on the Medical Attention structure

Window size \(s\) | 1 | 3 | 5 | 7 | 9
FLOPs (G) | 231.99 | 232.49 | 232.8 | 233.78 | 234.93
mIoU | 0.709 | 0.716 | 0.713 | 0.712 | 0.712
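To make explicit how the window size \(s\) enters the attention computation, the sketch below implements a generic window-restricted multi-head self-attention in PyTorch. It is a simplified stand-in rather than the exact Medical Attention structure, and the class and argument names are our own.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping s x s windows (illustrative)."""
    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.s = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by the window size s
        B, H, W, C = x.shape
        s, h = self.s, self.num_heads
        # Partition the feature map into (B * num_windows, s*s, C) windows.
        x = x.view(B, H // s, s, W // s, s, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, s * s, C)
        qkv = self.qkv(x).reshape(-1, s * s, 3, h, C // h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale      # attention only within each window
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, s * s, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // s, W // s, s, s, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```

A larger \(s\) enlarges the per-window attention matrix (of size \(s^2 \times s^2\)), which is why the FLOPs in Table 4 grow with the window size.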
Next, we evaluated the impact of the three MFL parameters on the algorithm. As shown in Fig. 11, we experimented with these parameters in both the HRMED_T and HRMED_S models. In Fig. 11a, we first fixed \(\mu =0.5\) and \(\theta =0.5\) and varied \(\tau \in [0.1, 0.9]\) to analyze how the IoU of the HRMED_T model changes with \(\tau \). We then fixed \(\mu =0.5\) and \(\tau =0.5\) and varied \(\theta \in [0.1, 0.9]\) to obtain the variation of IoU with \(\theta \). Finally, with \(\theta =0.5\) and \(\tau =0.5\), we varied \(\mu \in [0.1, 0.9]\) to analyze the relationship between IoU and \(\mu \). To analyze the sensitivity of the model to different loss functions, we also compared common losses (Cross-entropy, Dice, Focal, and Tversky) with MFL. For the HRMED_T model, the best performance is obtained with \(\tau \), \(\theta \), and \(\mu \) set to 0.7, 0.6, and 0.5, respectively. Any value of \(\mu \in [0.1, 0.9]\) yields results that are almost always better than the best-performing baseline, the Tversky loss, and for fixed \(\tau \) and \(\theta \), the best result of the HRMED_T model improves the IoU by 0.6% over the Tversky loss. Figure 11b shows the performance of the HRMED_S model for different parameter values. As with the HRMED_T model, we analyzed the variation of IoU with each parameter while fixing the other two. The HRMED_S model performs best with \(\tau \), \(\theta \), and \(\mu \) set to 0.6, 0.3, and 0.4, respectively. Again, values of \(\mu \in [0.1, 0.9]\) almost always outperform the best-performing Tversky loss, the model outperforms the Focal loss for any \(\tau \), \(\theta \), and \(\mu \), and for fixed \(\tau \) and \(\theta \) the best result of the HRMED_S model improves the IoU by 1% over the Tversky loss. Based on this analysis, the overall performance of the algorithm is best when \(\mu =0.5\), \(\theta =0.6\), and \(\tau =0.7\), so we set \(\mu =0.5\), \(\theta =0.6\), and \(\tau =0.7\) in this experiment.
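For reference, the Tversky loss that serves as the strongest baseline above has the standard form sketched below in PyTorch, with the usual weights for false positives and false negatives. The definition of MFL itself and its parameters \(\tau \), \(\theta \), \(\mu \) is given earlier in the paper and is not reproduced here.

```python
import torch

def tversky_loss(probs: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.5, beta: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    """Standard Tversky loss for binary segmentation.

    probs  -- predicted foreground probabilities, shape (N, H, W)
    target -- binary ground-truth masks (float 0/1), shape (N, H, W)
    alpha / beta weight false positives / false negatives; alpha = beta = 0.5 recovers the Dice loss.
    """
    tp = (probs * target).sum(dim=(1, 2))
    fp = (probs * (1 - target)).sum(dim=(1, 2))
    fn = ((1 - probs) * target).sum(dim=(1, 2))
    tversky_index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky_index).mean()
```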
In addition, the distillation temperature \(T\) and the weight \(\alpha \) of the output-layer loss have a large influence on the algorithm, especially for the HRMED_S model, where the value of \(T\) determines the effectiveness of knowledge distillation. As shown in Fig. 12, we further tuned \(\alpha \) and \(T\) on HRMED_S. The figure shows that softer probability maps convey knowledge better regardless of the value of \(\alpha \), but too high a temperature flattens the distribution towards uniform and degrades performance. Increasing the weight \(\alpha \) of the soft target relative to the hard target also improves segmentation performance, whereas an \(\alpha \) that is either too large or too small lowers the IoU. When \(\alpha =5\), the overall IoU is at a high level, and it reaches its optimum at \(T=3\), where the model gives the best predictions.
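To make the roles of \(T\) and \(\alpha \) concrete, the sketch below combines a hard-target segmentation loss with a temperature-softened, channel-wise KL term in the spirit of Channel-wise Knowledge Distillation. The additive weighting and the function names are illustrative assumptions, not the exact formulation used in HRMED_S.

```python
import torch
import torch.nn.functional as F

def channel_wise_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                         T: float = 3.0) -> torch.Tensor:
    """Channel-wise distillation: a per-channel softmax over spatial locations,
    softened by temperature T and matched with a KL divergence scaled by T**2."""
    n, c, h, w = student_logits.shape
    s = F.log_softmax(student_logits.view(n, c, h * w) / T, dim=-1)
    t = F.softmax(teacher_logits.view(n, c, h * w) / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T ** 2)

def distillation_objective(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                           hard_loss: torch.Tensor, alpha: float = 5.0, T: float = 3.0) -> torch.Tensor:
    """Combine the hard-target segmentation loss with the soft channel-wise term,
    weighted by alpha (an assumed additive weighting for illustration)."""
    return hard_loss + alpha * channel_wise_kd_loss(student_logits, teacher_logits, T)
```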
Finally, we analyzed how the number of convolutional layers in each branch of the student model HRMED_S at each stage affects performance. More convolutional layers yield a higher IoU, but also a higher computational cost. For example, increasing the number of convolutions per module from 2 to 5 raises the IoU by only 0.0084 while increasing the FLOPs by about 80%. Balancing performance against computational complexity, this study sets the number of convolutions per module to 2 CNN layers (Table 5).
Table 5
Comparison of the number of CNN layers in each sub-module of the HRMED_S model

Number of CNN layers | 1 | 2 | 3 | 4 | 5
FLOPs (G) | 6.66 | 10.33 | 12.65 | 15.64 | 18.64
IoU | 0.691 | 0.710 | 0.709 | 0.714 | 0.718

Conclusion

This article presents an intelligent assistive diagnosis system for malignant tumor pathology images based on a Multi-Scale High-Resolution Vision Transformer. The system comprises four modules: image enhancement, model segmentation, knowledge distillation, and quantitative measurement. It aims to reduce the workload and difficulty for pathologists while achieving good predictive results at limited computational cost, and it offers flexibility for medical institutions operating under different conditions. Experiments using osteosarcoma pathology images as an example show that our system achieves accurate segmentation of malignant tumor pathology images.
In the future, we will continue to improve the accuracy of our segmentation models, sharpen the delineation of stacked cell edges, and make our models leaner. We will also work to address the small-sample issue by exploring semi-supervised or unsupervised learning, and investigate various noise reduction algorithms for data processing. For the auxiliary tools, we aspire to use clustering methods to classify complex cell borders rather than relying solely on mathematical parameters. Finally, we will integrate clinical practice to continuously refine our intelligent diagnostic assistance system.

Acknowledgements

Availability of data and materials: all data analyzed during the current study are included in the submission.

Declarations

Conflict of interest

The authors declare no conflict of interest.
Not applicable.

Institutional review board

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
