
Open Access 04.03.2024 | Original Article

Cytopathology image analysis method based on high-resolution medical representation learning in medical decision-making system

Authors: Baotian Li, Feng Liu, Baolong Lv, Yongjun Zhang, Fangfang Gou, Jia Wu

Published in: Complex & Intelligent Systems


Abstract

Artificial intelligence has made substantial progress in many medical application scenarios. The quantity and complexity of pathology images are enormous, and conventional visual screening is labor-intensive, time-consuming, and subject to a degree of subjectivity. Artificial intelligence image analysis can convert complex pathological data into mineable image features, enabling medical professionals to quickly and quantitatively identify regions of interest and extract information about cellular tissue. In this study, we designed a medical information assistance system for segmenting pathology images and quantifying the results, comprising data augmentation, cell nucleus segmentation, tumor modeling, and quantitative analysis. For cell nucleus segmentation, to address the problem of unevenly distributed healthcare resources, we designed a high-precision teacher model (HRMED_T) and a lightweight student model (HRMED_S). The HRMED_T model is based on the vision Transformer and high-resolution representation learning. It achieves accurate segmentation by iteratively fusing parallel low-resolution convolution streams with high-resolution features while maintaining a high-resolution representation throughout. The HRMED_S model uses Channel-wise Knowledge Distillation to simplify the structure and converge faster, and refines the segmentation results with a conditional random field in place of the fully connected output layer. Experimental results show that our system outperforms other methods: the Intersection over Union (IoU) of the HRMED_T model reaches 0.756, while the HRMED_S model reaches an IoU of 0.710 with only 3.99 M parameters.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

With the continuous integration of computer science and medical informatics, artificial intelligence has achieved substantial success in many medical scenarios, such as assisting in diagnosis, informing treatment decisions, predicting risks, and reducing medical errors [1–3]. AI can complete medical tasks with speed and accuracy, including the analysis of cardiology images [4], the diagnosis of eye diseases from optical coherence tomography [5], and the determination of bone age from X-rays [6]. Artificial intelligence is also being applied directly to difficult diagnostic problems such as left ventricular hypertrophy [7] and prostate cancer [8]. For the medical issues that artificial intelligence cannot yet resolve on its own, numerous auxiliary diagnostic techniques can still make medical professionals' work easier [9, 10]. Artificial intelligence can help medical workers save time and focus on more complex medical problems. Compared with traditional manual diagnosis, artificial intelligence-assisted diagnosis offers higher repeatability, objectivity, and real-time performance. In addition, artificial intelligence does not tire and can be deployed in many places at once; even remote areas can benefit from medical information support systems developed by top hospitals, helping to alleviate the problem of limited medical resources in developing countries [11].
Artificial intelligence-assisted segmentation of pathological images of tumors can significantly enhance the efficiency of medical practitioners in quantifying various cellular indicators, visualizing organelle morphologies, locating regions of interest (ROI), and supporting the customization of surgical plans [12–14]. To a certain extent, it can also alleviate the issue of untimely patient diagnosis in developing countries caused by a shortage of pathologists.
Investigations into pathological image segmentation methods fall primarily into two divisions: conventional machine learning techniques and deep learning approaches. The former relies heavily on extensive manual feature selection grounded in specialized knowledge, while the latter's end-to-end learning reduces dependence on such knowledge. Deep learning models such as CNNs [15], encoder-decoders [16], and Generative Adversarial Networks (GANs) [17] have developed rapidly in recent years. However, as models strive to learn increasingly complex features, they become larger and more computationally demanding. Model compression techniques such as knowledge distillation (KD) have emerged to address these challenges and bring cutting-edge models to regions or devices with limited computational resources [18]. In addition, in medical decision-making systems, appropriate event-triggering mechanisms not only improve data transmission performance but also minimize computational and storage resource consumption [19]. For example, Markov-based control systems and reinforcement learning have been applied to robotic systems, biological systems, and so on [20, 21].
We have summarized the potential problems that we perceive to exist in both traditional manual discrimination of pathological images and intelligent segmentation of pathological images as follows:
(1)
The cost of pixel-level labeling in pathological images is extremely high. For example, the TCGA Kumar dataset comprises merely thirty 1000 × 1000 pathological images at 40 × magnification [22], yet the number of annotated nuclei surpasses 20,000; such specialist costs are unfeasible in economically disadvantaged developing countries;
 
(2)
Given the large number of pathology sections and the ultra-high resolution of each image, manually screening and processing them is labor-intensive and time-consuming. The vast majority of each image is just background; the average cell area in our dataset is only 8.29% of the background area;
 
(3)
Cell nucleus segmentation models with high accuracy and strong generalization properties are often cumbersome. This means that the deployment of the models requires significant computational resources. Developing countries are not well equipped medically to enable the use of such equipment;
 
(4)
The number of pathologists in developing countries is low and the distribution of healthcare resources is often highly uneven [9, 14]. In China, for example, more than 80% of medical resources are concentrated in cities with only 10% of the population. This results in the majority of patients in regions with scarce resources and outdated equipment facing difficulties in receiving timely diagnoses and treatments in the early stages of disease development;
 
(5)
Because of the variability and complexity of pathology images, tumor diagnosis carries a degree of subjectivity that depends on the doctor's experience, for example in judging the geometric, texture, and shape features of tumor cells and their complexity. This may increase the misdiagnosis rate for inexperienced doctors and reduce the reproducibility of analysis results. How to quickly and effectively help doctors extract features objectively is therefore an urgent problem.
In summary, we have designed a medical information aid system for the segmentation and quantitative analysis of cell nuclei in pathology images. The system comprises data preprocessing, cell nucleus segmentation, tumor modeling, and quantitative analysis. Taking high-resolution medical representation learning as the backbone network, we designed a high-precision teacher model (HRMED_T). The HRMED_T model avoids the cost and parameter explosion of a conventional Transformer through a cross-shaped window design, which allows the Transformer to replace the convolutional structure. Meanwhile, given the limited computational resources in many regions, we compress the model by knowledge distillation and design a lightweight student model (HRMED_S). The HRMED_S model simplifies the hierarchical structure and uses conditional random fields in place of the original fully connected layer to obtain more contextual information about cells. The HRMED_S model has only 4.5% of the computation and 7.7% of the parameters of the HRMED_T model, providing better real-time performance. Our system thus offers options for regions with different computational resources. In addition, we have designed a quantitative analysis tool with which physicians can quickly analyze images: they can visually identify cell regions, obtain data such as cell eccentricity, and quickly localize regions of interest. The main contributions of this work are as follows:
(1)
To maximize the use of limited data and prevent overfitting due to the small sample size, we established a data augmentation pipeline for images of malignant tumor pathology. This pipeline included basic augmentation operations and four color treatments, namely random brightness, saturation, hue, and contrast enhancement; as a result, the model's robustness and generalization ability were enhanced;
 
(2)
Based on high-resolution representation learning and the Transformer, a high-precision teacher model (HRMED_T) was developed that allows the Transformer to replace the convolutional structure through cross-shaped windows. The model achieves accurate segmentation by repeatedly fusing parallel low-resolution (LR) convolution streams with high-resolution features while maintaining the high-resolution (HR) representation. In addition, the HRMED_T model introduces a unified focal loss function to address class imbalance.
 
(3)
A lightweight student model (HRMED_S) is built. It refines the segmentation results using a conditional random field structure and transfers knowledge from the HRMED_T model using the Channel-wise Knowledge Distillation algorithm. With faster training, the HRMED_S model achieves relatively good prediction accuracy with far fewer parameters. This is particularly effective for areas with limited healthcare resources, which can rely on large hospitals to train the model and then deploy it locally.
 
(4)
We validated the feasibility of our system on 2,164 pathological images of sarcoma that we generated, and multiple comparative experiments showed that both models within our system can perform accurate segmentation of the pathological images. Finally, we also designed a segmentation result processing module to assist doctors in quickly quantifying various cellular indices, thus enabling efficient and accurate diagnosis.
 
Related work

With the advancement of information technology and computer hardware, AI is continuously empowering various industries, and the combination of AI and medical diagnosis is one of the current hot topics [23–25].
In the field of image segmentation, a plethora of segmentation networks have emerged since the advent of neural networks. Wang et al. [26] proposed High-Resolution Net (HRNet), a novel CNN-based network that preserves high-resolution representations throughout training and leverages parallel low-resolution convolutions and multi-scale image fusion, yielding a more refined and spatially accurate segmentation outcome. Yutong Xie et al. [27] proposed CoTr, a 3D segmentation network that uses a CNN encoder-decoder to extract features; because a pure CNN structure cannot model long-range dependencies, the authors added a Transformer to solve the problem, which in turn increased the number of parameters. Hsiang-Yu Han et al. [28] proposed a deep CNN for real-time semantic segmentation that improves inference speed with a class-aware edge loss; the results show good real-time performance, but the method struggles to capture global information. Xiang Li et al. [29] proposed a convolutional neural network incorporating an attention mechanism, which replaces the traditional one-region-at-a-time focus of convolutional layers by establishing connections in the intermediate feature layers, thus improving the ability to capture global information.
In recent research on image segmentation, the Vision Transformer (ViT) has started to show unique advantages and is beginning to challenge the dominance of CNNs [30, 31]. Many CNN networks have been improved with ViT structures, such as the High-Resolution Transformer (HRFormer) proposed by Yuan et al. [32], a ViT-based improvement of HRNet. Through a multi-resolution parallel design and local window self-attention, it obtains globally aggregated features even in the shallow layers and greatly improves segmentation accuracy. Gu et al. [33] likewise proposed the High-Resolution Vision Transformer (HRViT), which fuses a high-resolution multi-branch architecture with ViT to address the poor performance of ViT on dense prediction tasks; HRViT improves accuracy and efficiency by reducing the redundancy of heterogeneous branches and enhancing the attention module. Although HRViT is more accurate than traditional CNN segmentation networks, its parameter count is still very large and the model targets natural-scene images; for complex pathology image segmentation, its performance still falls short of expectations. To minimize parameters while maintaining accurate segmentation, researchers have turned their attention to model distillation techniques.
Knowledge distillation conveys the insights of a robust but massive teacher model to a compact, lighter student model so that it can be deployed on devices with restricted computational capabilities. Yifan Liu et al. [34] designed a structured KD scheme that transfers pairwise similarity through pairwise distillation modules and holistic information through holistic distillation modules; however, this method adapts poorly to the complex background variation in pathology images, and its computational demand remains high. Dian Qin et al. [35] designed a distillation strategy that transfers information from existing models to lightweight student networks via a region affinity module, but the method still faces memory capacity problems in medical image processing. Yuannan Hou et al. [36] proposed a point-level to voxel-level knowledge distillation method (PVD) that improves the student model's feature extraction using similarity information between points and voxels; the results show lower resource consumption and faster training in the radar domain.
Due to the complexity of digital pathology images, existing general-purpose segmentation networks are difficult to apply directly to cytopathology images, so researchers have begun to focus on automatic cell nucleus segmentation techniques. Lukasz Roszkowiak et al. [37] proposed a clustering segmentation algorithm based on distance transformation; for aggregated cell nuclei, the algorithm divides the aggregated region into smaller, shape-based regions and segments them separately, reducing over-segmentation of cell nuclei. Similarly, to address overlapping cells, Qingbo Kang et al. [38] designed a two-stage nucleus segmentation framework with deep layer aggregation (DLA): the first stage generates coarse nucleus-boundary segmentation based on DLA, and the second stage refines it with a shallow U-Net, decomposing a complex task into subtasks. Jinjie Huang et al. [39] proposed a cervical cell clump segmentation method based on a multi-scale fuzzy clustering algorithm, consisting of three steps: separating the cell region from the background, extracting nodes of interest, and segmenting cell nuclei; the method achieves high accuracy on cervical sections. However, the parameter counts of such pathology image methods are still large, which makes it difficult to meet training speed requirements and to deploy them in areas with insufficient resources.
A comparative analysis of related work is shown in Table 1. The complexity of tumor cell nuclei and the limited medical resources of developing countries pose a challenge for automatic image segmentation. Existing large models sacrifice processing time and complexity to achieve high prediction accuracy, whereas lightweight models improve training speed but struggle to achieve the expected results on digital pathology images and are difficult to apply directly. We therefore design a high-precision segmentation network for malignant tumor pathology images based on high-resolution representation learning, together with a lightweight network with relatively low hardware requirements. Our method is well suited to malignant tumor diagnosis scenarios, especially in regions with insufficient medical resources.
Table 1
Summary analysis of related work

Classification | Literature | Dominance | Gaps
CNN-based | [26] | High spatial accuracy | Relatively high complexity of the network
CNN-based | [27] | Solves the long-range dependency problem of CNNs | —
CNN-based | [28] | Better performance in real-time segmentation | Difficulty in obtaining full access to global information
CNN-based | [29] | Improved ability to capture global information | Slower training
ViT-based | [30] | Good performance in remote sensing images | Limited application scenarios
ViT-based | [31] | Reduced need for labeled data | Recognition of small targets is limited
ViT-based | [32] | Precision improvement of segmentation | Model parameters increase too quickly
ViT-based | [33] | Improved performance relative to CNN-based methods | The number of parameters is still very large for realistic scenarios
KD-based | [34] | Better performance in the field of compact semantic segmentation networks | Computationally intensive and not applicable to the field of digital pathology images
KD-based | [35] | Introduction of a regional affinity module | —
KD-based | [36] | Lower resource consumption in the radar field | —
Digital pathology image segmentation model | [37] | Reduced excessive segmentation of the nucleus | Higher resource consumption
Digital pathology image segmentation model | [38] | Solved the cell overlap problem | —
Digital pathology image segmentation model | [39] | Suitable for cervical sectioning | —

The first three groups are classified by backbone network; the last group comprises classical algorithms for digital pathology images

System model design

Digital pathology image analysis is of great value for tumor diagnosis [40]. Because of the huge number and complexity of pathology images, traditional visual screening methods are not only time-consuming and labor-intensive but also subject to subjective variability. Through image segmentation, complex pathological data can be distilled into mineable image features, which help doctors quickly and quantitatively extract information about cell tissue [41–44]. Many models exist for digital pathology image segmentation. Facing the complexity of overlapping and clustering cell nuclei, many networks keep adding layers to achieve higher prediction accuracy, resulting in high complexity and slow training [45], while lightweight networks with fast training have relatively low accuracy [29]. We therefore designed an artificial intelligence medical information-assisted diagnosis system for digital pathology images, as shown in Fig. 1. It aims to assist doctors in diagnosing cytopathology images, improve the efficiency and quality of treatment while reducing medical costs, and provide practical and effective help to both doctors and patients.
Our intelligent diagnostic support system comprises four modules. First, the input image passes through a data augmentation pipeline incorporating several fundamental augmentation techniques and four color-targeted augmentations. Second, the image is fed into the teacher model HRMED_T for training, yielding a segmentation result. Third, after the teacher model has been trained, knowledge is transferred to a lightweight student network through knowledge distillation, a model compression technique. Fourth, regions with different medical resources can choose either model to output the segmentation result and perform quantitative analysis of the pathology image.
Throughout the system's models, the processed image dimensions of width, height, and number of channels are denoted \(W\times H\times C\), and a projection matrix or feature map during network training is denoted \(M\). Loss functions are denoted \(L\). The symbols used in this chapter and their meanings are listed in Table 2.
Table 2
Symbols and their interpretation

Symbol | Paraphrase
\(W\times H\times C\) | The dimensions (width and height) and number of channels of the processed pathological images
\(\theta \) | A custom parameter used to control the relative weight of positive and negative examples, \(\theta \in [0,1]\)
\(p\) | The pixel value of a point in a pathological image
\(\mu \) | Threshold for randomized contrast enhancement
\(M\) | A projection matrix or feature map during network training
\(e\) | Eccentricity of the nucleus
\({D}_{m}\) | Data matrix
\(\mathcal{l}\) | Customized block list
\(\mathcal{M}\) | The initial network architecture
\(\mathcal{N}\) | Initial student network architecture
\(\mathcal{y}\) | The label matrix
\(vl\) | Customized ViT list
\(\tau \) | A hyperparameter used to suppress the background class, \(\tau \in [0,1]\)
\(\mu \) | A hyperparameter used to determine the weight of the two loss functions, \(\mu \in [0,1]\)
\({L}_{maf}\) | Asymmetric focal loss for minority class enhancement
\({T}_{i,j}\) | The relationship between point \({x}_{i}\) and the features of the surrounding points
\({L}_{amF}\) | The combined Medical Focal Loss with unified parameters
\(\Phi \) | The difference between the feature maps of HRMED_T and HRMED_S
\(T\) | The distillation temperature
\(\widehat{M}\) | The predicted value matrix for different categories
\(y\) | The output of the model's prediction
\(\widehat{y}\) | The ground truth label of the pathological image
\(U({x}_{i})\) | The pixel points around point \({x}_{i}\)
\({L}_{maFT}\) | Modified asymmetric Focal Tversky loss
\({S}_{i}\) | The features of point \({x}_{i}\) itself
\(Z(x)\) | A normalization function

Each row lists a symbol used in this chapter and its corresponding meaning

Image pre-processing module

Pixel-level annotation of pathology images incurs an exorbitantly high cost: our dataset comprises only slightly more than 2000 pathology images at 40 × magnification and 512 × 512 pixels in size, yet the number of annotated cell nuclei reaches hundreds of thousands. To make the most of limited and expensive annotated data, we devised a data augmentation pipeline for malignant tumor pathology images that adds diversity to the input data by passing each image through the pipeline before every training iteration.
Considering the rotational invariance of cells and the basic features of random appearance at any position in the image, the input image first undergoes four fundamental data augmentation modules: random rotation, random translation, random cropping, and random flipping. Each module is executed with a fifty-percent chance, and the corresponding label image is also altered. Blank areas resulting from the base transformation are filled with black.
Finally, we use stochastic color enhancement to simulate staining differences arising from variations in the staining process, batch, scanner, and visual noise introduced while processing pathology images. This module is executed with a fifty-percent chance and contains several sub-modules; unless otherwise specified, each sub-module is also executed with a fifty-percent chance. Black areas in the original image are not enhanced, and the label image remains unchanged. Each sub-module is described in detail below.
(a) Random Brightness Augmentation: Cell pathology images often present uneven illumination due to factors such as microscope settings, sample thickness, staining depth, and leakage. To simulate lighting variations, we adopt a method of changing the image brightness for data augmentation.
$$ p_{{1i}} = p_{i} + \gamma $$
(1)
where \({p}_{i}\) denotes the pixel value at any point in the image and \(\gamma \) denotes a random brightness offset, \(\gamma \in [-32, 32]\). The change in brightness shifts the pixel values accordingly.
(b) Image Conversion Module: The image undergoes a conversion from the RGB format to the HSL format shown in Eq. (2). This is an intermediate step and will be executed whenever the color enhancement module is executed.
$$ \left\{ \begin{array}{l} \vartheta = \cos^{-1}\left[ \dfrac{\frac{1}{2}\left[ \left( R-G \right)+\left( R-B \right) \right]}{\sqrt{\left( R-G \right)^{2}+\left( R-B \right)\left( G-B \right)}} \right] \\ H = \left\{ \begin{array}{ll} \vartheta & G \ge B \\ 2\pi - \vartheta & G < B \end{array} \right. \\ L = \dfrac{1}{\sqrt{3}}\left( R+G+B \right) \\ S = 1 - \dfrac{3\min\left( R,G,B \right)}{R+G+B} \end{array} \right. $$
(2)
(c) Random Saturation Augmentation: Due to factors such as varying staining depth, inconsistent dyes, and uneven dye absorption, cell pathology images often have different background and nucleus staining colors among different pathological images. To simulate this situation, we randomly adjust the saturation of colors within a certain range.
$${S}{^\prime}=S+\eta $$
(3)
where \(S\) and \({S}{^\prime}\) are the saturation of the image before and after enhancement, and \(\eta \) is a random saturation-variation parameter, \(\eta \in [0.7, 1.3]\).
(d) Random Hue Augmentation: The reason for performing this data augmentation is the same as the previous one. We adjust the hue value of the image to achieve this.
$$H{^\prime}=\left(H+\upsilon \right)\bmod 100$$
(4)
where \(H\) and \(H{^\prime}\) are the hue values of the image before and after enhancement. The parameter \(\upsilon \) denotes a random hue offset, \(\upsilon \in [-18, 18]\).
(e) The Image Conversion Module: Converts an image from the HLS format back to the RGB format, utilizing the formula described in Eq. (2). It will always be executed whenever the Color Enhancement Module is performed.
(f) Random Contrast Enhancement: To simulate color differences, a random contrast enhancement threshold \(\mu \in [0.5, 1.5]\) is drawn, and each pixel value is scaled as shown in Eq. (5).
$${p}_{2i}={p}_{1i}\times \mu $$
(5)
In the data enhancement pipeline described above, every training image passes through the modules from random rotation to random color enhancement before being fed into the system. Despite the limited quantity of data, this approach ensures that each input image for training is nearly unique.
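To make the color-enhancement stage concrete, the sketch below strings sub-modules (a)–(f) together with OpenCV and NumPy. It is a minimal illustration rather than the authors' code: the function and variable names are ours, the saturation factor is applied multiplicatively (which matches the stated range [0.7, 1.3]), and the hue shift uses OpenCV's [0, 180) hue scale rather than the 0–100 scale of Eq. (4).

```python
import cv2
import numpy as np

def random_color_augment(img, p=0.5, rng=None):
    """Sketch of the stochastic color-enhancement stage (sub-modules (a)-(f)).

    `img` is an RGB uint8 pathology tile; black (padded) pixels are left
    untouched, as described in the pipeline above.
    """
    rng = rng or np.random.default_rng()
    out = img.astype(np.float32)
    mask = img.sum(axis=-1, keepdims=True) > 0              # ignore padded black areas

    # (a) random brightness offset, gamma in [-32, 32]
    if rng.random() < p:
        out = out + rng.uniform(-32, 32)

    # (b) convert to HLS so that hue and saturation can be modified
    hls = cv2.cvtColor(np.clip(out, 0, 255).astype(np.uint8),
                       cv2.COLOR_RGB2HLS).astype(np.float32)

    # (c) random saturation factor, eta in [0.7, 1.3]
    if rng.random() < p:
        hls[..., 2] = np.clip(hls[..., 2] * rng.uniform(0.7, 1.3), 0, 255)

    # (d) random hue shift, upsilon in [-18, 18] (OpenCV hue range is [0, 180))
    if rng.random() < p:
        hls[..., 0] = (hls[..., 0] + rng.uniform(-18, 18)) % 180

    # (e) convert back to RGB
    out = cv2.cvtColor(hls.astype(np.uint8), cv2.COLOR_HLS2RGB).astype(np.float32)

    # (f) random contrast scaling, mu in [0.5, 1.5]
    if rng.random() < p:
        out = out * rng.uniform(0.5, 1.5)

    out = np.clip(out, 0, 255).astype(np.uint8)
    return np.where(mask, out, img)                         # keep black padding unchanged
```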
Additionally, we attempted to remove common noise in cellular pathology images, such as background speckles and interference, differently stained contaminated areas, inconsistent staining depths, and microscope settings, by denoising, but we were unable to find a highly effective denoising method designed specifically for cellular pathology images. We tried denoising methods intended for natural images, but testing showed little improvement. For the variation in staining across batches, we did not adopt a complex stain normalization method, mainly because it would add computational cost, which runs counter to our goal of reducing computation. Instead, we model this noise and these color differences through data augmentation so that the deep learning network learns to distinguish their features.

Image segmentation model

The digital pathology image teacher model we designed is HRMED_T, which draws primarily on the HRViT model [33]. HRMED_T combines a high-resolution network with ViT and can maintain high-resolution feature extraction along the entire network pathway. At the same time, we use a customized attention structure that reduces computational complexity while ensuring that the global contextual information in osteosarcoma pathology images is not ignored.
The overall architecture of the model is illustrated in Fig. 2. Next, let us delve into the individual details of the modules in the network architecture diagram.

MEDAttn

Directly replacing the convolutions in HRNet with a traditional Transformer would lead to an explosion in computational cost and parameter size, so HRMED_T does not do this. Instead, we introduce a cross-shaped window design, as shown in Fig. 3.
Regarding image \(x\in {\mathbb{R}}^{H\times W\times C}\), instead of directly inputting the entire image for attention calculation, the image is first partitioned into two parts based on the number of channels. The upper part performs row attention calculation with a window size of \(s\times W\), while the lower part performs column attention calculation with a window size of \(H\times s\). Within each window, the patch is divided into \(I\) \({d}_{i}\)-dimensional subunits, following which local self-attention is employed. The overall formula can be seen in Eq. (6).
$$\left\{\begin{array}{l}\begin{array}{l}MEDAttn\left(x\right)=BN\left(\phi \left({M}^{o}\left[{y}_{1},\cdots ,{y}_{i},\cdots ,{y}_{I}\right]\right)\right)\\ {y}_{i}={r}_{i}+\mathit{DWConv}\left(\sigma \left({M}_{i}^{V}x\right)\right) \end{array}\\ \left[{r}_{i}^{1},\cdots ,{r}_{i}^{N}\right]={r}_{i}=\left\{\begin{array}{cc}\text{HAttn}_{i}\left(x\right),& 1\le i<\frac{I}{2}\\ \text{VAttn}_{i}\left(x\right),& \frac{I}{2}\le i\le I\end{array}\right. \\ {r}_{i}^{n}=Attn\left({M}_{i}^{Q}{x}^{n},{M}_{i}^{K}{x}^{n},{M}_{i}^{V}{x}^{n}\right) \\ \left[{x}^{1},\cdots ,{x}^{n},\cdots ,{x}^{N}\right]=x,\hspace{1em}{x}^{n}\in {\mathbb{R}}^{\left(\frac{H\times W}{s}\right)\times C} \end{array}\right. $$
(6)
where \({M}_{i}^{Q}\), \({M}_{i}^{K}\), and \({M}_{i}^{V}\) denote the projection matrices that generate \({Q}_{i}\) (query), \({K}_{i}\) (key), and \({V}_{i}\) (value) for the \(i\)-th self-attention head; \({M}^{o}\in {\mathbb{R}}^{C\times C}\) is the final projection matrix, and \(\phi \) is the Hard-Swish activation function. As shown in the second line of Eq. (6) and in Fig. 3, the projected \({V}_{i}\) (value) matrix not only participates in the attention computation but also passes through the Hard-Swish and depth-wise convolution (DWConv) [46] operations before being added to the original attention values. The \({r}_{i}\) in the third line of Eq. (6) is the attention score obtained after calculating the row and column attention of the input image. The attention calculation is given in Eq. (7):
$$Attn\left({M}_{i}^{Q}{x}^{n},{M}_{i}^{K}{x}^{n},{M}_{i}^{V}{x}^{n}\right)=\mathit{softmax}\left(\frac{{Q}_{i}^{n}{\left({K}_{i}^{n}\right)}^{T}}{\sqrt{{d}_{i}}}\right){V}_{i}^{n}$$
(7)
To alleviate the feature collapse that arises as networks deepen, we add an enhanced residual shortcut. Furthermore, since the MEDAttn windows do not span entire rows or columns, zero padding is applied when H or W is not a multiple of s to ensure the completeness of the final window.
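As an illustration of how the cross-shaped window splits the attention computation, the following PyTorch sketch implements only the core row/column windowing and Eq. (7). It assumes PyTorch 2.x for scaled_dot_product_attention, omits the DWConv value path, Hard-Swish projection, and BN of Eq. (6), and uses our own class and parameter names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossWindowAttention(nn.Module):
    """Minimal sketch of the cross-shaped window attention in MEDAttn.

    Half of the channels attend within horizontal strips (windows of size s x W),
    the other half within vertical strips (windows of size H x s). H and W are
    assumed to be multiples of s (the paper zero-pads otherwise).
    """

    def __init__(self, dim, s=3, heads=2):
        super().__init__()
        assert dim % 2 == 0 and (dim // 2) % heads == 0
        self.s, self.heads, self.half = s, heads, dim // 2
        self.qkv_h = nn.Linear(self.half, 3 * self.half)   # horizontal-strip branch
        self.qkv_v = nn.Linear(self.half, 3 * self.half)   # vertical-strip branch
        self.proj = nn.Linear(dim, dim)                     # final projection M^o

    def _attend(self, x, qkv):
        # x: (num_windows, tokens_per_window, half_dim)
        q, k, v = qkv(x).chunk(3, dim=-1)
        split = lambda t: t.unflatten(-1, (self.heads, -1)).transpose(1, 2)
        q, k, v = map(split, (q, k, v))                     # (win, heads, tokens, d_i)
        out = F.scaled_dot_product_attention(q, k, v)       # softmax(QK^T / sqrt(d_i)) V, Eq. (7)
        return out.transpose(1, 2).flatten(-2)              # back to (win, tokens, half_dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        xh, xv = x.split(self.half, dim=-1)
        s = self.s
        # horizontal strips: s consecutive rows form one (s x W) window
        yh = self._attend(xh.reshape(B * H // s, s * W, self.half), self.qkv_h)
        yh = yh.reshape(B, H, W, self.half)
        # vertical strips: s consecutive columns form one (H x s) window
        xv = xv.permute(0, 2, 1, 3).reshape(B * W // s, s * H, self.half)
        yv = self._attend(xv, self.qkv_v).reshape(B, W, H, self.half).permute(0, 2, 1, 3)
        return self.proj(torch.cat([yh, yv], dim=-1))


# Example: a 12 x 12 feature map with 64 channels and window width s = 3.
attn = CrossWindowAttention(dim=64, s=3)
out = attn(torch.randn(2, 12, 12, 64))   # -> (2, 12, 12, 64)
```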

Other detailed structures in HRMED_T

The Stem module is utilized to reduce the computation cost of ViT by downsizing images. The Patch Embed module is employed to integrate feature information from different branches during fusion. The structure of the Mixed-scale convolutional feedforward network (MixCFN) for further extraction of multi-scale features of images is shown in Fig. 4a.
Different-resolution fusion layer: For feature fusion between the \(i\)-th input and the \(j\)-th output (where \(i<j\)) in Fig. 4b, we first apply a DWConv with a kernel size of \({2}^{j-i}+1\) to downsample the feature map, followed by \({2}^{j-i}\times C\) 1 × 1 convolutional kernels to increase the channel dimension. In the up-scaling direction (when \(i>j\)), we use 1 × 1 convolutional kernels to adjust the channel dimension to that of the target branch and then perform nearest-neighbor upsampling to enlarge the feature map by a factor of \({2}^{i-j}\). When \(i=j\), the input and output are directly connected through a skip connection.
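The fusion rule above can be captured in a few lines; the sketch below is our reading of it in PyTorch. The DWConv stride and the explicit target channel count c_out are assumptions (the text does not state the stride, and each branch has its own width), and the function name is hypothetical.

```python
import torch.nn as nn

def fuse_branch(c_in, c_out, i, j):
    """Sketch of one input-branch-i -> output-branch-j path of the fusion layer."""
    if i < j:                                    # higher branch index = lower resolution
        k = 2 ** (j - i) + 1
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, kernel_size=k, stride=2 ** (j - i),
                      padding=k // 2, groups=c_in, bias=False),     # depth-wise downsample
            nn.Conv2d(c_in, c_out, kernel_size=1),                   # 1x1 channel expansion
        )
    if i > j:
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1),                   # 1x1 channel adjustment
            nn.Upsample(scale_factor=2 ** (i - j), mode="nearest"),  # nearest-neighbour upsampling
        )
    return nn.Identity()                          # i == j: plain skip connection
```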
Loss function: In cytopathology image segmentation, the vast majority of images contain only a few cells, while a small number contain a high density of cells (cells occupying over 20% of the image); both types of image are crucial for qualitative cancer diagnosis. To address this, we designed the Medical Focal Loss (MFL) as our loss function, which focuses on maximizing the accuracy of cell segmentation in pathological images. It uses asymmetry to selectively suppress the background and enhance the cell class while requiring only three hyperparameters.
The medical asymmetric Focal loss (\({L}_{maf}\)) is expressed in Eq. (8) [47]. In our binary classification task, compared with the original Focal loss, we simply multiply \({p}_{i}\) with the prediction value of the rare class to achieve selective amplification.
$${L}_{maf}=-\frac{\theta }{N}\sum_{i=1}^{N}{p}_{i}\,log\left({\widehat{M}}_{i,c}\right)-\frac{1-\theta }{N}\sum_{i=1}^{N}{\left(1-{\widehat{M}}_{i,b}\right)}^{\tau }log\left({\widehat{M}}_{i,b}\right)$$
(8)
\(\widehat{M}\) denotes the matrix of predicted values for the different categories. Its indices \(i\), \(c\), and \(b\) are, respectively, the row (pixel) index of the matrix, the index of the minority class (cell), and the index of the majority class (background). \({p}_{i}\) represents the pixel value of each point in the labeled image, and N represents the number of pixels in the image. \(\theta \in [0,1]\) is a custom parameter used to control the relative weight of positive and negative examples, and \(\tau \in [0,1]\) is a hyperparameter used to suppress the background class.
The medical asymmetric Focal Tversky loss (\({L}_{maFT}\)) is given by Eq. (9). In contrast to the original Focal Tversky loss, we apply the focal enhancement only to the minority class (cell) rather than to the majority class (background).
$${L}_{maFT}=\left(1-mT{I}_{b}\right)+{(1-mT{I}_{c})}^{1-\tau }$$
(9)
where \(mTI\) stands for the medical Tversky index; its representation is given by Eq. (10).
$$mTI=\frac{{\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{0i}{p}_{0i}}{{\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{0i}{p}_{0i}+\theta {\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{0i}{p}_{1i}+(1-\theta ){\sum }_{i=1}^{N}{\widehat{{\varvec{M}}}}_{1i}{p}_{0i}}$$
(10)
Here \({\widehat{M}}_{0i}\) denotes the predicted probability that pixel i belongs to the foreground and \({\widehat{M}}_{1i}\) the probability that it belongs to the background, while \({p}_{0i}\) and \({p}_{1i}\) encode the ground truth: \({p}_{0i}=1\) (and \({p}_{1i}=0\)) designates foreground, whereas \({p}_{0i}=0\) (and \({p}_{1i}=1\)) designates background.
The final asymmetric Medical Focal loss (MFL) can be represented by Eq. (11).
$${L}_{amF}=\mu {L}_{maf}+(1-\mu ){L}_{maFT}$$
(11)
where \(\mu \in [\mathrm{0,1}]\) is similarly a hyperparameter used to determine the weight of the two loss functions.
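A compact PyTorch sketch of our reading of Eqs. (8)–(11) for the binary cell/background case follows. The per-pixel weighting of the background term is an interpretation on our part (the published implementation may differ in detail), and all function and variable names are ours.

```python
import torch

def medical_focal_loss(prob_cell, target, theta=0.6, tau=0.7, mu=0.5, eps=1e-7):
    """Sketch of the Medical Focal Loss (Eqs. (8)-(11)) for binary segmentation.

    `prob_cell` is the predicted foreground (cell) probability per pixel and
    `target` is the binary ground-truth mask; theta, tau, mu follow the text.
    """
    prob_cell = prob_cell.clamp(eps, 1 - eps)
    prob_bg, target_bg = 1 - prob_cell, 1 - target

    # Eq. (8): plain cross-entropy on the rare (cell) class,
    # focal suppression of the background class.
    l_maf = -(theta * target * torch.log(prob_cell)
              + (1 - theta) * (1 - prob_bg) ** tau * target_bg * torch.log(prob_bg)).mean()

    # Eq. (10): Tversky index, weighting false positives/negatives by theta.
    def tversky(p, t):
        tp = (p * t).sum()
        fp = (p * (1 - t)).sum()
        fn = ((1 - p) * t).sum()
        return tp / (tp + theta * fp + (1 - theta) * fn + eps)

    # Eq. (9): focal enhancement applied only to the minority (cell) class.
    l_maft = (1 - tversky(prob_bg, target_bg)) + (1 - tversky(prob_cell, target)) ** (1 - tau)

    # Eq. (11): weighted combination of the two terms.
    return mu * l_maf + (1 - mu) * l_maft
```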

The branching structure of the HRMED_T

Figure 2 describes the final structure of the network in detail. Stage 3 of HRMED_T is executed twice to repeatedly fuse multi-scale features. Additionally, the third branch of Stage 3 contains 6 Transformer blocks to extract LR features. Since the computational complexity of ViT is proportional to the square of the image size, and the feature map of the third branch is only one-sixteenth of the original size while retaining a relatively large receptive field, adding more ViT blocks lets the network capture more global features without adding too many parameters. Furthermore, compared with the fourth branch, the receptive field of the third branch is not so large as to lose too much image detail. Therefore, in Stage 4 the third branch still has 6 Transformer blocks and the fourth branch has only 3. Finally, all feature maps are merged into the final output through a 1 × 1 Conv + BN + bilinear-interpolation upsampling head. The main steps of the HRMED_T algorithm are described in Algorithm 1.

Model compression module

Channel distribution distillation

We adopt knowledge distillation as the model compression method: a teacher model with many parameters and high computational cost teaches a student model with low memory usage and low cost. The HRMED_T model serves as the teacher and HRMED_S as the student; the two have similar architectures.
The model distillation algorithm we use maps the intermediate layer features of the HRMED_T as the learning object for the HRMED_S. The student and teacher networks' corresponding intermediate layer features are aligned and transformed into a probability distribution through an activation function. The difference between the student and teacher is estimated by the KL divergence (Fig. 5). In this representation, the HRMED_T is represented by \(A\), the HRMED_S by \(B\), and the intermediate layer features of the teacher and student after being transformed by an activation function into a probability distribution are represented by \({M}^{A}\) and \({M}^{B}\), respectively. The formula for knowledge distillation can be found in (12) [48].
$${L}_{amF}(\widehat{y},{M}^{B})+\alpha \cdot \Phi (\varphi ({M}^{A}),\varphi ({M}^{B}))$$
(12)
In this formula, \(\widehat{y}\) represents the true label of an image, \({L}_{amF}\) is the Medical Focal Loss, and \(\alpha \) is a hyperparameter that weights the distillation term relative to the supervised loss. For the intermediate feature maps, \(\alpha \) is set higher because the HRMED_T can better guide the HRMED_S with the features it has learned; decreasing \(\alpha \) at the final output layer helps the student model approach the real labels more closely. To train the student model with unlabeled data, one simply removes the \({L}_{amF}\) term and \(\alpha \).
The difference between the feature maps of the HRMED_T and HRMED_S can be seen in (13):
$$\Phi (\varphi ({M}^{A}),\varphi ({M}^{B}))=\Phi (\varphi ({M}_{c}^{A}),\varphi ({M}_{c}^{B}))$$
(13)
The channel \(c\in [1,C]\) represents the c-th channel of the feature map. The definition of \(\varphi (\cdot )\) can be seen in (14).
$$\varphi ({M}_{c})=\frac{exp(\frac{{M}_{c,i}}{T})}{{\sum }_{i=1}^{W\bullet H}exp(\frac{{M}_{c,i}}{T})}$$
(14)
\(T\) represents the distilled temperature. The probability becomes softer if a larger \(T\) is used, implying a wider spatial focus for each channel. If a mismatch occurs between the middle layer of the student and the teacher, a Conv with a kernel size of 1 × 1 is required to expand the student's channel number to match that of the teacher. The function \(\Phi (\cdot )\) is utilized to evaluate the distributional difference between two feature maps, here the KL divergence is employed [49], as seen in (15).
$$\Phi ({M}^{A},{M}^{B})=\frac{{T}^{2}}{C}\sum_{c=1}^{C}\sum_{i=1}^{W\bullet H}\varphi ({M}_{c,i}^{A})\cdot log\left[\frac{\varphi ({M}_{c,i}^{A})}{\varphi ({M}_{c,i}^{B})}\right]\\$$
(15)
Transferring knowledge from the teacher model to the student model through knowledge distillation has numerous advantages: the soft targets reduce the risk of overfitting in the student model; good results can be obtained with little data, even with few or zero labeled samples; and the teacher's guidance makes it possible to exploit an almost unlimited amount of unlabeled data to address sample scarcity. Additionally, the student model's smaller parameter count translates to easier deployment, higher real-time performance and efficiency, and reduced storage space. In our experiments, we set the temperature \(T=3\), \(\alpha =5\) for the logits map, and \(\alpha =50\) for the intermediate feature maps.
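A minimal PyTorch sketch of the channel-wise distillation term described by Eqs. (13)–(15) is given below. Function and variable names are ours, and the 1 × 1 convolution that matches the student's channel count to the teacher's is assumed to have already been applied.

```python
import torch
import torch.nn.functional as F

def channel_distillation_loss(feat_teacher, feat_student, T=3.0):
    """Sketch of the channel-wise distillation term (Eqs. (13)-(15)).

    Each channel's spatial activations are softened into a distribution with
    temperature T (Eq. (14)), and the student is pushed toward the teacher with
    a KL divergence scaled by T^2 / C (Eq. (15)).
    """
    B, C, H, W = feat_teacher.shape
    t = feat_teacher.reshape(B, C, H * W) / T
    s = feat_student.reshape(B, C, H * W) / T
    p_t = F.softmax(t, dim=-1)                                 # phi(M^A): per-channel spatial softmax
    log_p_s = F.log_softmax(s, dim=-1)                         # log phi(M^B)
    kl = (p_t * (torch.log(p_t + 1e-12) - log_p_s)).sum(-1)    # KL divergence per channel
    return (T ** 2 / C) * kl.sum(-1).mean()                    # sum over channels, mean over batch

# Overall objective of Eq. (12): supervised Medical Focal Loss on the student
# output plus the alpha-weighted distillation term on matched feature maps
# (alpha = 50 for intermediate feature maps, alpha = 5 for the output logits).
```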

HRMED_S model structure

The student model has a simple structure and does not use a Transformer. The details of each convolution are given in Fig. 6.
For the fusion module at the end of each stage, the lower-branch network adjusts the channel number through a 1 × 1 convolution kernel and a BN layer and then performs n-fold upsampling through bilinear interpolation; the specific magnification is indicated by the line colors in the figure. The upper-branch network downsamples through t 3 × 3 convolution kernels with a stride of 2, a BN layer, and a ReLU activation function; the value of t is likewise indicated by the line colors. Features at the same level are added directly through a skip connection.
At the end of the student model, we replaced the original fully connected layer with a Conditional Random Field (CRF) to obtain the final image prediction output. This operation facilitates the pure convolutional network to obtain more contextual information about the units and reduces the ambiguity of the edge segmentation of the densely stacked units, thus improving the segmentation performance to some extent [50]. Equation (16) represents the probability of each pixel of the final prediction:
$$ P\left( {y{\mid }x} \right) = \frac{1}{{Z\left( x \right)}}\exp \left( {\mathop \sum \limits_{{i \in U}} \mathop \sum \limits_{{j \in U\left( {x_{i} } \right)}} T_{{i,j}} \left( {y_{j} ,y_{i} ,x_{i} ,i} \right) + \mathop \sum \limits_{{i \in U}} S_{i} \left( {y_{i} ,x_{i} } \right)} \right) $$
(16)
\(X=\{{x}_{1},{x}_{2},\cdots {x}_{n}\}\) represents the features of each pixel in the original cytopathology image, and \(Y=\{{y}_{1},{y}_{2},\cdots {y}_{n}\}\) represents the features mapped from each pixel in the segmentation prediction map output by HRMED_S, including the image color, texture, predicted probabilities of surrounding points, and other features. \(U({x}_{i})\) denotes the pixels around point \({x}_{i}\). \({T}_{i,j}\) expresses the relationship between point \({x}_{i}\) and the features of the surrounding points, and \({S}_{i}\) expresses the features of point \({x}_{i}\) itself. \(Z(x)\) is the normalization (partition) function, whose expression is shown in (17) [51]:
$$Z(x)=\sum_{y\in Y}\exp \left(\sum_{i\in U}\sum_{j\in U({x}_{i})}{T}_{i,j}\left({y}_{j},{y}_{i},{x}_{i},i\right)+\sum_{i\in U}{S}_{i}\left({y}_{i},{x}_{i}\right)\right)$$
(17)
Similar to the teacher model, the input images in the student model first pass through a Conv-BN-ReLU module. This module reduces the image to a quarter of its original size while modestly increasing the number of channels, mainly to keep the image sizes consistent with the teacher model so that the student model can distill its information. Likewise, Stages 3–4 are executed twice, and finally the CRF produces the output.

Specific details of the model distillation process

From the above analysis, the teacher model HRMED_T (Fig. 2) and the student model HRMED_S (Fig. 6) have very similar structures; for example, the two models have the same number of branches and the same image size at each stage. HRMED_T uses the Vision Transformer structure, which captures global image information better and considers contextual relationships in semantic segmentation, whereas HRMED_S emphasizes light weight and lower computational complexity. To reduce the structural complexity of the student model while maintaining its predictive performance, we align each fusion layer of HRMED_T with the corresponding fusion layer of HRMED_S (marked with "+" in both architecture diagrams) to further enhance the student model's performance. At the same fusion position the image sizes are identical, but the teacher model usually has more channels, so HRMED_S uses a 1 × 1 convolution to increase its channel count before computing the difference with the teacher model; learning this type of soft target improves robustness and generalization.

Quantitative module for segmentation results

In this section, we have combined various indices used in the analysis of tumor cells to design a tool for quantifying the segmentation results of cells, allowing pathologists to quickly and quantitatively analyze and measure large amounts of pathological images without the need for computer vision or programming training. The tool includes:
(a) Providing the nucleus-cytoplasm ratio (the ratio of cell pixels to background pixels in the segmentation result) and cell count for each image, and allowing the pathological images to be sorted based on these two indices.
(b) For each segmented image, to better assist doctors in observing tissue constituent morphology, we have synthesized the segmentation outcome with the source image, marking the background as dark and marking the cells as bright.
(c) For a selected image, we reveal the shape and circular deviation of each cell by calculating its roundness (\(R=\frac{4\pi S}{{C}^{2}}\), where S and C are the area and perimeter of the cell, respectively), aspect ratio (\(AR=\frac{{L}_{l}}{{L}_{s}}\), where \({L}_{l}\) and \({L}_{s}\) are the lengths of the long and short axes of the cell, respectively), and eccentricity (\(e=\frac{c}{a}\), where c is the focal distance of the ellipse formed by the cell's long and short axes and a is the length of its semi-major axis). Based on thresholds set by experts, the calculated results are displayed in a graded format, highlighting recognition results with unusual shapes and thereby revealing the heterogeneity and polymorphism of the cells.
(d) To facilitate comparative analysis, ROI localization, and surgical plan customization for physicians, this tool offers the ability to select, crop, rotate, and resize cells in the segmentation result, while also enabling easy viewing of selected cell area, perimeter, and the corresponding image position in the original pathological image.
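To show how these indices can be derived from a binary segmentation mask, the sketch below uses scikit-image region properties. It is an illustration only (the tool itself is not necessarily built on scikit-image), and the function name is hypothetical.

```python
import numpy as np
from skimage import measure

def quantify_cells(mask):
    """Sketch of the per-cell indices provided by the quantification tool.

    `mask` is a binary segmentation result (1 = cell, 0 = background).
    """
    labeled = measure.label(mask > 0)
    cells = []
    for region in measure.regionprops(labeled):
        S, C = region.area, region.perimeter
        cells.append({
            "area": S,
            "perimeter": C,
            "roundness": 4 * np.pi * S / (C ** 2) if C > 0 else 0.0,     # R = 4*pi*S / C^2
            "aspect_ratio": (region.major_axis_length / region.minor_axis_length
                             if region.minor_axis_length > 0 else float("inf")),  # AR = L_l / L_s
            "eccentricity": region.eccentricity,                          # e = c / a of fitted ellipse
        })
    # image-level indices: cell count and nucleus-to-background pixel ratio
    ncr = float((mask > 0).sum()) / max(float((mask == 0).sum()), 1.0)
    return {"cell_count": len(cells), "nucleus_cytoplasm_ratio": ncr, "cells": cells}
```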

Experiments and results

Experimental environment and dataset introduction

Dataset: The data we used was sourced from the Research Center for Artificial Intelligence, Monash University, comprising a total of 1000 pathology images [52]. At 40 × magnification we cropped random areas, extracting 10 sub-images of size 512 × 512 from each pathology image, for a total of 10,000 images. After screening, 2164 images were available for training.
Contrasting Models: The models used for comparison are U-Net [53], UNet++ [54], DeepLabv3+ [55], Attention U-Net [56], SETR [57], Swin-Unet [58], and CSWin Transformer [59].
Evaluation Metrics: The confusion matrix is a commonly used basis for evaluating medical image segmentation [60]. The network's prediction accuracy for segmented regions is measured by the F1-score (F1), precision (Pre), recall (Re), Dice Similarity Coefficient (DSC), and Intersection over Union (IoU), all derived from the confusion matrix [61]. Recall gives the percentage of the true region that is correctly segmented, and precision gives the percentage of the predicted region that is actually correct. IoU divides the area of overlap between the labels and the predicted segmentation by the area of their union, and DSC computes the similarity between the true and predicted regions. The same metrics are also used to evaluate cell segmentation performance [62]. We work to improve the model's metrics in digital pathology image segmentation to achieve precise cell nucleus prediction. Furthermore, to compare the computational cost of the lightweight model obtained through knowledge distillation with that of the original model, we use floating-point operations (FLOPs) and the number of parameters (Params) to measure model size [63]. FLOPs denotes the number of floating-point operations, which measures the computational volume of a deep learning model and allows the computational complexity of the inference phase to be assessed. Params refers to the number of model parameters, a measure of the model's complexity and expressiveness [64].
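For clarity, the sketch below computes these confusion-matrix-based metrics for a binary mask; names are ours, and for binary segmentation the DSC coincides with the F1-score derived from precision and recall.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Sketch of the confusion-matrix-based metrics used in the evaluation."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()          # true positives
    fp = np.logical_and(pred, ~gt).sum()         # false positives
    fn = np.logical_and(~pred, gt).sum()         # false negatives
    eps = 1e-7
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    iou = tp / (tp + fp + fn + eps)              # overlap area / union area
    dice = 2 * tp / (2 * tp + fp + fn + eps)     # similarity of prediction and label
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": iou, "DSC": dice, "Re": recall, "Pr": precision, "F1": f1}
```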
Hyperparameter Settings: In all the following experiments, we trained each model for 300 epochs with a batch size of 4. During training, we used the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.9, and a decay rate of 0.0005.
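As a reference point, the stated configuration corresponds to a standard PyTorch SGD setup such as the sketch below; the placeholder module stands in for HRMED_T or HRMED_S, and reading the decay rate as weight decay is our assumption.

```python
import torch
import torch.nn as nn

# Placeholder network; in practice this would be HRMED_T or HRMED_S.
model = nn.Conv2d(3, 2, kernel_size=1)

# SGD with the hyperparameters stated above ("decay rate" read here as weight decay).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
```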

Result and discussion

In Fig. 7, we show a set of data after random color enhancement. We omit the results of the first four basic augmentations and focus on the designed color enhancement effects. As the figure shows, after multiple color enhancement operations we closely simulate the noise caused by staining differences in pathology images.
Next, we conducted comprehensive comparative experiments, under the same initial conditions, between our segmentation models HRMED_T and HRMED_S and other commonly used medical image segmentation models or highly representative semantic segmentation models. Table 3 presents the complete results. Under default parameters, HRMED_T achieves excellent segmentation with an IoU 1.6% higher than the best-performing baseline, CSWin Transformer. Under the default conditions, however, the student model HRMED_S does not perform outstandingly; its segmentation results are similar to those of U-Net. It is also worth noting that Swin-Unet, a ViT model with small parameter and computational cost that performs well on abdominal organ segmentation, does not perform satisfactorily on pathology image segmentation; its performance is even inferior to UNet++. This directly indicates that applying segmentation models from other fields to pathology images may not yield good results. In contrast, our student model HRMED_S, designed specifically for pathology images with a pure CNN architecture, achieves better results than Swin-Unet despite lower FLOPs and Params, owing to its ability to maintain high-resolution representations throughout.
Table 3
The prediction performance of each model for cell nuclei in digital pathology images of osteosarcoma

Model | IoU | DSC | Re | Pr | F1 | FLOPs | Params
U-Net | 0.639 | 0.773 | 0.789 | 0.803 | 0.773 | 110.02 G | 9.16 M
UNet++ | 0.662 | 0.791 | 0.815 | 0.800 | 0.791 | 277.26 G | 54.71 M
DeepLabv3+ | 0.655 | 0.787 | 0.844 | 0.762 | 0.751 | 667.36 G | 34.88 M
Attention U-Net | 0.656 | 0.786 | 0.786 | 0.800 | 0.815 | 533.08 G | 86.21 M
SETR | 0.595 | 0.739 | 0.790 | 0.718 | 0.739 | 397.21 G | 27.17 M
Swin-Unet | 0.605 | 0.748 | 0.901 | 0.659 | 0.748 | 11.74 G | 52.0 M
CSWin Transformer | 0.657 | 0.893 | 0.894 | 0.862 | 0.893 | 230.69 G | 7.77 M
Our (HRMED_T) | 0.673 | 0.895 | 0.930 | 0.896 | 0.895 | 232.49 G | 55.7 M
Our (HRMED_T + MFL) | 0.681 | 0.898 | 0.926 | 0.875 | 0.898 | 232.49 G | 55.7 M
Our (HRMED_T + MFL + DA) | 0.756 | 0.911 | 0.911 | 0.912 | 0.911 | 232.49 G | 55.7 M
Our (HRMED_S) | 0.599 | 0.864 | 0.905 | 0.812 | 0.864 | 10.33 G | 3.99 M
Our (HRMED_S + MFL) | 0.615 | 0.879 | 0.876 | 0.846 | 0.879 | 10.33 G | 3.99 M
Our (HRMED_S + MFL + DA) | 0.642 | 0.890 | 0.919 | 0.892 | 0.890 | 10.33 G | 3.99 M
Our (HRMED_S + MFL + DA + KD) | 0.710 | 0.910 | 0.903 | 0.916 | 0.910 | 10.33 G | 3.99 M
Our (HRMED_S + MFL + DA + KD + CRF) | 0.710 | 0.911 | 0.897 | 0.932 | 0.915 | 10.33 G | 3.99 M
Table 3 also shows that the Medical Focal Loss increases the key IoU indicator by 0.8% and 1.6% for HRMED_T and HRMED_S, respectively, while the data augmentation pipeline brings further improvements of 7.5% and 2.7%. Furthermore, model distillation provides a remarkable boost to the student model, bringing its performance nearly in line with that of the teacher model, primarily because the student model we designed is highly similar to the teacher and undergoes multi-layer channel distillation. With further parameter tuning, the final model's IoU, DSC, Pr, and F1 improve by 11.1%, 4.7%, 12.0%, and 5.1%, respectively, compared with the original HRMED_S. Although CRF, as a post-processing step, does not greatly improve the overall metrics, in practice it connects the segmentation to the image context and makes the edge segmentation of stacked cells in pathological images clearer. The FLOPs of the HRMED_T model are only 232.49 G, and those of the HRMED_S model only 10.33 G. Compared with Swin-Unet, which requires only 11.74 G FLOPs, HRMED_T needs roughly 20 times more computation, but HRMED_S has computational complexity similar to Swin-Unet while delivering higher prediction performance, with IoU and DSC both exceeding those of Swin-Unet.
In Fig. 8, we present the experimental results more intuitively using line graphs and bar charts. Panel (a) shows that data augmentation yields the most significant improvement in the performance of the HRMED_T model; we attribute this to the data pipeline substantially enriching the scarce dataset and the color augmentation amplifying the staining noise across pathological images, which lets the deep learning network better learn the dataset's features. In panel (b), the distilled model's indicators are almost all in the lead. Panel (c) compares the key indicators of several models with significant differences, and our two models lead in nearly all of them. Panel (d) highlights the large gap in parameters and computation between the distilled HRMED_S model and the other models, underlining its lightweight character: the HRMED_S model has the lowest number of parameters, only 3.99 M, indicating relatively low hardware requirements, and the lowest computational cost, only 10.33 G FLOPs, indicating faster processing speed.
Figure 9 illustrates the segmentation performance of different models on several representative osteosarcoma pathology images under default training conditions. Our HRMED_T model yields the most reliable segmentation results. Although HRMED_T does not show a clear advantage on simple segmentation tasks such as (d) and (e), it excels on images with densely packed cells, such as (a) and (b), where it delineates the boundaries of stacked cells more clearly. Moreover, when cells are unevenly distributed, vary in size, or have irregular shapes, as in (c) and (f), HRMED_T demonstrates good robustness and generalization.
Figure 10 shows the prediction results of the optimized models on four of the images from Fig. 9. For simple segmentation tasks, HRMED_T and HRMED_S perform very similarly and both effectively alleviate over-segmentation and under-segmentation. For complex boundaries, however, such as those in the first image, the student model's segmentation at locations where cells are stacked is still insufficient. Even with CRF post-processing to incorporate contextual information, the CNN-based student remains less sensitive to spatial information than the ViT-based teacher. On the other hand, the HRMED_S model has lower computational complexity and trains faster, making it easier to deploy in regions with insufficient medical resources.
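To make the CRF post-processing step concrete, the following minimal sketch refines a two-class softmax map with a fully connected CRF. It assumes the open-source pydensecrf package, and the kernel parameters are illustrative defaults rather than the exact values used in our system.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image: np.ndarray, probs: np.ndarray, n_iters: int = 5) -> np.ndarray:
    """Refine a (2, H, W) softmax map with a dense CRF over an (H, W, 3) uint8 RGB image."""
    probs = np.clip(probs, 1e-8, 1.0).astype(np.float32)     # avoid log(0) in the unary term
    n_labels, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_labels)
    d.setUnaryEnergy(unary_from_softmax(probs))               # -log(p) unary potentials
    d.addPairwiseGaussian(sxy=3, compat=3)                    # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=10,                   # appearance kernel on the raw image
                           rgbim=np.ascontiguousarray(image), compat=5)
    q = d.inference(n_iters)
    return np.argmax(np.array(q).reshape(n_labels, h, w), axis=0)
```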

Ablation experiments

Hyperparameters such as the number of layers and the attention window size affect the segmentation performance of the model, so we performed ablation experiments to verify the sensitivity of the results.
First, we examined how the attention window size \(s\) in the high-precision HRMED_T model affects performance. Starting from the original model, we varied the window size and observed the change in model performance, as shown in Table 4, with \(s\) increasing from 1 to 9. A larger window makes the model more computationally expensive, but performance does not simply increase with window size: when \(s\le 3\), the IoU increases as the window grows; when \(s>3\), the IoU gradually decreases. This suggests that a suitable window size is more conducive to capturing detailed features. Weighing computational cost against segmentation accuracy, we chose \(s=3\), at which the model achieved the best results.
Table 4
The impact of different window sizes on the Medical Attention structure

Window size \(s\) | 1 | 3 | 5 | 7 | 9
FLOPs (G) | 231.99 | 232.49 | 232.8 | 233.78 | 234.93
mIoU | 0.709 | 0.716 | 0.713 | 0.712 | 0.712
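To make explicit how the window size \(s\) enters the attention computation, the sketch below implements a generic window-restricted multi-head self-attention in PyTorch. It is a simplified stand-in rather than the exact Medical Attention structure, and the class and argument names are our own.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to non-overlapping s x s windows (illustrative)."""
    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.s = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by the window size s
        B, H, W, C = x.shape
        s, h = self.s, self.num_heads
        # Partition the feature map into (B * num_windows, s*s, C) windows.
        x = x.view(B, H // s, s, W // s, s, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, s * s, C)
        qkv = self.qkv(x).reshape(-1, s * s, 3, h, C // h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale      # attention only within each window
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, s * s, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // s, W // s, s, s, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)
```

A larger \(s\) enlarges the per-window attention matrix (of size \(s^2 \times s^2\)), which is why the FLOPs in Table 4 grow with the window size.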
Next, we evaluated the impact of the three MFL parameters on the algorithm. As shown in Fig. 11, we experimented with these parameters in both the HRMED_T and HRMED_S models. In Fig. 11a, we first fixed \(\mu =0.5\) and \(\theta =0.5\) and varied \(\tau \in [0.1, 0.9]\) to analyze how the IoU of the HRMED_T model changes with \(\tau \). We then fixed \(\mu =0.5\) and \(\tau =0.5\) and varied \(\theta \in [0.1, 0.9]\) to obtain the variation of IoU with \(\theta \). Finally, with \(\theta =0.5\) and \(\tau =0.5\), we varied \(\mu \in [0.1, 0.9]\) to analyze the relationship between IoU and \(\mu \). To analyze the sensitivity of the model to different loss functions, we also compared common losses (Cross-entropy, Dice, Focal, and Tversky) with MFL. For the HRMED_T model, the best performance is obtained with \(\tau \), \(\theta \), and \(\mu \) set to 0.7, 0.6, and 0.5, respectively. Any value of \(\mu \in [0.1, 0.9]\) yields results that are almost always better than the best-performing baseline, the Tversky loss, and for fixed \(\tau \) and \(\theta \), the best result of the HRMED_T model improves the IoU by 0.6% over the Tversky loss. Figure 11b shows the performance of the HRMED_S model for different parameter values. As with the HRMED_T model, we analyzed the variation of IoU with each parameter while fixing the other two. The HRMED_S model performs best with \(\tau \), \(\theta \), and \(\mu \) set to 0.6, 0.3, and 0.4, respectively. Again, values of \(\mu \in [0.1, 0.9]\) almost always outperform the best-performing Tversky loss, the model outperforms the Focal loss for any \(\tau \), \(\theta \), and \(\mu \), and for fixed \(\tau \) and \(\theta \) the best result of the HRMED_S model improves the IoU by 1% over the Tversky loss. Based on this analysis, the overall performance of the algorithm is best when \(\mu =0.5\), \(\theta =0.6\), and \(\tau =0.7\), so we set \(\mu =0.5\), \(\theta =0.6\), and \(\tau =0.7\) in this experiment.
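For reference, the Tversky loss that serves as the strongest baseline above has the standard form sketched below in PyTorch, with the usual weights for false positives and false negatives. The definition of MFL itself and its parameters \(\tau \), \(\theta \), \(\mu \) is given earlier in the paper and is not reproduced here.

```python
import torch

def tversky_loss(probs: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.5, beta: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    """Standard Tversky loss for binary segmentation.

    probs  -- predicted foreground probabilities, shape (N, H, W)
    target -- binary ground-truth masks (float 0/1), shape (N, H, W)
    alpha / beta weight false positives / false negatives; alpha = beta = 0.5 recovers the Dice loss.
    """
    tp = (probs * target).sum(dim=(1, 2))
    fp = (probs * (1 - target)).sum(dim=(1, 2))
    fn = ((1 - probs) * target).sum(dim=(1, 2))
    tversky_index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1 - tversky_index).mean()
```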
In addition, the distillation temperature \(T\) and the weight \(\alpha \) of the output-layer loss have a large influence on the algorithm, especially for the HRMED_S model, where the value of \(T\) determines the effectiveness of knowledge distillation. As shown in Fig. 12, we further tuned \(\alpha \) and \(T\) on HRMED_S. The figure shows that softer probability maps convey knowledge better regardless of the value of \(\alpha \), but too high a temperature flattens the distribution towards uniform and degrades performance. Increasing the weight \(\alpha \) of the soft target relative to the hard target also improves segmentation performance, whereas an \(\alpha \) that is either too large or too small lowers the IoU. When \(\alpha =5\), the overall IoU is at a high level, and it reaches its optimum at \(T=3\), where the model gives the best predictions.
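To make the roles of \(T\) and \(\alpha \) concrete, the sketch below combines a hard-target segmentation loss with a temperature-softened, channel-wise KL term in the spirit of Channel-wise Knowledge Distillation. The additive weighting and the function names are illustrative assumptions, not the exact formulation used in HRMED_S.

```python
import torch
import torch.nn.functional as F

def channel_wise_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                         T: float = 3.0) -> torch.Tensor:
    """Channel-wise distillation: a per-channel softmax over spatial locations,
    softened by temperature T and matched with a KL divergence scaled by T**2."""
    n, c, h, w = student_logits.shape
    s = F.log_softmax(student_logits.view(n, c, h * w) / T, dim=-1)
    t = F.softmax(teacher_logits.view(n, c, h * w) / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T ** 2)

def distillation_objective(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                           hard_loss: torch.Tensor, alpha: float = 5.0, T: float = 3.0) -> torch.Tensor:
    """Combine the hard-target segmentation loss with the soft channel-wise term,
    weighted by alpha (an assumed additive weighting for illustration)."""
    return hard_loss + alpha * channel_wise_kd_loss(student_logits, teacher_logits, T)
```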
Finally, we analyzed how the number of convolutional layers in each branch of the student model HRMED_S at each stage affects performance. More convolutional layers yield a higher IoU, but also a higher computational cost. For example, increasing the number of convolutions per module from 2 to 5 raises the IoU by only 0.0084 while increasing the FLOPs by about 80%. Balancing performance against computational complexity, this study sets the number of convolutions per module to 2 CNN layers (Table 5).
Table 5
Comparison of the number of CNN layers in each sub-module of the HRMED_S model

Number of CNN layers | 1 | 2 | 3 | 4 | 5
FLOPs (G) | 6.66 | 10.33 | 12.65 | 15.64 | 18.64
IoU | 0.691 | 0.710 | 0.709 | 0.714 | 0.718

Conclusion

This article presents an intelligent assistive diagnosis system for malignant tumor pathology images based on a Multi-Scale High-Resolution Vision Transformer. The system comprises four modules: image enhancement, model segmentation, knowledge distillation, and quantitative measurement. It aims to reduce the workload and difficulty for pathologists while achieving good predictive results at limited computational cost, and it offers flexibility for medical institutions operating under different conditions. Experiments using osteosarcoma pathology images as an example show that our system achieves accurate segmentation of malignant tumor pathology images.
In the future, we will continue to improve the accuracy of our segmentation models, sharpen the delineation of stacked cell edges, and make our models leaner. We will also work to address the small-sample issue by exploring semi-supervised or unsupervised learning, and investigate various noise reduction algorithms for data processing. For the auxiliary tools, we aspire to use clustering methods to classify complex cell borders rather than relying solely on mathematical parameters. Finally, we will integrate clinical practice to continuously refine our intelligent diagnostic assistance system.

Acknowledgements

Availability of data and materials: all data analyzed during the current study are included in the submission.

Declarations

Conflict of interest

The authors declare no conflict of interest.
Not applicable.

Institutional review board

Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
