
Open Access 17.04.2023 | Original Article

A stereo spatial decoupling network for medical image classification

Authors: Hongfeng You, Long Yu, Shengwei Tian, Weiwei Cai

Published in: Complex & Intelligent Systems | Issue 5/2023


Abstract

Deep convolutional neural networks (CNNs) have made great progress in medical image classification. However, they struggle to establish effective spatial associations and tend to extract similar low-level features, resulting in information redundancy. To address these limitations, we propose a stereo spatial decoupling network (TSDNets), which can leverage the multi-dimensional spatial details of medical images. We use an attention mechanism to progressively extract the most discriminative features from three directions: horizontal, vertical, and depth. Moreover, a cross feature screening strategy is used to divide the original feature maps into three levels: important, secondary and redundant. Specifically, we design a cross feature screening module (CFSM) and a semantic guided decoupling module (SGDM) to model multi-dimensional spatial relationships, thereby enhancing the feature representation capabilities. Extensive experiments conducted on multiple open source baseline datasets demonstrate that our TSDNets outperforms previous state-of-the-art models.

Introduction

Deep learning has long been a top research priority in the field of medical image analysis, as it effectively relieves the pressure on medical experts. In recent years, convolutional neural networks (CNNs) have been widely used in many real-world medical image analysis scenarios, such as skin cancer image classification and X-ray classification, due to their outstanding performance. Recently, variants of the CNN model (e.g., DCNN, ResNet, DenseNet, and multi-scale CNN [10, 18, 25]) have been attracting increasing attention for capturing and exploiting high-level discriminative image features. To better optimize the learning ability of the convolutional neural network, Khan et al. [9] fused the features of AlexNet and VGG16 in parallel while jointly optimizing them to obtain the optimal features. To better detect breast cancer lesion areas, Irfan et al. [8] used DenseNet201 as the base model, fused 24 sets of convolutional feature vectors to obtain diverse transferred semantic information, and evaluated the proposed method through 10-fold cross validation; the accuracy on breast cancer reached 98.9\(\%\), which demonstrates the feasibility of feature fusion. Although these methods have powerful feature representation capabilities, their efficiency and prediction accuracy are hindered by the large number of redundant features generated during model learning, and they are usually over-parameterized and computationally expensive. Figure 1 shows the parameters and running time of some representative models. Furthermore, the attention mechanism has become an important research hotspot for medical image segmentation tasks, and the concept of feature screening also plays a positive role in other fields [7, 29].
Many existing studies [26, 30] have shown that the attention mechanism can effectively enhance the representational ability of key features in the feature map by ignoring irrelevant redundant information. However, for medical images, a specific disease object often appears in different imaging directions. When the pixel intensity of similar target objects changes only slightly, it is difficult for an attention mechanism to distinguish the differences between pixels from a single direction. Therefore, it is necessary to explore the features from different directions. Some studies [4, 6, 16] have shown that attention can automatically extract appropriate features from two directions, thereby better capturing the dependencies between features. However, they ignore the problem of redundant features, which may prevent the model from learning useful information and increase the difficulty of training.
In previous studies [21], shallow features are directly introduced into the deep features. Although introducing shallow features can bring more semantic information to the deep features, introducing too much weakly correlated information tends to reduce the quality of the deep feature maps. Therefore, a reasonable selection of feature points can not only improve the sparsity of feature maps, but also avoid feature reuse and improve model efficiency.
In this paper, we propose a stereo spatial decoupling network (TSDNets) for medical image classification that utilizes three sets of attention mechanisms to assign three sets of weights to each feature point. Moreover, we adopt a feature screening strategy to suppress redundant features and build gating strategies to improve the quality of features.
  • We propose a stereo spatial decoupling network (TSDNets) to explore the spatial guidance relationships of the object from the horizontal, vertical, and depth directions of the medical image. By screening features from multiple perspectives, the attention used in this paper is more accurate than a traditional one-way attention mechanism. At the same time, the attention mechanism in this paper serves to filter features rather than to fuse them directly, so compared with traditional attention mechanisms [27, 33], it requires fewer parameters and operations.
  • We develop a cross-feature screening module (CFSM) that uses a dual-gate threshold screening strategy to generate three types of features, namely important, secondary, and redundant features, and fuses each type accordingly in the deep feature fusion.
  • We construct a semantic guided decoupling module (SGDM), which implements feature selection by setting different gate thresholds for shallow and deep features respectively, thereby extracting more discriminative features.
Related work

In recent years, deep learning has received significant research attention from academia and industry. A widely used method in medical image classification is the convolutional neural network (CNN) [2, 12]. However, traditional medical image datasets are usually small and complex, especially when specific targets or regions in the image are similar to the surrounding background. Directly applying a traditional CNN to medical image classification may therefore fail to establish effective spatial associations between features or similar images. To improve the feature representation, various solutions [4, 6, 21] have recently been proposed to force the CNN to focus on specific regions, relying on domain knowledge to obtain information [12, 20]. For example, in [17, 32], the authors obtain important relevant area information according to the gray values of the medical image. However, in most cases, domain knowledge is limited (only suitable for specific tasks). Moreover, to describe the targets at multiple levels with rich feature representations, [12, 20] designed multi-scale feature extractors that describe the classification objects from multiple aspects using convolution kernels of different scales, and then used a simple linear fusion method to represent the features. Although these methods demonstrate the effectiveness of multi-scale features in medical image classification, the extracted redundant information can easily weaken the expression of key features when multi-scale information flows are transferred between layers.
To reduce irrelevant redundant information, the attention mechanism has been widely used in medical image classification, as it can suppress redundant features through the weights it generates. There have been several attempts to improve the feature representation ability of models using local or global attention mechanisms [6, 13, 26, 30, 35], which allow the network to focus on the most important features or areas, thereby suppressing redundant features. However, the disease regions in a medical image are often similar to the surrounding background, which makes it difficult for a single-direction attention mechanism to describe the key regions and establish effective spatial relationships. To address this limitation, various attention-based methods [6, 16] have been proposed to locate key features and better establish the relationships between features. Although these methods [24, 28] can improve the quality of feature maps, they ignore the relationship between spatial semantic information and feature-map semantic information, and lack an in-depth analysis of redundant information. In this paper, we reconstruct the connections between redundant features to assist in establishing important semantics and to enhance spatial semantic information.
In medical image analysis, introducing shallow features into high-order feature maps is an important fusion method, where shallow features can assist higher-order features in capturing finer semantic information. However, most existing studies ignore the influence of redundant features on key features, which weakens the representation ability of key features. Feature selection methods [15, 19] have been widely used in many areas and can effectively improve model performance. Simple linear or filtering methods can also be used to filter features [3, 11, 15], such as pointwise mutual information (PMI), PCA, and Gaussian filtering; these methods are mainly applied in image pre-processing. In addition, Dropout [31, 34] and regularization techniques have been used to prevent the neural network from falling into a local optimum. Moreover, although attention mechanisms can suppress redundant features in the feature selection process, most attention-based methods do not directly discard these feature points, which weakens the representation ability of key features. In this paper, we propose a feature selection scheme based on spatial decoupling and cross-feature screening to minimize the influence of redundant information on key features.

TSDNets

The proposed TSDNets primarily consists of a cross-feature screening module (CFSM) and a semantic guided decoupling module (SGDM). TSDNets first captures the shallow features of the objects through convolutional operations, and then uses three different attention mechanisms to decompose the shallow features into three corresponding weight matrices \(a_H\), \(a_V\), and \(a_D\) along the horizontal, vertical, and depth directions. Three new matrices \(a_{HV}\), \(a_{HD}\), and \(a_{VD}\) are generated by combining the three weight matrices \(a_H\), \(a_V\), and \(a_D\) in pairs. Based on a dual-gating threshold, CFSM divides each of the weight matrices \(a_{HV}\), \(a_{HD}\), and \(a_{VD}\) into three levels: important, secondary and redundant. Then, three new feature matrices (\(f_1\), \(f_2\), and \(f_3\)) are obtained by fusing the feature information of the corresponding levels. Meanwhile, the SGDM extracts the optimal shallow global semantic features \(f_{cg}\) from \(f_1\), as well as the optimal deep local semantic features \(f'_{cg}\) from the secondary features \(f_2\). Considering that redundant features contain only a small amount of useful semantic information, we compress the redundant features into a one-dimensional dense vector to reconstruct the relationships between them. The overall flow of the proposed algorithm is shown in Fig. 2.
Stereo decoupling attention mechanism

In this part, we propose a stereo decoupling attention mechanism, which extracts the key features by assigning a weight to each feature point and decomposes the shallow features along the horizontal, vertical, and depth directions. As a result, we obtain the horizontal attention weight matrix \(a_H\), the vertical attention weight matrix \(a_V\), and the depth attention weight matrix \(a_D\). The horizontal attention weight matrix \(a_H\) can be described by
$$\begin{aligned} {\left\{ \begin{array}{ll} \displaystyle a_{i}=\sum _{j=1}^{n} \frac{\exp \left( e_{i, j}\right) }{\sum _{k=1}^{n} \exp \left( e_{i, k}\right) } h_{j} \\ \displaystyle a_{H}=\left\{ a_{1}, \ldots , a_{n-1}, a_{n}\right\} \end{array}\right. } \end{aligned}$$
(1)
where \(e_{i,j}\) is the weight coefficient assigned by the attention mechanism; \(h_j\) represents the j-th hidden-layer feature; and \(a_i\) represents the horizontal attention weight coefficient of the i-th feature. Similarly, we can obtain \(a_V\) and \(a_D\).
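To make the decomposition concrete, the following is a minimal NumPy sketch of the three directional attention weights. The learned scores \(e_{i,j}\) of Eq. (1) are replaced here by a simple mean-activation score, and the 5 \(\times \) 5 \(\times \) 8 feature map is a toy example, so this is an illustration of the idea rather than the exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def directional_attention(f, axis):
    """Produce an attention weight vector along one axis of a (H, W, C) map.

    axis: 0 = horizontal, 1 = vertical, 2 = depth (channel).
    The learned coefficients e_{i,j} of Eq. (1) are approximated here by the
    mean activation of each slice, purely for illustration.
    """
    scores = f.mean(axis=tuple(a for a in range(3) if a != axis))
    weights = softmax(scores)              # one weight per slice along `axis`
    shape = [1, 1, 1]
    shape[axis] = -1
    return weights.reshape(shape)          # broadcastable back onto f

f_x = np.random.rand(5, 5, 8)              # toy 5 x 5 x 8 shallow feature map
a_H, a_V, a_D = (directional_attention(f_x, ax) for ax in range(3))
```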

Cross feature screening module (CFSM)

The cross-feature screening module (CFSM) is mainly divided into two parts: dual-gate threshold screening and feature aggregation. In the dual-gate threshold screening, the three weight matrices \(a_H\), \(a_V\), and \(a_D\) above are divided into three levels, important, secondary, and redundant, according to two thresholds \(T_1\) and \(T_2\). We first obtain three new weight matrices \(a_{HV}\), \(a_{HD}\), and \(a_{VD}\) by recombining \(a_H\), \(a_V\), and \(a_D\) in pairs. Then, we average these weight matrices in pairs to obtain three mean matrices \(\frac{a_{HV}+a_{HD}}{2}\), \(\frac{a_{HD}+a_{VD}}{2}\), and \(\frac{a_{VD}+a_{HV}}{2}\), and each weight value in the three mean matrices is compared with \(T_1\) and \(T_2\); grading the feature points on the mean of two groups of weights makes the grading more stable. If the value is greater than \(T_1\), feature i contains important semantic information. Analogously, if the value is less than \(T_2\), feature i contains only redundant information. Next, we divide the weight matrix \(a_{HV}\) into three 0-1 matrices (where 0 means eliminating feature i and 1 means retaining feature i), namely the important matrix \({\hat{a}}_{HV_{1}}\), the secondary matrix \({\hat{a}}_{HV_{2}}\) and the redundant matrix \({\hat{a}}_{HV_{3}}\).
$$\begin{aligned} {\left\{ \begin{array}{ll} \displaystyle {\widehat{a}}_{H V_{1}}^{i}=\left( \varsigma _{H V}^{i}(1), \frac{a_{HV}+a_{HD}}{2}>T_{1} \Vert \varsigma _{H V}^{i}(0), \frac{a_{HV}+a_{HD}}{2}<T_{1}\right) \\ \displaystyle {\widehat{a}}_{H V_{2}}^{i}=\left( \varsigma _{H V}^{i}(1), T_{2}<\frac{a_{HV}+a_{HD}}{2}<T_{1} \Vert \varsigma _{H V}^{i}(0), \right. \\ \displaystyle \qquad \qquad \left. \left\{ \frac{a_{HV}+a_{HD}}{2}<T_2, T_1<\frac{a_{HV}+a_{HD}}{2} \right\} \right) \\ \displaystyle {\widehat{a}}_{H V_{3}}^{i}= \left( \varsigma _{H V}^{i}(0), \frac{a_{HV}+a_{HD}}{2}>T_{2} \Vert \varsigma _{H V}^{i}(1), \frac{a_{HV}+a_{HD}}{2}<T_{2}\right) \end{array}\right. } \end{aligned}$$
(2)
Analogously, we can obtain the corresponding reconstructed 0-1 matrices \({{\widehat{a}}_{HD_1}}\), \({{\widehat{a}}_{HD_2}}\), \({{\widehat{a}}_{HD_3}}\), \({{\widehat{a}}_{VD_1}}\), \({{\widehat{a}}_{VD_2}}\), and \({{\widehat{a}}_{VD_3}}\). Then, we multiply the feature matrix by the three sets of 0-1 matrices to obtain three new sets of feature matrices, \(f_{H V_{1}}\), \(f_{H V_{2}}\), and \(f_{H V_{3}}\).
$$\begin{aligned} {\left\{ \begin{array}{ll} f_{H V_{1}}=\varphi ({\widehat{a}}_{H V_{1}} \cdot f_{x})\\ f_{H V_{2}}=\varphi ({\widehat{a}}_{H V_{2}} \cdot f_{x}) \\ f_{H V_{3}}=\varphi ({\widehat{a}}_{H V_{3}} \cdot f_{x}) \end{array}\right. } \end{aligned}$$
(3)
For \(a_{HD}\) and \(a_{VD}\), we can obtain the corresponding feature matrices \(f_{H D_{1}}\), \(f_{H D_{2}}\), \(f_{H D_{3}}\), \(f_{V D_{1}}\), \(f_{V D_{2}}\), and \(f_{V D_{3}}\). Next, we compute the middle-level features \(f_1\), \(f_2\), and \(f_3\) as follows.
$$\begin{aligned} {\left\{ \begin{array}{ll} f_{1}={\text {Cat}}(f_{H V_{1}}, f_{H D_{1}}, f_{V D_{1}}) \\ f_{2}={\text {Cat}}(f_{H V_{2}}, f_{H D_{2}}, f_{VD_{2}}) \\ f_{3}={\text {Cat}}(f_{H V_{3}}, f_{H D_{3}}, f_{V D_{3}}) \end{array}\right. } \end{aligned}$$
(4)
where \({\text {Cat}}\) represents the concatenate operation.
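A minimal sketch of the dual-gate screening and aggregation of Eqs. (2)–(4) is shown below, continuing the NumPy example above (it reuses \(f_x\), \(a_H\), \(a_V\), and \(a_D\)); the thresholds are taken from Table 3, but the toy shapes and the masking scheme are illustrative rather than the exact implementation.

```python
def dual_gate_masks(a_mean, t1, t2):
    """Eq. (2): split a pairwise-averaged weight matrix into 0-1 masks."""
    important = (a_mean > t1).astype(float)        # \hat{a}_{..1}
    redundant = (a_mean < t2).astype(float)        # \hat{a}_{..3}
    secondary = 1.0 - important - redundant        # weights between T2 and T1
    return important, secondary, redundant

T1, T2 = 0.5, 0.3                                  # dual gate thresholds (Table 3)
pairs = [(a_H + a_V) / 2, (a_H + a_D) / 2, (a_V + a_D) / 2]

levels = {1: [], 2: [], 3: []}
for a_mean in pairs:
    a_mean = np.broadcast_to(a_mean, f_x.shape)    # align weights with the feature map
    for lvl, mask in zip((1, 2, 3), dual_gate_masks(a_mean, T1, T2)):
        levels[lvl].append(mask * f_x)             # Eq. (3): keep or eliminate features

# Eq. (4): concatenate same-level features along the channel axis.
f1, f2, f3 = (np.concatenate(levels[lvl], axis=-1) for lvl in (1, 2, 3))
```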

Semantic guided decoupling module (SGDM)

To improve the quality of features, we develop a semantic guided decoupling module (SGDM), which introduces high-quality shallow features into the deep feature fusion. By taking the square root of the sum of the squares of the weight matrices in each direction (\(a_H\), \(a_V\) and \(a_D\)), we obtain two new weight matrices: the shallow global semantic weight matrix \({S}_{H V D}\) and the deep local semantic weight matrix \({D}_{H V D}\). The rationale is that a feature point whose weight in any direction exceeds the set threshold is more discriminative, so the model retains richer semantic information.
$$\begin{aligned} {\left\{ \begin{array}{ll} S_{H V D}=\sqrt{(a_{H})^{2}+(a_{V})^{2}+(a_{D})^{2}}\\ D_{H V D}=\sqrt{(a_{H})^{2}+(a_{V})^{2}+(a_{D})^{2}} \end{array}\right. } \end{aligned}$$
(5)
Then, SGDM performs feature screening on the shallow global features \(f_x\) and the deep local features \(f_y\). For the shallow global features \(f_{x}\) (as shown in Fig. 2), we compare the weight values in \(S_{H V D}\) with the threshold \(T_1\) to generate the corresponding shallow global 0–1 matrix \({\widehat{S}}_{H V D}\). Similarly, for the deep local features \(f_{y}\) (as shown in Fig. 2), the corresponding deep local 0–1 matrix \({\widehat{D}}_{H V D}\) can be generated by comparing \(D_{H V D}\) with the threshold \(T_3\). The feature screening itself is the same as in Sect. “Cross feature screening module (CFSM)”, and can be described as follows.
$$\begin{aligned} {\left\{ \begin{array}{ll} {\widehat{S}}_{H V D}^{i} =(S_{H V D}^{i}(1), S_{H V D}>T_{1} \Vert S_{H V D}^{i}(0), S_{H V D}<T_{1}) \\ {\widehat{D}}^{i}_{H V D} =(D_{H V D}^{i}(1), D_{H V D}>T_{3} \Vert D_{H V D}^{i}(0), D_{H V D}<T_{3}) \end{array}\right. } \end{aligned}$$
(6)
Finally, a set of optimal shallow global semantic features \(f_{cg}\) and a set of optimal deep local semantic features \(f'_{cg}\) are generated.
$$\begin{aligned} {\left\{ \begin{array}{ll} f_{c g}=\varphi ({\widehat{S}}_{H V D} \cdot f_{x}) \\ f_{c g}'=\varphi ({\widehat{D}}_{H V D} \cdot f_{y}) \end{array}\right. } \end{aligned}$$
(7)
where \(\varphi ()\) represents the matrix multiplication operation.
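Continuing the same sketch, Eqs. (5)–(7) reduce to combining the three directional weight matrices into a single magnitude map and applying one gate per branch; \(f_y\) below is a stand-in for the deep local feature map, and the thresholds are again placeholders.

```python
# Eq. (5): square root of the summed squares of the directional weights.
S_HVD = np.sqrt(a_H ** 2 + a_V ** 2 + a_D ** 2)    # shallow-branch weights
D_HVD = np.sqrt(a_H ** 2 + a_V ** 2 + a_D ** 2)    # deep-branch weights (same form)

T1, T3 = 0.5, 0.5                                  # illustrative gate thresholds
f_y = np.random.rand(*f_x.shape)                   # toy deep local feature map

# Eq. (6): single-gate 0-1 screening matrices.
S_mask = (np.broadcast_to(S_HVD, f_x.shape) > T1).astype(float)
D_mask = (np.broadcast_to(D_HVD, f_y.shape) > T3).astype(float)

# Eq. (7): optimal shallow global and deep local semantic features.
f_cg      = S_mask * f_x
f_cg_deep = D_mask * f_y
```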

Feature fusion

Feature fusion can improve the accuracy of medical image classification. To improve the representation ability of the redundant features \(f_{3}\), we project the two-dimensional features \(f_{3}\) into a one-dimensional vector space to reconstruct the feature relationships. Then, the shallow global features and important features are fused to obtain the fused features \(f_{m}\). This process can be described as follows:
$$\begin{aligned} {\left\{ \begin{array}{ll} f_{3}'={\text {Flatten}}({\text {Dense}}(f_{3})) \\ f_{1}'={\text {Flatten}}({\text {Dense}}({\text {Conv2D}} (f_{1}))) \\ f_{m}={\text {Add}}(f_{3}', f_{1}', {\text {Dense}}({\text {Flatten}}(f_{c g}))) \end{array}\right. } \end{aligned}$$
(8)
After that, we fuse \(f_{c g}\), \(f_{m}\) and the high-order features \(f_{c g}^{\prime }\) to obtain the deep features. This process can be described as follows:
$$\begin{aligned} {\left\{ \begin{array}{ll} f_{c g}^{\prime \prime }={\text {Reshape}}({\text {Dense}}({\text {Flatten}}(f_{c g}))) \\ f_{m}^{\prime }={\text {Conv1D}} ({\text {Reshape}}(f_{m})) \\ {\hat{f}}_{c g}={\text {Reshape}}(f_{c g}^{\prime }) \\ f_{i}={\text {Cat}}(f_{c g}^{\prime \prime }, f_{m}^{\prime }, {\hat{f}}_{c g}) \end{array}\right. } \end{aligned}$$
(9)
Next, we can get the final output feature O, as follows.
$$\begin{aligned} O={\text {Dense}}({\text {Conv1D}} (f_{i})) \end{aligned}$$
(10)
Finally, the SoftMax classifier is used to output the classification probability.
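A minimal Keras functional-API sketch of the fusion head of Eqs. (8)–(10) is given below; the `fusion_head` wrapper, the layer widths, and the kernel sizes are our own illustrative choices rather than the authors' published configuration, and the inputs are assumed to be Keras tensors with \(f_1\) and \(f_3\) sharing the same spatial size.

```python
from tensorflow.keras import layers

def fusion_head(f1, f3, f_cg, f_cg_deep, num_classes):
    """Illustrative fusion head mirroring Eqs. (8)-(10).

    f1: important features, f3: redundant features, f_cg: screened shallow
    global features, f_cg_deep: screened deep local features (all 3-D tensors).
    """
    # Eq. (8): project redundant, important and shallow global features to
    # vectors of equal length and add them element-wise.
    f3_p = layers.Flatten()(layers.Dense(8)(f3))
    f1_p = layers.Flatten()(layers.Dense(8)(layers.Conv2D(8, 3, padding="same")(f1)))
    cg_p = layers.Dense(f3_p.shape[-1])(layers.Flatten()(f_cg))
    f_m = layers.Add()([f3_p, f1_p, cg_p])

    # Eq. (9): reshape everything into (steps, channels) sequences and concatenate.
    f_cg_pp = layers.Reshape((1, 8))(layers.Dense(8)(layers.Flatten()(f_cg)))
    f_m_p = layers.Conv1D(8, 3, padding="same")(layers.Reshape((-1, 1))(f_m))
    f_hat = layers.Reshape((1, 8))(layers.Dense(8)(layers.Flatten()(f_cg_deep)))
    f_i = layers.Concatenate(axis=1)([f_cg_pp, f_m_p, f_hat])

    # Eq. (10) followed by the SoftMax classifier.
    x = layers.Flatten()(layers.Conv1D(16, 3, padding="same")(f_i))
    return layers.Dense(num_classes, activation="softmax")(x)
```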

Feature visualization for CFSM and SGDM

In this part, we validate the theoretical analysis and draw some important conclusions through a case study and visualization analysis. Figure 3 shows the visual analysis results of 5 \(\times \) 5 deep features with their weight coefficients, where the color denotes the weight in the attention matrix: a red cell indicates that the feature point carries more semantic information (a greater weight), while a blue cell indicates a smaller weight. Figure 3a is the original image. Figure 3b represents the deep feature; Fig. 3c–e represent the weight coefficients generated by vertical attention, horizontal attention, and depth attention, respectively; Fig. 3f represents the weight coefficients generated by the horizontal and vertical attention in the CFSM module; Fig. 3g represents the weight coefficients generated by the horizontal and depth attention in the CFSM module; Fig. 3h represents the weight coefficients generated by the vertical and depth attention in the CFSM module; Fig. 3i represents the weight coefficients generated by all three attention mechanisms in the SGDM module.
As can be seen from Fig. 3c–e, the weight visualization results show that using only one attention mechanism yields the worst weight assignment. As shown in Fig. 3f–h, when two attention mechanisms are combined in the CFSM module, an obvious information gain is produced. In Fig. 3i, when all three attention mechanisms are used in the SGDM module, the generated features contain the most semantic information. This shows that the proposed TSDNets is effective at guiding the attention weights to select the useful features.

Experiments and results

Datasets

We evaluate the proposed TSDNets on three datasets, which are briefly described in the following paragraphs.
Shenzhen dataset—Chest X-ray database [5]. This dataset was constructed by the National Library of Medicine of Maryland, USA, and the Third People’s Hospital of Shenzhen, China. It consists of 662 chest X-ray images, including 336 tuberculosis images (TB) and 326 normal images (Nor). The size of each image is approximately 3000 \(\times \) 2900–3000 pixels.
COVID-19 radiography database [1]. This dataset was jointly produced by researchers from Qatar University and Dhaka University. The original dataset contained only 219 COVID-19 images; to balance the dataset, this class was later supplemented to 1200 images. The data consist of three types of chest X-ray images: viral pneumonia (VP, 1345 images), normal (Nor, 1341 images), and COVID (COVID, 1200 images). The size of each image is 1024 \(\times \) 1024.
ISIC2018 dataset [22]. This is the largest public dataset of skin diseases, consisting of 10,015 skin disease images acquired with different types of imaging equipment. The dataset contains 7 types of skin diseases: MELANOMA (MEL, 1113), MELANOCYTIC NEVUS (NV, 6705), BASAL CELL CARCINOMA (BCC, 514), ACTINIC KERATOSIS (AKIEC, 327), BENIGN KERATOSIS (BKL, 1099), DERMATOFIBROMA (DF, 115) and VASCULAR LESIONS (VASC, 142). The size of each image is 600 \(\times \) 450.
For training and testing, we split the data into training, validation, and test sets with a ratio of 7:1:2. All medical images are resized to 256 \(\times \) 256. We train our model using the Keras deep learning framework on a desktop computer with a Tesla V100 GPU (16 GB of memory). We set the learning rate to 0.0001 and use Adam as the optimizer.
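The split and training setup described above can be sketched as follows, assuming `images`, `labels`, and `model` are already defined; the epoch count and batch size are not reported in the paper and are placeholders.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# 7:1:2 train/validation/test split; images are assumed to be resized to 256 x 256.
x_train, x_rest, y_train, y_rest = train_test_split(
    images, labels, test_size=0.3, stratify=labels, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=2 / 3, stratify=y_rest, random_state=0)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100, batch_size=16)       # placeholder training budget
```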

Evaluation metrics

We evaluate each model and perform various ablation experiments using the average accuracy (AA), overall accuracy (OA) and Kappa coefficient (Kappa) metrics.
$$\begin{aligned} \text {AA}&=\frac{1}{S}\left( \frac{n_{1}}{m_{1}}+\frac{n_{2}}{m_{2}}+\cdots +\frac{n_{s}}{m_{s}}\right) \end{aligned}$$
(11)
$$\begin{aligned} \text {OA}&=\frac{n_{1}+n_{2}+\ldots +n_{s}}{m_{1}+m_{2}+\cdots +m_{s}}\end{aligned}$$
(12)
$$\begin{aligned} \text{ Kappa }&=\frac{OA-\frac{\left( n_{1} \times m_{1}+n_{2} \times m_{2}+\cdots +n_{s} \times m_{s}\right) }{S \times S}}{1-\frac{\left( n_{1} \times m_{1}+n_{2} \times m_{2}+\cdots +n_{s} \times m_{s}\right) }{S \times S}} \end{aligned}$$
(13)
where \(n_i\) represents the number of samples of the ith class that are correctly classified; \(m_i\) represents the total number of samples in the ith class; and S represents the number of classes.
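The three metrics can be computed from per-class counts as sketched below; note that the expected-agreement term follows the standard Cohen's kappa convention (total sample count squared in the denominator), which is how we read the \(S \times S\) term in Eq. (13).

```python
import numpy as np

def aa_oa_kappa(y_true, y_pred, num_classes):
    """Compute AA, OA and Kappa from integer label arrays (Eqs. (11)-(13)).

    Assumes every class appears at least once in y_true.
    """
    # n[i]: correctly classified samples of class i; m[i]: total samples of class i.
    n = np.array([np.sum((y_true == c) & (y_pred == c)) for c in range(num_classes)])
    m = np.array([np.sum(y_true == c) for c in range(num_classes)])

    oa = n.sum() / m.sum()                 # Eq. (12)
    aa = np.mean(n / m)                    # Eq. (11)

    pe = np.sum(n * m) / (m.sum() ** 2)    # expected agreement (Eq. (13) reading)
    kappa = (oa - pe) / (1 - pe)
    return aa, oa, kappa
```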

Comparison to other state-of-the-art methods

For a fair comparison, we compare the proposed TSDNets classification framework with other state-of-the-art medical image classification models. Table 1 lists the experimental results on the different datasets. The proposed TSDNets achieves the best performance on all three benchmark datasets. Compared with SRC-MT, the Kappa value on the three datasets is increased by 4.54%, 0.02% and 2.13%, respectively. This may be because the proposed TSDNets can enrich the shallow global semantics of the target objects through the semantic guided decoupling module (SGDM), while providing effective prior information for feature extraction. Moreover, CFSM can capture more favorable local features, making the network pay more attention to subtle changes in the objects. Overall, the proposed TSDNets is effective for medical image classification.
Table 1 Experimental results of different classification models

| Model | ChinaSet OA / AA / Kappa | COVID-19 OA / AA / Kappa | ISIC OA / AA / Kappa | Parameters | FLOPs |
|---|---|---|---|---|---|
| DenseNet-121 | 0.8863 / 0.8902 / 0.7730 | 0.9818 / 0.9822 / 0.9726 | 0.7540 / 0.6008 / 0.5059 | 7.2 M | 14.9 M |
| ResNet-50 | 0.8566 / 0.8655 / 0.7122 | 0.9870 / 0.9874 / 0.9804 | 0.7605 / 0.6190 / 0.4929 | 23.5 M | 47.4 M |
| VGG-16 | 0.8712 / 0.8730 / 0.7426 | 0.9779 / 0.9783 / 0.9668 | 0.7202 / 0.5146 / 0.3863 | 14.8 M | 29.6 M |
| Xception | 0.8787 / 0.8870 / 0.7580 | 0.9844 / 0.9847 / 0.9765 | 0.7404 / 0.5044 / 0.4424 | 21.2 M | 42.8 M |
| InceptionV3 | 0.8712 / 0.8750 / 0.7427 | 0.9831 / 0.9837 / 0.9746 | 0.7852 / 0.6053 / 0.5670 | 22.0 M | 44.2 M |
| AlexNet | 0.8787 / 0.8870 / 0.7580 | 0.9792 / 0.9795 / 0.9687 | 0.7973 / 0.6533 / 0.5835 | 4.3 M | 8.7 M |
| ReLSNet | 0.8787 / 0.8870 / 0.7580 | 0.9870 / 0.9874 / 0.9804 | 0.7888 / 0.6418 / 0.5878 | 18.6 M | 37.1 M |
| SRC-MT | 0.8939 / 0.8892 / 0.7882 | 0.9870 / 0.9874 / 0.9804 | 0.7958 / 0.6795 / 0.5792 | 25.9 M | 52.1 M |
| Transformer | 0.8712 / 0.8750 / 0.7427 | 0.9870 / 0.9874 / 0.9804 | 0.7864 / 0.6506 / 0.5824 | 16.8 M | 33.7 M |
| Swin-Transformer | 0.8787 / 0.8870 / 0.7580 | 0.9883 / 0.9889 / 0.9824 | 0.8012 / 0.6931 / 0.5903 | 28.9 M | 57.4 M |
| TSDNets | 0.9167 / 0.9238 / 0.8336 | 0.9883 / 0.9889 / 0.9824 | 0.8044 / 0.7144 / 0.6005 | 32.3 M | 65.2 M |
Due to the large number of categories in ISIC and the small differences between the target categories, all the baseline methods achieve relatively poor classification performance on the ISIC dataset. For example, the Kappa values of AlexNet and ReLSNet are only 0.5835 and 0.5878, which are 1.7% and 1.26% lower than TSDNets, respectively. Transformer [14] and Swin Transformer [23] do not work well on the small ChinaSet, but their performance improves significantly when the number of samples in the dataset increases. Our proposed TSDNets is better applicable to all datasets.
When the category and data size are small (i.e., the ChinaSet dataset has 662 images in 2 categories), the proposed TSDNets gives the best performance under all metrics. This further demonstrates that TSDNets has good robustness and generalization. Specifically, Fig. 4 shows the confusion matrix of TSDNets on the three datasets.

Ablation study

Effects of features

In the experiment, the selection of features has an important influence on the results. We verified the impact of the different features \(f_{cg}\), \(f_1\), \(f_2\), \(f_3\) (see Sect. “TSDNets” for details) on the classification performance, as shown in Table 2, where \(Model\_nof_{*}\) denotes the model trained without the corresponding feature. We can see that \(Model\_nof_{cg}\) is closest to our method in all metrics on the three benchmark datasets. On ChinaSet, the Kappa of \(Model\_nof_{cg}\) is 9.08%, 3.02%, and 7.56% higher than that of \(Model\_nof_{1}\), \(Model\_nof_{2}\), and \(Model\_nof_{3}\), respectively. This shows that the shallow global features \(f_{cg}\) contribute less than the other features. We also observe that \(Model\_nof_{3}\) performs worse than the other models: although the redundant features \(f_{3}\) contain only a small amount of useful semantic information, removing them degrades performance the most, which demonstrates that redundant features projected into a specific dimensional space can still improve the classification performance.
Table 2 Experimental results of different feature comparisons

| Dataset | Model | OA | AA | Kappa |
|---|---|---|---|---|
| COVID-19 | \(Model\_nof_{cg}\) | 0.9850 | 0.9844 | 0.9766 |
| | \(Model\_nof_{1}\) | 0.9825 | 0.9818 | 0.9727 |
| | \(Model\_nof_{2}\) | 0.9810 | 0.9805 | 0.9707 |
| | \(Model\_nof_{3}\) | 0.9733 | 0.9727 | 0.9590 |
| | TSDNets | 0.9889 | 0.9883 | 0.9824 |
| ChinaSet | \(Model\_nof_{cg}\) | 0.9084 | 0.9015 | 0.8034 |
| | \(Model\_nof_{1}\) | 0.8623 | 0.8560 | 0.7126 |
| | \(Model\_nof_{2}\) | 0.8966 | 0.8864 | 0.7732 |
| | \(Model\_nof_{3}\) | 0.8715 | 0.8636 | 0.7278 |
| | TSDNets | 0.9238 | 0.9167 | 0.8336 |
| ISIC | \(Model\_nof_{cg}\) | 0.7962 | 0.6850 | 0.5927 |
| | \(Model\_nof_{1}\) | 0.7903 | 0.6605 | 0.5641 |
| | \(Model\_nof_{2}\) | 0.7913 | 0.6492 | 0.5833 |
| | \(Model\_nof_{3}\) | 0.7686 | 0.5553 | 0.5314 |
| | TSDNets | 0.8044 | 0.7144 | 0.6006 |
Table 3 Experimental results with different gated thresholds. \(T_1\), \(T_2\), and \(T_3\) indicate the different gated thresholds

| Dataset | Thresholds | OA | AA | Kappa |
|---|---|---|---|---|
| COVID-19 | \(T_1=0.5\), \(T_2=0.3\), \(T_3=0.5\) | 0.9876 | 0.9870 | 0.9805 |
| | \(T_1=0.6\), \(T_2=0.2\), \(T_3=0.5\) | 0.9813 | 0.9805 | 0.9707 |
| | \(T_1=0.6\), \(T_2=0.2\), \(T_3=0.4\) | 0.9863 | 0.9857 | 0.9785 |
| | \(T_1=0.6\), \(T_2=0.3\), \(T_3=0.6\) | 0.9848 | 0.9844 | 0.9766 |
| | \(T_1=0.6\), \(T_2=0.4\), \(T_3=0.5\) | 0.9850 | 0.9844 | 0.9766 |
| | \(T_1=0.7\), \(T_2=0.5\), \(T_3=0.5\) | 0.9812 | 0.9805 | 0.9707 |
| ChinaSet | \(T_1=0.5\), \(T_2=0.3\), \(T_3=0.5\) | 0.8931 | 0.8864 | 0.7731 |
| | \(T_1=0.6\), \(T_2=0.2\), \(T_3=0.5\) | 0.8804 | 0.8788 | 0.7573 |
| | \(T_1=0.6\), \(T_2=0.2\), \(T_3=0.4\) | 0.8968 | 0.8939 | 0.7881 |
| | \(T_1=0.6\), \(T_2=0.3\), \(T_3=0.6\) | 0.8968 | 0.8939 | 0.7881 |
| | \(T_1=0.6\), \(T_2=0.4\), \(T_3=0.5\) | 0.8903 | 0.8864 | 0.7730 |
| | \(T_1=0.7\), \(T_2=0.5\), \(T_3=0.5\) | 0.8799 | 0.8787 | 0.7577 |
| ISIC | \(T_1=0.5\), \(T_2=0.3\), \(T_3=0.5\) | 0.6174 | 0.7757 | 0.5371 |
| | \(T_1=0.6\), \(T_2=0.2\), \(T_3=0.5\) | 0.6134 | 0.7767 | 0.5563 |
| | \(T_1=0.6\), \(T_2=0.2\), \(T_3=0.4\) | 0.6398 | 0.8034 | 0.5883 |
| | \(T_1=0.6\), \(T_2=0.3\), \(T_3=0.6\) | 0.6533 | 0.8120 | 0.5933 |
| | \(T_1=0.6\), \(T_2=0.4\), \(T_3=0.5\) | 0.6189 | 0.7857 | 0.5703 |
| | \(T_1=0.7\), \(T_2=0.5\), \(T_3=0.5\) | 0.6573 | 0.7868 | 0.5512 |

Effects of gated thresholds

To verify the effectiveness of the gated threshold screening strategy, we conducted a large number of experiments. Table 3 shows the results of the proposed TSDNets with different gated thresholds. On the COVID-19 dataset, when \(T_2\) and \(T_3\) are fixed, the classification performance first increases and then gradually decreases as the gated threshold \(T_1\) increases. The classification performance shows the same trend with increasing \(T_2\) when \(T_1\) and \(T_3\) are fixed. The reason may be that when the gating threshold is low, irrelevant redundant information cannot be effectively filtered, reducing the overall classification performance of the model. When the gating threshold increases to a certain peak value, it can effectively filter irrelevant redundant information while preserving spatial semantic details. However, when the gating threshold continues to increase, some useful features may be filtered out, resulting in insufficient feature representation. From Table 3, we can see that with \(T_1=0.5\), \(T_2=0.3\), \(T_3=0.5\), the proposed TSDNets achieves the best performance.

Conclusion

This paper proposes a stereo spatial decoupling network (TSDNets) for medical image classification. We use the semantic guided decoupling module (SGDM) to obtain effective shallow global features, which provide favorable prior information for feature representation. Moreover, we use the cross-feature screening module (CFSM) with a dual-gate threshold strategy to enhance the interaction between features, which further improves the feature representation. Finally, we evaluate the proposed TSDNets on three benchmark datasets. The experimental results show that our method achieves new state-of-the-art classification performance, with significant improvements over existing approaches.
The proposed TSDNets can extract the semantics of spatial details with high performance and efficiency. In future work, we will consider reducing the number of model parameters to obtain a more concise and efficient spatial decoupling network. In addition, the current feature screening is still not sufficient, so it is necessary to further grade features by importance into useful, general, and redundant features.

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References

1. Aslan MF, Unlersen MF, Sabanci K, Durdu A (2021) CNN-based transfer learning-BiLSTM network: a novel approach for COVID-19 infection detection. Appl Soft Comput 98:106912
2. Bhanumathi V, Sangeetha R (2019) CNN based training and classification of MRI brain images. In: 2019 5th international conference on advanced computing & communication systems (ICACCS), pp 129–133
3. Borvornvitchotikarn T, Kurutach W (2016) A taxonomy of mutual information in medical image registration. In: 2016 international conference on systems, signals and image processing (IWSSIP), pp 1–4
4. Cai L, Gao J, Zhao D (2020) A review of the application of deep learning in medical image classification and segmentation. Ann Transl Med 8(11):713
5. Candemir S, Jaeger S, Palaniappan K, Musco JP, Singh RK, Xue Z, Karargyris A, Antani S, Thoma G, McDonald CJ (2014) Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans Med Imaging 33(2):577–590
6. Guo X, Yuan Y (2019) Triple ANet: adaptive abnormal-aware attention network for WCE image classification. In: 22nd international conference on medical image computing and computer-assisted intervention, MICCAI 2019, pp 293–301
7. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
8. Irfan R, Almazroi AA, Rauf HT, Damaševičius R, Nasr EA, Abdelgawad AE (2021) Dilated semantic segmentation for breast ultrasonic lesion detection using parallel feature fusion. Diagnostics 11(7):1212
9. Khan MA, Alhaisoni M, Tariq U, Hussain N, Majid A, Damaševičius R, Maskeliūnas R (2021) COVID-19 case recognition from chest CT images by deep learning, entropy-controlled firefly optimization, and parallel feature fusion. Sensors 21(21):7286
10. Lai Z, Deng H (2018) Medical image classification based on deep features extracted by deep model and statistic feature fusion with multilayer perceptron. Comput Intell Neurosci 2018:2061516
11. Lai Z, Deng H (2018) Medical image classification based on deep features extracted by deep model and statistic feature fusion with multilayer perceptron. Comput Intell Neurosci 2018:2061516
12. Li L, Xu M, Wang X, Jiang L, Liu H (2019) Attention based glaucoma detection: a large-scale database and CNN model. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10571–10580
13. Liu M, Yang J (2021) Image classification of brain tumor based on channel attention mechanism. J Phys Conf Ser 2035(1):12029
14. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: IEEE/CVF international conference on computer vision (ICCV), pp 9992–10002
15. Mateen M, Wen J, Nasrullah, Song S, Huang Z (2018) Fundus image classification using VGG-19 architecture with PCA and SVD. Symmetry 11(1):1
16. Mou L, Zhao Y, Fu H, Liu Y, Cheng J, Zheng Y, Su P, Yang J, Chen L, Frangi AF et al (2021) CS2-Net: deep learning segmentation of curvilinear structures in medical imaging. Med Image Anal 67:101874
17. Ni X, Yan Z, Wu T, Fan J, Chen C (2018) A region-of-interest-reweight 3D convolutional neural network for the analytics of brain information processing. In: International conference on medical image computing and computer-assisted intervention, pp 302–310
18. Ning D, Liu G, Jiang R, Wang C (2019) Attention-based multi-scale transfer ResNet for skull fracture image classification. In: Fourth international workshop on pattern recognition, vol 11198, pp 63–67
19. Raj RJS, Shobana SJ, Pustokhina IV, Pustokhin DA, Gupta D, Shankar K (2020) Optimal feature selection-based medical image classification using deep learning model in internet of medical things. IEEE Access 8:58006–58017
20. Ravi D, Wong C, Deligianni F, Berthelot M, Andreu-Perez J, Lo B, Yang GZ (2017) Deep learning for health informatics. IEEE J Biomed Health Inform 21:4–21
21. Schlemper J, Oktay O, Schaap M, Heinrich MP, Kainz B, Glocker B, Rueckert D (2019) Attention gated networks: learning to leverage salient regions in medical images. Med Image Anal 53:197–207
22. Tschandl P, Rosendahl C, Kittler H (2018) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data 5(1):180161
23. Verma P (2021) Attention is all you need? Good embeddings with statistics are enough: audio understanding without convolutions/transformers/BERTs/mixers/attention/RNNs, p 2110
24. Wang H, Wang S, Qin Z, Zhang Y, Li R, Xia Y (2021) Triple attention learning for classification of 14 thoracic diseases using chest radiography. Med Image Anal 67:101846
25. Wang M, Gong X (2020) Metastatic cancer image binary classification based on ResNet model. In: 2020 IEEE 20th international conference on communication technology (ICCT), pp 1356–1359
26. Wen Y, Chen L, Chen H, Tang X, Deng Y, Chen Y, Zhou C (2021) Non-local attention learning for medical image classification. In: 2021 IEEE international conference on multimedia and expo (ICME), pp 1–6
27. Woo S, Park J, Lee JY, Kweon IS (2018) CBAM: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
28. Xie H, Zeng X, Lei H, Du J, Wang J, Zhang G, Cao J, Wang T, Lei B (2021) Cross-attention multi-branch network for fundus diseases classification using SLO images. Med Image Anal 71:102031
29. Xing H, Xiao Z, Zhan D, Luo S, Dai P, Li K (2022) SelfMatch: robust semisupervised time-series classification with self-distillation. Int J Intell Syst 37(11):8583–8610
30. Xu L, Huang J, Nitanda A, Asaoka R, Yamanishi K (2020) A novel global spatial attention mechanism in convolutional neural network for medical image classification. arXiv:2007.15897
31. Yadav SS, Jadhav SM (2019) Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data 6(1):1–18
32. Yan H, Yu M, Xia J, Zhu L, Zhang T, Zhu Z, Sun G (2020) Diverse region-based CNN for tongue squamous cell carcinoma classification with Raman spectroscopy. IEEE Access 8:127313–127328
33. You H, Yu L, Tian S, Ma X, Xing Y (2023) Medical image segmentation based on dual-channel integrated cross-layer residual algorithm. Multimed Tools Appl 82(4):5587–5603
34. Zhang Q, Bai C, Liu Z, Yang LT, Yu H, Zhao J, Yuan H (2020) A GPU-based residual network for medical image classification in smart medicine. Inf Sci 536:91–100
35. Zhao M, Hamarneh G (2019) Retinal image classification via vasculature-guided sequential attention. In: 2019 IEEE/CVF international conference on computer vision workshop (ICCVW)
Metadata

Title: A stereo spatial decoupling network for medical image classification
Authors: Hongfeng You, Long Yu, Shengwei Tian, Weiwei Cai
Publication date: 17.04.2023
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems / Issue 5/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-023-01049-9
