
Open Access 06.05.2024 | Original Article

An attention mechanism module with spatial perception and channel information interaction

Authors: Yifan Wang, Wu Wang, Yang Li, Yaodong Jia, Yu Xu, Yu Ling, Jiaqi Ma

Published in: Complex & Intelligent Systems


Abstract

In the field of deep learning, the attention mechanism, as a technology that mimics human perception and attention processes, has made remarkable achievements. Current methods combine a channel attention mechanism and a spatial attention mechanism in a parallel or cascaded manner to enhance the model's representational competence, but they do not fully consider the interaction between spatial and channel information. This paper proposes a method in which a space embedded channel module and a channel embedded space module are cascaded to enhance the model's representational competence. First, in the space embedded channel module, to enhance the representational competence of the region of interest in different spatial dimensions, the input tensor is split into horizontal and vertical branches according to the spatial dimensions, alleviating the loss of position information that 2D pooling incurs. To smooth the features and highlight local features, four branches are then obtained through global maximum and average pooling, and the branches aggregated by each pooling method are merged to obtain two feature tensors, one per pooling method. To enable the output horizontal and vertical feature tensors to focus on both pooling features simultaneously, the two feature tensors are split and dimensionally transposed according to the spatial dimensions, and the features are then aggregated along the spatial direction. Then, in the channel embedded space module, to address the lack of cross-channel connections between groups in grouped convolution and its large parameter count, this paper uses adaptive grouped banded matrices. Exploiting the mapping relationship between the number of channels and the convolution kernel size, the kernel size is adaptively computed on the banded matrices to achieve adaptive cross-channel interaction, enhancing the correlation between the channel dimensions while keeping the spatial dimensions unchanged. Finally, the output horizontal and vertical weights are used as attention weights. In the experiments, the proposed attention mechanism module is embedded into MobileNetV2 and ResNet networks of different depths, and extensive experiments are conducted on the CIFAR-10, CIFAR-100 and STL-10 datasets. The results show that the proposed method captures and utilizes the features of the input data more effectively than the other methods, significantly improving classification accuracy. Although the module introduces an additional computational burden (0.5 M parameters), the overall performance of the model still achieves the best results when the computational overhead is comprehensively considered.

Introduction

The attention mechanism allocates different weights to different input elements, enabling the model to focus more accurately on important information, thereby improving the model performance. In practical applications, attention mechanisms are widely used in fields such as natural language processing [1], computer vision [2], and image processing [3].
The channel attention mechanism [4–7] can adjust feature weights based on the importance of different channels, making the model focus more on features that benefit the task. SE-Net [4], the first channel attention module, applies attention to the channel dimension via global pooling operations. Building on SE-Net, ECA-Net [5] and GCT [6] add interrelationships between channels. FCA-Net [7] extends the global average pooling operation in the channel attention mechanism to a two-dimensional discrete cosine transform, increasing feature diversity.
The computational cost of the channel attention mechanism is relatively low, but the importance of spatial location is ignored, resulting in insufficient attention given to local features.
The spatial attention mechanism [8–10] assigns different weights to information from different positions, so the model focuses more on regions that are more important to the task. RAM [8] builds a spatial attention mechanism on recurrent neural networks so that the model focuses on the most relevant parts when processing the input sequence. STN [9] improves image classification accuracy via adaptive spatial transformation. CCNet [10] enhances the semantic information of features by establishing long-range dependencies based on spatial attention. However, the spatial attention mechanism ignores the channel information of the input sequence or image, focusing too much on local information while ignoring the global context. Additionally, it is less effective on unstructured data such as natural language text than on structured data such as images.
The mixed attention mechanism [11–17] combines channel and spatial information to capture multiscale features and improve model representational competence. BAM [11] and HAM [12] cascade channel and spatial attention mechanisms. CBAM [13] and scSE [14] combine channel and spatial attention to enhance the feature expression ability of convolutional neural networks. SA-Net [15] and EPSA-Net [16] use convolutional kernels of different sizes to process input feature maps before parallel connection, giving the model multiscale feature information. Coordinate attention (CA) [17] cleverly embeds spatial dimension information into the channel dimension and aggregates features along two spatial directions while preserving accurate positional information and spatial information from different directions.
In summary, the channel attention mechanism [4–7] and the spatial attention mechanism [8–10] each focus on information from a single dimension, which limits the attention given to local features. The mixed attention mechanism [11–17] enhances the representational competence of the model through parallel or cascaded combinations; however, mixed attention increases model complexity and requires carefully choosing how channel and spatial information are combined.
To solve the above problems, this paper proposes a novel and effective attention mechanism module, namely, the space and channel mutually embedded attention mechanism. By designing a cascade with a space embedded channel module and a channel embedded space module, this paper comprehensively considers horizontal and vertical information in the channel dimension. By introducing an adaptive channel interaction mode, the channel correlation in different spatial dimensions is strengthened, thus solving the problem of channel and spatial attention mechanisms focusing on information from different dimensions. The design of mutual embedding not only enhances the model's comprehensive representation of local features, but also avoids the complexity of introducing mixed attention and eliminates the need to finely select how channels and space are combined.
In the space embedded channel module, first, the input intermediate feature tensor is split into two feature tensors in the horizontal (C × 1 × W) and vertical (C × H × 1) directions, which alleviates the loss of position information when performing 2D pooling. Next, to introduce features extracted by different pooling operations and increase the model's diversity for the input data, the two feature maps are passed through 2D maximum and average pooling to obtain four sets of feature tensors ((C × 1 × W)avg, (C × 1 × W)max, (C × H × 1)avg and (C × H × 1)max). Then, to fuse the features extracted in the horizontal and vertical directions and capture the information of the input data more comprehensively, the horizontal and vertical features are merged per pooling method to obtain two sets of feature tensors ((C × 1 × (W + H))avg and (C × 1 × (W + H))max). Finally, to preserve positional information from the different spatial dimensions, these feature tensors are split and then merged along the different directions to obtain two sets of feature tensors. Each set of feature tensors thus focuses on the important regions of the module while enhancing the semantic information of the spatial location features.
In the channel embedded space module, first, the horizontal and vertical feature vectors output from the space embedded channel module are taken as two parallel inputs, and a tensor transformation converts each two-dimensional tensor into a one-dimensional tensor. Next, to strengthen the correlation between channel dimensions in different spatial dimensions, banded matrices are introduced on the basis of grouped convolution to reduce the number of parameters, and the convolution kernel is set so that the input and output dimensions remain consistent, as in the ECA-Net method. The two sets of feature tensors output from the 1D banded-matrix channel adaptive interaction module are then extended to the spatial dimension by unsqueezing to obtain the channel features. Afterwards, a Sigmoid operation adjusts the attention weights, limiting their range to (0, 1), to generate the attention weights of the channel dimensions in different spatial dimensions. Then, an expansion operation expands the weights along the height and width dimensions to match the size of the original input feature tensor in preparation for subsequent elementwise multiplication. Finally, elementwise multiplication outputs the original input tensor adjusted by the attention weights in the horizontal and vertical directions, allowing the model to focus more adaptively on information from different positions and spatial dimensions in the input tensor.
Main contributions:
(1)
To our knowledge, this paper is the first to interactively embed channel and spatial attention. The interaction between channel and spatial attention enhances the model's representational competence.
 
(2)
A space embedded channel module is constructed to enhance the representational competence for objects of interest. This module embeds the horizontal and vertical directions into the channel dimension, smooths image information and highlights local features through global maximum and average pooling, thus comprehensively considering feature information from different directions.
 
(3)
A channel embedded space module is constructed, using an adaptive grouped banded matrix to enhance the correlation between channels in different spatial dimensions. The attention weights of the generated channel dimensions in different spatial dimensions are utilized to multiply elementwise with the original feature tensor to adjust the input tensor and make the model more adaptive in focusing on information from different channels and spatial dimensions.
 
Related work

This section provides a brief overview of convolutional neural network architectures for image classification and reviews in detail the CA [17] attention mechanism, which inspired the algorithm in this paper. The proposed algorithm is motivated by an analysis of CA's shortcomings.

Network engineering

“Network engineering” plays an important role in visual research, and convolutional neural network algorithms such as LeNet [18], AlexNet [19], VGG [20], Inception [21], ResNet [22] and MobileNet [23–25] are commonly used in tasks such as image classification, object detection, and image segmentation. The LeNet [18] algorithm demonstrates the effectiveness of convolutional neural networks in image classification tasks. AlexNet [19] introduces a deeper network architecture and adopts ReLU activation functions and Dropout regularization to avoid gradient vanishing and overfitting; its design lays the foundation for more complex networks such as VGG [20] and ResNet [22]. The VGG [20] network structure is relatively simple, with 16–19 layer deep models. ResNet [22] solves the gradient vanishing and exploding problems in deep network training and can still achieve good performance even when the network depth exceeds 100 layers; typical examples include ResNet18, ResNet56, ResNet110 and ResNet152. The MobileNet [23–25] series comprises efficient, lightweight neural networks for mobile devices and embedded systems, achieving good performance when computing resources are limited. MobileNetV1 [23] replaces the convolutions in VGG [20] with depthwise separable convolutions and uses ReLU6 as the activation function. MobileNetV2 [24] adds shortcut connections and dimensionality expansion, and uses linear activation instead of ReLU on the output pointwise convolution. MobileNetV3 [25] introduces the inverted residual module and squeeze-and-excitation module on top of V2 [24], using Hard-swish as the activation function.
In the experimental section of this paper, MobileNetV2 and various depths of ResNet are chosen to validate the attention module, evaluate its applicability to lightweight networks (such as mobile devices) and more powerful networks (such as servers), and determine its effectiveness in different environments.

Review and problem analysis of CA

The channel attention mechanism [4–7, 26] is an expression of feature abstraction, while the spatial attention mechanism [8–10, 27] is an enrichment of positional information. A single attention mechanism cannot simultaneously capture channel and positional information. Therefore, researchers have proposed mixed attention mechanisms [28], which generate multiple attention feature maps and combine them in cascade [11, 12, 29] or in parallel [13, 14, 30] to obtain richer feature representations. CA [17] cleverly attaches spatial information to channels and can be plugged into lightweight classification networks [31] with negligible computational overhead. CA [17] provides a new approach for mixed attention mechanisms.
The CA [17] algorithm first converts a three-dimensional tensor F = [F1, F2,…, FC] ∈ RC×H×W into two two-dimensional tensors \(Z_{C}^{H} \in {\mathbb{R}}^{C \times H \times 1}\) and \(Z_{C}^{W} \in {\mathbb{R}}^{C \times 1 \times W}\) and computes the correlation between different dimensions as in Eqs. (1) and (2).
$$ Z_{C}^{H} (H) = \frac{1}{W}\sum\limits_{0 \le i \le W} {X_{C} (H,i)} ,\;Z_{C}^{H} \in {\mathbb{R}}^{C \times H \times 1} $$
(1)
where C denotes the number of channels, H denotes the height, and 1 denotes the compressed width. Each channel encodes, at each vertical position, the features aggregated along the horizontal direction.
$$ Z_{C}^{W} (W) = \frac{1}{H}\sum\limits_{0 \le j \le H} {X_{C} (j,W)} ,\;Z_{C}^{W} \in {\mathbb{R}}^{C \times 1 \times W} $$
(2)
where 1 denotes the compressed height and W denotes the width. Each channel encodes, at each horizontal position, the features aggregated along the vertical direction.
The results of \(Z_{C}^{H} \in {\mathbb{R}}^{C \times H \times 1}\) and \(Z_{C}^{W} \in {\mathbb{R}}^{C \times 1 \times W}\) are concatenated after global average pooling; after dimensionality reduction on the channels, the result is split again to obtain the new \(Z_{C}^{H} \in {\mathbb{R}}^{C \times H \times 1}\) and \(Z_{C}^{W} \in {\mathbb{R}}^{C \times 1 \times W}\). Two feature maps are obtained through Eq. (3), and the output of the attention mechanism is the product of the feature maps.
$$ \left\{ \begin{gathered} g_{H} = Sigmoid(F_{H} ({\text{Re}} LU(F_{1 \times 1} (Z_{C}^{H} ,Z_{C}^{W} )))) \hfill \\ g_{W} = Sigmoid(F_{W} ({\text{Re}} LU(F_{1 \times 1} (Z_{C}^{H} ,Z_{C}^{W} )))) \hfill \\ \end{gathered} \right. $$
(3)
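For concreteness, the directional pooling of Eqs. (1) and (2) can be written in a few lines of PyTorch. This is a minimal illustrative sketch, not the CA authors' code; the tensor F and its shape are assumptions for the example.

```python
import torch

# Assumed input: a batch of feature maps F with shape (N, C, H, W).
F = torch.randn(2, 64, 32, 32)

# Eq. (1): average over the width, one value per (channel, height) pair.
z_h = F.mean(dim=3, keepdim=True)   # shape (N, C, H, 1)

# Eq. (2): average over the height, one value per (channel, width) pair.
z_w = F.mean(dim=2, keepdim=True)   # shape (N, C, 1, W)
```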
In the CA algorithm, the idea of splitting spatial dimension information into two dimensions and embedding them into channels is very novel and effective. However, the following issues remain:
(1)
Only global average pooling is performed after spatial dimension splitting, and the loss of detailed local feature information in the feature map makes it more difficult for the network to capture the local structure in the image.
 
(2)
In CA, only spatial dimension information is embedded into the channels; can channel information also be embedded into the spatial dimension?
 
(3)
Processing channel information by dimensionality reduction leads to data loss, while focusing primarily on local interchannel relationships fails to capture longer-range dependencies.
 
In response to the above issues, this paper proposes an attention mechanism module with channel and spatial dimension information interaction. A combination of multiple pooling units provides greater feature richness than a single pooling operation. Channels are divided into multiple groups using banded matrices, treating each group as a specific type of channel information; keeping the spatial dimensions unchanged, each pixel can capture channel information from different groups, thus embedding channel information into the spatial dimension. The banded matrices also avoid the data loss that processing channel information by dimensionality reduction would cause. By promoting information exchange between different dimensions, more correlations are introduced into the model, enhancing its feature representation competence.

Methodology

Assume F = [F1, F2,…, FC] ∈ RC×H×W is the intermediate feature map input to the module, and the output is F′ = [F1′, F2′,…, FC′] ∈ RC×H×W. A schematic diagram of the proposed attention mechanism is shown in Fig. 1. The SPCII (spatial perception and channel information interaction) module achieves interaction between channel and spatial dimension information in two steps: embedding spatial dimension information into the channel dimension, and embedding channel dimension information into the spatial dimension. The following provides a detailed description of SPCII.

Space embedded channel module

As shown in Fig. 1, the red solid line on the left marks the space embedded channel module. Following the CA [17] algorithm, \(Z_{C}^{H}\) and \(Z_{C}^{W}\) are obtained, and “where” is embedded into “what”. That is, the channel dimension remains unchanged and still represents different channel features, while the horizontal and vertical spatial dimensions are each compressed to 1, embedding the different position information into the channel dimension.
Because reference [17] uses only average pooling, which preserves smooth image information, this paper follows the CBAM [13] attention mechanism and adds maximum pooling alongside average pooling, as in Eqs. (4) and (5). The combination of the two helps highlight local features while preserving spatially smooth information.
$$ \left\{ {\begin{array}{*{20}c} {F_{{A{\text{vg}}}}^{H} = AvgPool(Z_{C}^{H} (H))} \\ {F_{Max}^{H} = MaxPool(Z_{C}^{H} (H))} \\ \end{array} } \right. $$
(4)
$$ \left\{ {\begin{array}{*{20}c} {F_{Avg}^{W} = AvgPool(Z_{C}^{W} (W))} \\ {F_{Max}^{W} = MaxPool(Z_{C}^{W} (W))} \\ \end{array} } \right. $$
(5)
Then, the two feature tensors obtained by the same pooling method are concatenated along the horizontal and vertical directions to form two new feature tensors, \(Concat(F_{Avg}^{W} ,F_{Avg}^{H} ) \in {\mathbb{R}}^{C \times 1 \times (H + W)}\) and \(Concat(F_{Max}^{W} ,F_{Max}^{H} ) \in {\mathbb{R}}^{C \times 1 \times (H + W)}\), so that the model simultaneously considers feature information from different directions.
In Eq. (6), \(F_{Avg}^{HW\prime }\) and \(F_{Max}^{HW\prime }\) are obtained through 2D convolution, batch normalization and activation, yielding a richer, higher-level feature representation.
$$ \left\{ {\begin{array}{*{20}c} {F_{Avg}^{HW\prime } = {\text{Re}} LU(BN(Conv2D(Concat(F_{Avg}^{W} ,F_{Avg}^{H} ))))} \\ {F_{Max}^{HW\prime } = {\text{Re}} LU(BN(Conv2D(Concat(F_{Max}^{W} ,F_{Max}^{H} ))))} \\ \end{array} } \right. $$
(6)
Finally, \(F_{Avg}^{HW\prime }\) and \(F_{Max}^{HW\prime }\) are split to obtain \(F_{{A{\text{vg}}}}^{H\prime }\), \(F_{Avg}^{W\prime }\), \(F_{Max}^{H\prime }\) and \(F_{Max}^{W\prime }\). Note that the dimensions of \(F_{Avg}^{W\prime }\) and \(F_{Max}^{W\prime }\) are transposed, swapping the horizontal and vertical dimensions. The two pooling branches are then summed per direction, as in Eq. (7), to improve the ability of each branch to capture image features.
$$ \left\{ {\begin{array}{*{20}c} {f_{H} = F_{{A{\text{vg}}}}^{H\prime } + F_{Max}^{H\prime } } \\ {f_{W} = F_{Avg}^{W\prime } + F_{Max}^{W\prime } } \\ \end{array} } \right. $$
(7)
In Eq. (7), the output fH has shape (C, (HA + HM), 1) and the output fW has shape (C, 1, (WA + WM)), where HA and HM (WA and WM) denote the heights (widths) contributed by the average and maximum pooling branches.
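The steps above can be summarized in a PyTorch sketch. This is a minimal reading of Eqs. (4)–(7) under stated assumptions, not the authors' released code: the 1×1 convolution is assumed to preserve the channel count (the paper does not state a reduction ratio), a shared Conv–BN–ReLU is assumed for both pooling branches, and Eq. (7) is read as an elementwise sum, so HA = HM = H and WA = WM = W. All names are illustrative.

```python
import torch
import torch.nn as nn

class SpaceEmbeddedChannel(nn.Module):
    """Sketch of the space embedded channel module (Eqs. (4)-(7))."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolution keeping the channel count (assumption: no reduction).
        self.conv = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: (N, C, H, W)
        n, c, h, w = x.shape
        # Eqs. (4)-(5): directional average and maximum pooling.
        f_h_avg = x.mean(dim=3, keepdim=True)    # (N, C, H, 1)
        f_h_max = x.amax(dim=3, keepdim=True)
        f_w_avg = x.mean(dim=2, keepdim=True)    # (N, C, 1, W)
        f_w_max = x.amax(dim=2, keepdim=True)
        # Concatenate the two directions per pooling type: (N, C, 1, W+H).
        y_avg = torch.cat([f_w_avg, f_h_avg.transpose(2, 3)], dim=3)
        y_max = torch.cat([f_w_max, f_h_max.transpose(2, 3)], dim=3)
        # Eq. (6): Conv2D -> BN -> ReLU on both pooling branches.
        y_avg = self.relu(self.bn(self.conv(y_avg)))
        y_max = self.relu(self.bn(self.conv(y_max)))
        # Split back into width and height parts and undo the transpose.
        w_avg, h_avg = torch.split(y_avg, [w, h], dim=3)
        w_max, h_max = torch.split(y_max, [w, h], dim=3)
        # Eq. (7): merge the two pooling branches per direction.
        f_h = h_avg.transpose(2, 3) + h_max.transpose(2, 3)   # (N, C, H, 1)
        f_w = w_avg + w_max                                   # (N, C, 1, W)
        return f_h, f_w
```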

Channel embedded space module

As shown in Fig. 1, the blue solid line on the right side shows the channel embedded space module. This module takes the feature vectors in the horizontal and vertical dimensions output by the space embedded channel module as two parallel inputs.
The inputs fH and fW are both two-dimensional tensors. To embed the channels into different spatial dimensions, taking fH as an example, this paper defines a new view with height new_h = (HA + HM) × 1 that shares data storage with the original tensor, where each channel contains the elements at the corresponding positions in the original fH. The same applies to fW.
The tensor view transformation in Eq. (8) adapts fH and fW to the input shape (C, new_h) required by a one-dimensional convolution with kernel size k, ensuring that the input shape is correct.
$$ \left\{ \begin{gathered} f_{H} .view\left( {f_{H} .size\left( 0 \right),new\_h} \right) \hfill \\ f_{W} .view\left( {f_{W} .size\left( 0 \right),new\_w} \right) \hfill \\ \end{gathered} \right. $$
(8)
Equation (8) uses the PyTorch fH.view(…) operation; the view function changes the tensor shape without changing the tensor elements. In Eq. (8), fH.size(0) denotes the channel dimension, and new_h = (HA + HM) × 1 denotes the height dimension.
This paper obtains the number of channels by C = fH.size(0), which is used as the number of input and output channels for 1D convolution.
For the convolution kernel size k, the 1D banded matrix for adaptive cross-channel interaction (ACCI) from the ECA-Net [5] algorithm is used to strengthen the correlation between channel dimensions. A 1D convolution that fully connects all channels has C² parameters. Changing it into a grouped convolution divides the channels into several blocks, each fully connected internally, reducing the parameters to C²/G, where G is the number of groups; however, there is no cross-channel connection between groups, and the parameter count is still large. Therefore, the banded matrix of Eq. (9) is introduced, which reduces the parameter count to k × C while keeping the input and output dimensions consistent and enhancing the correlation between channels.
$$ w_{G} = \left[ {\begin{array}{*{20}c} {w^{1,1} } & \cdots & {w^{1,k} } & 0 & 0 & \cdots & \cdots & 0 \\ 0 & {w^{2,2} } & \cdots & {w^{2,k + 1} } & 0 & \cdots & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & {w^{C,C - k + 1} } & \cdots & {w^{C,C} } \\ \end{array} } \right] $$
(9)
There exists a mapping between C and k, as shown in Eq. (10); reference [5] defines the relationship \(C = 2^{\gamma k - b}\). To ensure that the convolution kernel has a center point, the kernel size is forced to be odd, and it is adaptively computed by Eq. (10).
$$ \left\{ {\begin{array}{*{20}c} {k\% 2 = 0 \Rightarrow k = \psi (C) = \left| {\frac{{\log_{2} (C)}}{\gamma } + \frac{b}{\gamma }} \right| - 1} \\ {k\% 2 = 1 \Rightarrow k = \psi (C) = \left| {\frac{{\log_{2} (C)}}{\gamma } + \frac{b}{\gamma }} \right|} \\ \end{array} } \right. $$
(10)
where γ = 2 and b = 1.
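Eq. (10) translates directly into code. Below is a small sketch with γ = 2 and b = 1 as above; the guard for very small channel counts is an added assumption.

```python
import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Eq. (10): adaptively map the channel count C to an odd kernel size k."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    if k % 2 == 0:          # force an odd kernel so it has a centre point
        k -= 1
    return max(k, 1)        # guard for very small C (assumption)
```

For example, C = 64 gives k = 3, and C = 512 gives k = 5.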
Taking the (C, new_h) tensor as the input, the ACCI with kernel size k outputs a new one-dimensional tensor over the C channels, namely, the relationships between the channels.
In this paper, to extend the ACCI results to the spatial dimension and achieve a channel embedded spatial dimension, we use .unsqueeze(-1) to add a dimension at the last position, i.e., (C, (HA + HM), 1), matching the original two-dimensional tensor. For the fW part, we use .unsqueeze(-2) to add a dimension at the second-to-last position, i.e., (C, 1, (WA + WM)). The attention weights are then adjusted by a Sigmoid operation (the Sigmoid activation function in Eq. (11)), which limits their range to (0, 1), generating the attention weights of the channel dimension in different spatial dimensions. The correlation between different channels is dynamically adjusted by the weights of the banded matrix, allowing the model to focus more flexibly on important channel information in different spatial dimensions.
$$ \left\{ \begin{gathered} g_{H} = Sigmoid(Conv1D\_k(f_{H} )) \hfill \\ g_{W} = Sigmoid(Conv1D\_k(f_{W} )) \hfill \\ \end{gathered} \right. $$
(11)
The outputs gH and gW encode, for the horizontal and vertical directions respectively, the channel information about which positions the model should focus on; that is, different channel information is embedded in the spatial dimension. The .expand operation expands gH and gW along the height and width dimensions to match the dimensions of the input FC for subsequent elementwise multiplication. The output of the attention block \(F_{C}^{\prime } (i,j)\) is given by Eq. (12).
$$ F_{C}^{\prime } (i,j) = F_{C} (i,j) \times g_{C}^{H} (i) \times g_{C}^{W} (j) $$
(12)
The output of the attention block is the result of the input tensor FC being adjusted by the attention weights gH and gW in the horizontal and vertical directions, respectively. The attention block can more adaptively focus on information from different positions and spatial dimensions in the input tensor to improve performance in image processing tasks.
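A PyTorch sketch of Eqs. (8)–(12) follows. The paper's wording leaves the exact convolution layout open; this sketch takes the ECA-style reading in which a kernel of size k slides along the channel axis at every retained spatial position, realizing the banded matrix of Eq. (9) with shared band weights. The class and variable names are illustrative, and adaptive_kernel_size is the helper sketched above.

```python
import torch
import torch.nn as nn

class ChannelEmbeddedSpace(nn.Module):
    """Sketch of the channel embedded space module (Eqs. (8)-(12))."""

    def __init__(self, channels):
        super().__init__()
        k = adaptive_kernel_size(channels)   # Eq. (10), sketched above
        # A kernel of size k slid along the channel axis realises the banded
        # matrix of Eq. (9); the band weights are shared, as in ECA-Net.
        self.conv_h = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv_w = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x, f_h, f_w):
        # x: (N, C, H, W); f_h: (N, C, H, 1); f_w: (N, C, 1, W).
        n, c, h, w = x.shape
        # Eq. (8): flatten to 1D and put channels last so the convolution
        # runs across channels at every retained spatial position.
        t_h = f_h.reshape(n, c, h).transpose(1, 2).reshape(n * h, 1, c)
        t_w = f_w.reshape(n, c, w).transpose(1, 2).reshape(n * w, 1, c)
        a_h = self.conv_h(t_h).reshape(n, h, c).transpose(1, 2)  # (N, C, H)
        a_w = self.conv_w(t_w).reshape(n, w, c).transpose(1, 2)  # (N, C, W)
        # Eq. (11): sigmoid gates, restored to 2D with unsqueeze.
        g_h = torch.sigmoid(a_h).unsqueeze(-1)   # (N, C, H, 1)
        g_w = torch.sigmoid(a_w).unsqueeze(-2)   # (N, C, 1, W)
        # Eq. (12): expand and reweight the original input elementwise.
        return x * g_h.expand_as(x) * g_w.expand_as(x)
```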

SPCII attention generation

Figure 2a shows the proposed SPCII attention mechanism module, which can be plugged into any CNN architecture. XAP and YAP denote the average pooling operations in the horizontal and vertical directions, XMP and YMP denote the corresponding maximum pooling operations, and the resulting tensors are the horizontal and vertical branches after splitting. ACCI is the adaptive cross-channel interaction module of Sect. "Channel embedded space module". Figure 2b shows the integration of SPCII with MobileNetV2 [24], and Fig. 2c shows the integration of SPCII with ResNet [22] (taking BasicBlock as an example).
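Continuing the sketches above, the two modules cascade into the plug-and-play block of Fig. 2a. The residual-branch placement in the comment mirrors Fig. 2c; the exact insertion point inside BasicBlock is an assumption based on that figure.

```python
import torch.nn as nn

class SPCII(nn.Module):
    """Sketch of the full SPCII block: the space embedded channel module
    cascaded with the channel embedded space module (both sketched above)."""

    def __init__(self, channels):
        super().__init__()
        self.sec = SpaceEmbeddedChannel(channels)
        self.ces = ChannelEmbeddedSpace(channels)

    def forward(self, x):
        f_h, f_w = self.sec(x)          # directional attention descriptors
        return self.ces(x, f_h, f_w)    # reweighted feature map, same shape

# Hypothetical placement in a ResNet BasicBlock, following Fig. 2c:
#   out = self.bn2(self.conv2(out))
#   out = self.spcii(out)              # attention on the residual branch
#   out = self.relu(out + identity)
```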

Experiments

Experiment setup

This paper implements the experiments with the PyTorch toolkit on the VSCode platform. To evaluate the proposed algorithm, experiments are performed on the standard datasets CIFAR-10, CIFAR-100, and STL-10. During training, a standard SGD optimizer is used with a momentum of 0.9, a weight decay of 4E-5, and an initial learning rate of 0.05. MobileNetV2 is used as the baseline with 200 epochs and a batch size of 64.
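In PyTorch, this training configuration corresponds to the following sketch, assuming the quoted decay rate denotes SGD momentum (the standard reading) and that model is the network under training:

```python
import torch

# SGD with momentum 0.9, weight decay 4e-5, initial learning rate 0.05;
# "model" is assumed to be the backbone with the attention module inserted.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=4e-5)
```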
The CIFAR-10 [32] and CIFAR-100 [32] datasets both contain 60,000 RGB images at 32 × 32 resolution. Among them, 83.33% are used as the training set, and 16.67% are used as the test set. The CIFAR-10 dataset contains 10 categories, and the CIFAR-100 dataset contains 100 categories.
The STL-10 [33] dataset contains 113,000 RGB images with 96 × 96 resolution from ImageNet, and includes 10 categories. The training set contains 5000 images, the test set contains 8000 images, and the unlabeled set contains 100,000 unlabeled images. Only the training and test sets of STL-10 are used in the experiments of this paper, and the unlabeled dataset is not used.
To evaluate the performance of the proposed method, five methods are compared: SE-Net (channel attention) [4], ECA-Net (lightweight channel attention) [5], CBAM (channel + spatial attention) [13], CA (space embedded channel attention) [17] and SPCII (the mechanism proposed in this paper). In addition, since no prior work validates all of these modules on the same public datasets, all models are retrained in this paper under a consistent hardware environment with the modules inserted at the same network locations. The Parameters, GFLOPs, and error rates reported in Tables 1, 2, 3, 4, 5 and 6 are averages over 10 runs in the same environment.
Table 1
Comparison between different CNN architectures on the CIFAR-10 and CIFAR-100 datasets

| Description | Parameters (M) | GFLOPs | CIFAR-10 Error (%) | CIFAR-100 Error (%) |
| --- | --- | --- | --- | --- |
| MobileNetV2 (Baseline) | 2.237 | 0.326284 | 17.98 | 48.19 |
| MobileNetV2 + SE | 2.258 (+0.94%) | 0.328205 (+0.59%) | 17.53 (−2.50%) | 48.45 (+0.54%) |
| MobileNetV2 + ECA | 2.273 (+1.61%) | 0.352068 (+7.90%) | 17.27 (−3.95%) | 48.02 (−0.35%) |
| MobileNetV2 + CBAM | 2.811 (+25.66%) | 0.345825 (+5.99%) | 17.60 (−2.11%) | 47.65 (−1.12%) |
| MobileNetV2 + CA | 2.682 (+19.89%) | 0.336191 (+3.04%) | 17.80 (−1.00%) | 47.24 (−1.97%) |
| MobileNetV2 + Ours | 2.682 (+19.89%) | 0.339166 (+3.95%) | 17.20 (−4.33%) | 47.07 (−2.32%) |
| ResNet18 | 11.182 | 1.824 | 14.77 | 44.61 |
| ResNet18 + SE | 11.277 (+0.85%) | 1.824 (+0.00%) | 14.09 (−4.60%) | 43.02 (−3.56%) |
| ResNet18 + ECA | 12.577 (+12.48%) | 1.826 (+0.11%) | 15.27 (+3.39%) | 43.30 (−2.94%) |
| ResNet18 + CBAM | 11.256 (+0.66%) | 1.826 (+0.11%) | 13.93 (−5.69%) | 42.29 (−5.20%) |
| ResNet18 + CA | 11.256 (+0.66%) | 1.826 (+0.11%) | 13.93 (−5.69%) | 42.22 (−5.36%) |
| ResNet18 + Ours | 11.256 (+0.66%) | 1.827 (+0.16%) | 13.46 (−8.87%) | 42.06 (−5.72%) |

Relative changes versus the corresponding baseline are shown in parentheses. The best values under different descriptions are shown in bold.
Table 2
Comparison between ResNet architectures with different depths on the CIFAR-10 and CIFAR-100 datasets

| Description | Parameters (M) | GFLOPs | CIFAR-10 Error (%) | CIFAR-100 Error (%) |
| --- | --- | --- | --- | --- |
| ResNet20 | 11.228 | 1.824 | 16.43 | 41.61 |
| ResNet20 + SE | 11.234 (+0.05%) | 1.824 (+0.00%) | 16.22 (−1.28%) | 41.02 (−1.42%) |
| ResNet20 + ECA | 12.623 (+12.42%) | 1.826 (+0.11%) | 16.21 (−1.34%) | 44.30 (+6.46%) |
| ResNet20 + CBAM | 11.318 (+0.80%) | 1.825 (+0.05%) | 16.11 (−1.95%) | 41.29 (−0.77%) |
| ResNet20 + CA | 11.303 (+0.67%) | 1.826 (+0.11%) | 16.08 (−2.13%) | 40.22 (−3.34%) |
| ResNet20 + Ours | 11.303 (+0.67%) | 1.827 (+0.16%) | 15.99 (−2.68%) | 40.03 (−3.80%) |
| ResNet32 | 21.336 | 3.678 | 15.32 | 40.67 |
| ResNet32 + SE | 21.497 (+0.75%) | 3.680 (+0.05%) | 14.35 (−6.33%) | 38.28 (−5.88%) |
| ResNet32 + ECA | 23.857 (+11.82%) | 3.682 (+0.11%) | 15.22 (−0.65%) | 39.58 (−2.68%) |
| ResNet32 + CBAM | 21.449 (+0.53%) | 3.681 (+0.08%) | 14.27 (−6.85%) | 37.60 (−7.55%) |
| ResNet32 + CA | 21.471 (+0.63%) | 3.683 (+0.14%) | 14.15 (−7.64%) | 36.04 (−11.38%) |
| ResNet32 + Ours | 21.471 (+0.63%) | 3.684 (+0.16%) | 14.17 (−7.51%) | 36.47 (−10.33%) |
| ResNet56 | 23.713 | 4.132 | 16.29 | 37.58 |
| ResNet56 + SE | 26.244 (+10.67%) | 4.140 (+0.19%) | 16.27 (−0.12%) | 37.39 (−0.51%) |
| ResNet56 + ECA | 23.121 (−2.50%) | 4.141 (+0.22%) | 16.33 (+0.25%) | 37.33 (−0.67%) |
| ResNet56 + CBAM | 23.571 (−0.60%) | 4.140 (+0.19%) | 15.27 (−6.26%) | 36.98 (−1.60%) |
| ResNet56 + CA | 23.793 (+0.34%) | 4.160 (+0.68%) | 15.32 (−5.95%) | 36.55 (−2.74%) |
| ResNet56 + Ours | 23.793 (+0.34%) | 4.160 (+0.68%) | 15.22 (−6.57%) | 36.42 (−3.09%) |
| ResNet110 | 41.361 | 7.616 | 13.12 | 36.37 |
| ResNet110 + SE | 41.666 (+0.74%) | 7.618 (+0.03%) | 12.96 (−1.22%) | 36.22 (−0.41%) |
| ResNet110 + ECA | 41.613 (+0.61%) | 7.623 (+0.09%) | 12.40 (−5.49%) | 36.22 (−0.41%) |
| ResNet110 + CBAM | 41.389 (+0.07%) | 7.620 (+0.05%) | 12.77 (−2.67%) | 36.32 (−0.14%) |
| ResNet110 + CA | 41.610 (+0.60%) | 7.625 (+0.12%) | 12.84 (−2.13%) | 36.21 (−0.44%) |
| ResNet110 + Ours | 41.610 (+0.60%) | 7.627 (+0.14%) | 12.00 (−8.54%) | 36.14 (−0.63%) |

Relative changes versus the corresponding baseline are shown in parentheses. The best values under different descriptions are shown in bold.
Table 3
Comparison between different CNN architectures on the STL-10 dataset

| Description | Parameters (M) | GFLOPs | STL-10 Error (%) |
| --- | --- | --- | --- |
| MobileNetV2 (Baseline) | 2.237 | 0.326284 | 34.76 |
| MobileNetV2 + SE | 2.258 (+0.94%) | 0.328 (+0.53%) | 34.96 (+0.58%) |
| MobileNetV2 + ECA | 2.273 (+1.61%) | 0.352 (+7.88%) | 40.52 (+16.57%) |
| MobileNetV2 + CBAM | 2.811 (+25.66%) | 0.345825 (+5.99%) | 38.45 (+10.62%) |
| MobileNetV2 + CA | 2.682 (+19.89%) | 0.336191 (+3.04%) | 34.22 (−1.55%) |
| MobileNetV2 + Ours | 2.682 (+19.89%) | 0.339166 (+3.95%) | 33.40 (−3.91%) |
| ResNet18 (Baseline) | 12.577 | 1.826 | 39.34 |
| ResNet18 + SE | 11.187 (−11.05%) | 1.824 (−0.11%) | 36.38 (−7.52%) |
| ResNet18 + ECA | 11.884 (−5.51%) | 1.825 (−0.05%) | 38.36 (−2.49%) |
| ResNet18 + CBAM | 11.188 (−11.04%) | 1.825 (−0.05%) | 35.50 (−9.76%) |
| ResNet18 + CA | 11.256 (−10.50%) | 1.826 (+0.00%) | 35.88 (−8.80%) |
| ResNet18 + Ours | 11.256 (−10.50%) | 1.827 (+0.05%) | 35.12 (−10.73%) |

Relative changes versus the corresponding baseline are shown in parentheses. The best values under different descriptions are shown in bold.
Table 4
Comparison between ResNet architectures with different depths on the STL-10 dataset

| Description | Parameters (M) | GFLOPs | STL-10 Error (%) |
| --- | --- | --- | --- |
| ResNet20 | 17.2506 | 1.319 | 35.33 |
| ResNet20 + SE | 17.4088 (+0.92%) | 1.322 (+0.23%) | 37.73 (+6.79%) |
| ResNet20 + ECA | 19.4346 (+12.66%) | 1.322 (+0.23%) | 35.83 (+1.42%) |
| ResNet20 + CBAM | 17.4682 (+1.26%) | 1.335 (+1.21%) | 34.17 (−3.28%) |
| ResNet20 + CA | 17.8474 (+3.46%) | 1.326 (+0.53%) | 34.88 (−1.27%) |
| ResNet20 + Ours | 17.8474 (+3.46%) | 1.326 (+0.53%) | 33.62 (−4.84%) |
| ResNet34 | 21.116 | 7.1199 | 40.49 |
| ResNet34 + SE | 21.277 (+0.76%) | 7.1211 (+0.02%) | 36.95 (−8.74%) |
| ResNet34 + ECA | 21.13 (+0.07%) | 7.1261 (+0.09%) | 37.26 (−7.98%) |
| ResNet34 + CBAM | 21.27 (+0.73%) | 7.1244 (+0.06%) | 37.10 (−8.37%) |
| ResNet34 + CA | 21.251 (+0.64%) | 7.1233 (+0.05%) | 36.51 (−9.83%) |
| ResNet34 + Ours | 21.251 (+0.64%) | 7.1238 (+0.05%) | 35.54 (−12.23%) |
| ResNet56 | 23.592 | 4.132 | 50.92 |
| ResNet56 + SE | 26.060 (+10.46%) | 4.140 (+0.19%) | 50.25 (−1.32%) |
| ResNet56 + ECA | 63.790 (+170.39%) | 4.177 (+1.09%) | 49.69 (−2.42%) |
| ResNet56 + CBAM | 26.061 (+10.47%) | 4.141 (+0.22%) | 49.80 (−2.20%) |
| ResNet56 + CA | 25.446 (+7.86%) | 4.170 (+0.92%) | 49.22 (−3.34%) |
| ResNet56 + Ours | 25.446 (+7.86%) | 4.182 (+1.21%) | 49.02 (−3.73%) |
| ResNet101 | 41.361 | 7.616 | 41.00 |
| ResNet101 + SE | 41.386 (+0.06%) | 7.618 (+0.03%) | 56.00 (+36.59%) |
| ResNet101 + ECA | 43.757 (+5.79%) | 7.620 (+0.05%) | 47.623 (+16.15%) |
| ResNet101 + CBAM | 41.389 (+0.07%) | 7.620 (+0.05%) | 67.22 (+63.95%) |
| ResNet101 + CA | 46.146 (+11.57%) | 7.942 (+4.28%) | 41.42 (+1.02%) |
| ResNet101 + Ours | 41.610 (+0.60%) | 7.627 (+0.14%) | 40.94 (−0.15%) |

Relative changes versus the corresponding baseline are shown in parentheses. The best values under different descriptions are shown in bold.
Table 5
Ablation experiment data for the space embedded channel module

| Description | Parameters (M) | GFLOPs | CIFAR-10 Error (%) |
| --- | --- | --- | --- |
| MobileNetV2 (Baseline) | 2.237 | 0.326284 | 17.98 |
| + CA | 2.682 (+19.89%) | 0.336191 (+3.04%) | 17.80 (−1.00%) |
| + Ours (M + C) | 2.682 (+19.89%) | 0.336191 (+3.04%) | 17.77 (−1.17%) |
| + Ours (A + C) | 2.682 (+19.89%) | 0.336191 (+3.04%) | 17.78 (−1.11%) |
| + Ours (M + A + C) | 2.682 (+19.89%) | 0.339166 (+3.95%) | 17.20 (−4.33%) |

Relative changes versus the baseline are shown in parentheses. The best values under different descriptions are shown in bold.
Table 6
Ablation experiment data for the channel embedded space module

| Description | Parameters (M) | GFLOPs | CIFAR-10 Error (%) |
| --- | --- | --- | --- |
| MobileNetV2 (Baseline) | 2.237 | 0.326284 | 17.98 |
| + CA | 2.682 (+19.89%) | 0.336191 (+3.04%) | 17.80 (−1.00%) |
| + Ours (M + A) | 2.682 (+19.89%) | 0.332205 (+1.81%) | 17.62 (−2.00%) |
| + Ours (M + A + C) | 2.682 (+19.89%) | 0.339166 (+3.95%) | 17.20 (−4.33%) |

Relative changes versus the baseline are shown in parentheses. The best values under different descriptions are shown in bold.

Image classification on the CIFAR datasets

We conduct classification experiments on the CIFAR-10 and CIFAR-100 datasets to evaluate the SPCII attention mechanism module, following the training rules and parameters in Sect. "Experiment setup", and embed the SPCII module into the MobileNetV2 (lightweight network) and ResNet (deep network) series of classification networks. First, SPCII is embedded into the MobileNetV2 and ResNet18 backbone models. Then, the ResNet depth is increased to observe the robustness of SPCII.

Comparison between different backbone models

The SPCII proposed in this paper is embedded into MobileNetV2 and ResNet18, and its performance is compared with that of representative SE (channel attention mechanism) [4], ECA (lightweight channel attention mechanism) [5], CBAM (channel + spatial attention mechanism) [13], and CA (space embedded channel attention mechanism) [17].
Table 1 clearly shows that the proposed SPCII module improves the performance of the MobileNetV2 and ResNet18 baseline networks on the CIFAR-10 and CIFAR-100 datasets, further verifying its universality on different network architectures.
With respect to the MobileNetV2 model on the CIFAR-10 dataset, the SPCII module achieves a more significant error rate reduction than the other attention modules (SE, ECA, CBAM, and CA). The SPCII module has the lowest error rate, 4.33% lower than that of the baseline network. When the MobileNetV2 model is applied to the CIFAR-100 dataset and the ResNet18 model is applied to the CIFAR-10 and CIFAR-100 datasets, the SPCII module reduces the error rates by 2.32%, 8.87%, and 5.72%, respectively. The excellent classification accuracy of this paper's method stems from the adaptability of SPCII, which enables the model to focus on key regions in different spatial dimensions, making it more effective at capturing the local features and location information of the targets. In addition, on the CIFAR-100 dataset, the SPCII module consistently performs well on the MobileNetV2 and ResNet18 models, demonstrating its robustness on datasets with many categories.
Furthermore, the impact of the SPCII module on the model parameters is also investigated. The results in Table 1 show that the parameter size of SPCII on ResNet18 is 11.256 M, which reduces the Parameters by 0.19% and 11.82% compared with those of the SE and ECA variants, respectively. Compared with plain ResNet18, SPCII increases the parameters by only 0.66%. This indicates that, while maintaining a relatively low parameter increase, the SPCII algorithm achieves lower error rates on the CIFAR datasets, at the cost of slightly higher GFLOPs than the other algorithms.

Comparison between different ResNet depths

In this section, the robustness of SPCII is demonstrated by increasing the depth of ResNet. As shown in Table 2, the SPCII algorithm reduces the error rate by 2.68%, 7.51%, 6.57% and 8.54% relative to the baselines ResNet20, ResNet32, ResNet56 and ResNet110 on CIFAR-10, respectively. On CIFAR-100, the SPCII algorithm reduces the error rate by 3.80%, 10.33%, 3.09% and 0.63% relative to ResNet20, ResNet32, ResNet56, and ResNet110, respectively. Even as the depth of ResNet increases, the error rate of SPCII still decreases and outperforms the other attention modules, which fully demonstrates the power of SPCII. In Fig. 3, as the ResNet depth gradually increases, the proposed SPCII algorithm achieves a relatively low error rate with only a slight increase in parameter size and computational complexity, demonstrating its performance advantage in deep networks. The SPCII module introduces an adaptive channel interaction mode that adaptively focuses on information from different positions and spatial dimensions in the input tensor; in deep networks, it can better adapt to different levels of feature representation and capture more complex features and relationships.

Image classification on the STL-10 dataset

Classification experiments are conducted on the STL-10 dataset to evaluate the SPCII attention mechanism module, following the training rules and parameters in Sect. "Experiment setup", and the SPCII module is embedded into the MobileNetV2 and ResNet series of classification networks. First, SPCII is embedded into the MobileNetV2 and ResNet18 backbone models. Then, the ResNet depth is increased to observe the robustness of SPCII.

Comparison between different backbone models

Table 3 shows the performance variations of MobileNetV2 and ResNet18 on the STL-10 dataset in terms of the number of parameters, computational complexity, and classification error rate. The accuracy of the proposed SPCII algorithm is better than that of the other attention modules, although SPCII increases the parameter size and computational complexity of MobileNetV2.
With respect to the MobileNetV2 model on the STL-10 dataset, the SPCII module reduces the error rate by 3.91%, a more significant improvement than the other attention modules (SE, ECA, CBAM, and CA). These findings show that the SPCII module has significant performance advantages when applied to lightweight networks and small training sets. A similar trend is verified for the ResNet18 model on the STL-10 dataset, where the SPCII module reduces the error rate by 10.73%, further demonstrating its effectiveness. In addition, the results show that on the STL-10 dataset, the error rate of the MobileNetV2 model increases slightly when the SE module is added and increases dramatically when the ECA module is added, suggesting that plain channel attention or mixed attention mechanisms are ineffective on lightweight networks with small training sets: small datasets fail to provide sufficiently diverse samples, limiting the model's ability to learn rich feature representations. The effectiveness of the embedded design in the CA and SPCII attention mechanisms is thus well demonstrated.

Comparison between different ResNet depths

The training set of the STL-10 dataset is much smaller than that of the CIFAR datasets. Models therefore tend to overfit the training set, leading to poor test performance, which makes STL-10 a good test of the robustness of SPCII with a small training set.
As shown in Fig. 4, the error rate of SPCII decreases as the model depth of ResNet increases. This trend indicates that SPCII can still effectively improve the model performance at deeper levels and be superior to other attention mechanism modules at different depths.
Table 4 shows the performance improvements of SPCII relative to the baseline ResNet models (ResNet20, ResNet34, ResNet56, and ResNet101) at different depths. At 20 layers, SPCII reduces the error rate by 4.84% compared with ResNet20, and at 101 layers, SPCII reduces the error rate by 0.15% compared with ResNet101. In contrast, the other attention modules may degrade the accuracy in deep networks; in particular, SE, ECA, and CBAM increase the error rate of ResNet101 by 36.59%, 16.15%, and 63.95%, respectively, and CA by 1.02%. Across depths from 20 to 101 layers, SPCII still achieves better performance than the other algorithms. SPCII comprehensively considers information in both the horizontal and vertical directions and introduces additional combination methods during merging, allowing a more comprehensive understanding of the features in the images. SPCII thus remains robust when the training set is small and the network is deep.

Grad-CAM visualization result plots

To better compare the above results, Grad-CAM [34] is used to visualize them. The heat map generated by Grad-CAM indicates the areas of the image that the model focuses on, helping to understand how the model makes decisions; it is highly useful for localizing the features used in image classification tasks. Figures 5 and 6 show the visualization results on the STL-10 dataset after adding SE, ECA, CBAM, CA and SPCII to MobileNetV2 and ResNet20, respectively.
As shown in Fig. 5, SPCII is superior to the other algorithms in terms of coverage in the Grad-CAM visualization results of MobileNetV2. As shown in Fig. 6, SPCII is superior at capturing classification details in the Grad-CAM visualization results of ResNet20. In addition, the SPCII module is able to guide the network to focus on more important overall and detailed features while ignoring unimportant features.
For example, in the third, fourth, and fifth columns of Fig. 5, the attention mechanism proposed in this paper focuses better on the head region of the target, enabling a significant improvement in recognition performance for specific categories. By introducing the attention mechanism, the model can enhance the representation of key regional features in a targeted manner when processing images, improving the accurate understanding of the target.
In the third, fourth, and fifth columns of Fig. 6, the attention mechanism proposed in this paper gives more attention to finer details such as the nose, eyes, and ears; highlighting this key local information helps improve detection accuracy for the target category in complex scenes.
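For reference, the computation behind these heat maps can be reproduced with a short, self-contained PyTorch function. This is a generic sketch of Grad-CAM [34] using module hooks, not the authors' visualization code; target_layer would be, for example, the last convolutional stage of MobileNetV2, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the
    spatially averaged gradients of the class score, then apply ReLU."""
    acts, grads = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    bwd = target_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.append(gout[0]))
    try:
        model.eval()
        logits = model(image)                     # image: (1, 3, H, W)
        if class_idx is None:
            class_idx = logits.argmax(dim=1).item()
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = grads[0].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
        cam = F.relu((weights * acts[0]).sum(dim=1))       # (1, h', w')
        cam = cam / (cam.max() + 1e-8)                     # scale to [0, 1]
    finally:
        fwd.remove()
        bwd.remove()
    return cam
```

Upsampling cam to the input resolution and overlaying it on the image yields heat maps of the kind shown in Figs. 5 and 6.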

Ablation studies

In the ablation experiments, the CIFAR-10 dataset is used with MobileNetV2 as the backbone model. We train for 200 epochs using the parameters in Sect. "Experiment setup" and report the classification errors, Parameters, and GFLOPs on the test data. In Tables 5 and 6, MaxPool is abbreviated as “M”, AvgPool as “A”, and channel as “C”.

Space embedded channel module

To validate the effectiveness of the combination of multiple pooling units in the space embedded channel module, we conduct experiments on the CIFAR-10 dataset with MobileNetV2 (Baseline), + CA, and variants of our module with different pooling layers removed. As shown in Table 5, in the ablation data for “where” embedded into “what”, the configurations + CA, + Ours(M + C), + Ours(A + C) and + Ours(M + A + C) reduce the error rate by 1.00%, 1.17%, 1.11% and 4.33%, respectively, compared with the baseline. This shows that the proposed combination of multiple pooling units is very effective at improving model performance: all models with the attention module achieve a significant reduction in the error rate relative to MobileNetV2 (Baseline).
Compared with MobileNetV2 (Baseline), the parameters increase by 19.89% with the addition of the attention module, and the GFLOPs increase by 3.04% for the CA, Ours(M + C) and Ours(A + C) modules and by 3.95% for Ours(M + A + C). Table 5 demonstrates that the proposed combination of multiple pooling units is effective. Adding the attention module introduces additional computational overhead, but the reduction in error rate justifies the extra computation.

Channel embedded space module

To validate the effectiveness of the channel embedded space module, experiments are conducted on the CIFAR-10 dataset with MobileNetV2 (Baseline) and + CA [17], with and without the channel embedded space module. As shown in Table 6, in the ablation data for “what” embedded into “where”, the configurations + CA, + Ours(M + A), and + Ours(M + A + C) reduce the error rates by 1.00%, 2.00%, and 4.33%, respectively, compared with MobileNetV2 (Baseline). This shows that the proposed channel embedded space module is very effective at improving model performance: all models with the attention module achieve a significant reduction in the error rate.
Compared with MobileNetV2 (Baseline), the parameters increase by 19.89% with the addition of the attention module, and the GFLOPs increase by 3.04%, 1.81% and 3.95% for the CA, + Ours(M + A) and + Ours(M + A + C) modules, respectively. The ablation results demonstrate the effectiveness of the proposed channel embedded space module on the CIFAR-10 dataset, which significantly reduces the model's error rate while offering favorable computational performance compared with the other modules.

Algorithm limitations

In this section, the limitations of the proposed algorithm are analyzed through the Grad-CAM visualizations above. In particular, Fig. 7 shows that the algorithm cannot classify well when the target features are not obvious or when the target is occluded. Figure 7 shows the visualization results of MobileNetV2 on the STL-10 dataset with SE, ECA, CBAM, CA, and SPCII added for comparison.
(1)
The target features to be classified are not obvious. In the first row of Fig. 7, the aircraft's style differs from most aircraft in the STL-10 dataset, so the within-category features are not obvious, and the model may fail to determine which features are critical, affecting its classification performance.
 
(2)
The target to be classified is occluded. When one or more parts of the target are occluded, the model may fail to capture the target's shape or key features completely, leading to incorrect classification. In the second row of Fig. 7, the dog's ears are occluded, and the model cannot obtain complete target information, resulting in a classification failure.
 

Differences from existing algorithms

This section provides a detailed analysis of representative algorithms (SE (channel attention) [4], ECA (lightweight channel attention) [5], CBAM (channel + spatial attention) [13], and CA (space embedded channel attention) [17]) in terms of logical ideas, mixing approaches, advantages, and disadvantages, and compares them with the proposed SPCII algorithm. The specific differences are shown in Table 7.
Table 7
Comparison between different attention mechanism algorithms

| Algorithm | Logical idea | Mixing approach | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| SE [4] | Squeeze and excitation | Channel attention mechanism | Enhances important channels; captures global information | Lacks local information; high model complexity; long computational time |
| ECA [5] | Improved excitation module | Channel attention mechanism | Enhances important channels; captures global information; cross-channel interaction without dimensionality reduction | Lacks long-distance dependencies |
| CBAM [13] | Predicts channel and spatial attention separately | Channel attention in tandem with spatial attention | Focuses on key regions; establishes remote dependencies; rich channel and spatial information | Overfocuses on local features; increases network computation and complexity |
| CA [17] | Splits the spatial dimension into horizontal and vertical parts and embeds them into the channels | Space embedded channel attention mechanism | Enriches channel information in the spatial dimension; remote spatial interaction; low computational overhead | Lacks local feature information; processes channel information by dimensionality reduction |
| SPCII | Splits the spatial dimension into horizontal and vertical parts and embeds them into the channels; embeds channel information into the horizontal and vertical spatial dimensions | Mutually embedded spatial and channel attention mechanism | Enriches the interaction between spatial and channel information; cross-channel interaction without dimensionality reduction; remote spatial interaction | Increases the number of network parameters |
In terms of logical ideas and mixing approaches, the SPCII algorithm differs from SE, ECA and CBAM in that it does not solely apply channel or spatial attention mechanisms in parallel or in series. Compared with the CA algorithm, the proposed method not only embeds the spatial dimension into the channel information; after splitting the spatial dimension into horizontal and vertical dimensions and embedding them into the channels, the obtained channel information is embedded back into the horizontal and vertical spatial dimensions, realizing the mutual embedding of space and channels.
In terms of advantages, compared with SE, ECA, CBAM and CA, the SPCII algorithm not only considers channel information in the spatial dimension, but also realizes cross-channel information interaction without dimensionality reduction, enriching the interaction between spatial and channel information.
In terms of disadvantages, the SPCII algorithm increases the number of network parameters compared with SE, ECA, and CA, but it retains local and global information and enhances long-range spatial interaction.
In summary, the SPCII algorithm differs from other algorithms in its focus on the interaction between channel and spatial information. SPCII embeds the spatial dimension into the channel information, enriching the channel information in the spatial dimension; the channel information is then embedded into the horizontal and vertical spatial dimensions, and while considering both channel and spatial information, the cross-channel interaction without dimensionality reduction maintains the important relationships between channels. Other algorithms focus more on either channel or spatial information, while the SPCII algorithm effectively integrates the two, considering the interrelationship between channel and spatial information.

Conclusion

To improve the performance of convolutional neural network models in deep learning, this paper proposes a new attention mechanism module (SPCII) with spatial perception and channel information interaction. SPCII cascades a space embedded channel module and a channel embedded space module. The space embedded channel module embeds the horizontal and vertical dimensions into the channel dimension, performs global maximum and average pooling, merges the pooled features, and then splits them by the horizontal and vertical dimensions to obtain two sets of aggregated features in the horizontal and vertical directions of the channels, effectively strengthening the representational competence for the object of interest. The channel embedded space module uses a channel interaction model with an adaptive convolution kernel size and embeds channel information into the two spatial dimensions through 1D convolution to obtain two attention maps. Ablation experiments on the CIFAR-10 dataset, with MobileNetV2 as the baseline classification architecture, demonstrate the effectiveness of both the space embedded channel module and the channel embedded space module. SPCII is then compared with popular attention modules on MobileNetV2 and ResNet architectures of various depths. The experimental results show that the proposed SPCII algorithm achieves the best overall results in GFLOPs and accuracy despite a slight increase in parameter size, and it is robust across ResNet depths. Finally, this paper uses Grad-CAM to visualize different attention modules on the STL-10 dataset. The visualization results indicate that SPCII focuses the classification model more accurately on the features of the target object, realizing the true purpose of the attention mechanism. However, when the target features are not obvious or the target is occluded, specific styles and partial occlusion can degrade model performance; future research will concentrate on improving the model's ability to adapt to these challenges and will insert the module into more classification methods to verify its effectiveness on more public datasets.

Acknowledgements

We thank the Science and Technology Development Plan of Jilin Province for help identifying collaborators for this work.

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References

33. Adam C, Honglak L, Andrew Y (2011) An analysis of single-layer networks in unsupervised feature learning. Int Conf Artif Intell Stat 15:215–223
Metadata

Publication date: 06.05.2024
Publisher: Springer International Publishing
Journal: Complex & Intelligent Systems (Print ISSN 2199-4536, Electronic ISSN 2198-6053)
DOI: https://doi.org/10.1007/s40747-024-01445-9