
Open Access 27.03.2023 | Original Article

Differentiable channel pruning guided via attention mechanism: a novel neural network pruning approach

Authors: Hanjing Cheng, Zidong Wang, Lifeng Ma, Zhihui Wei, Fawaz E. Alsaadi, Xiaohui Liu

Published in: Complex & Intelligent Systems | Issue 5/2023


Abstract

Neural network pruning offers great prospects for facilitating the deployment of deep neural networks on computational-resource-limited devices. Neural architecture search (NAS) provides an efficient way to automatically seek an appropriate neural architecture design for the compressed model. It is observed that existing NAS-based pruning methods usually lack layer information when searching for the optimal neural architecture. In this paper, we propose a new NAS approach, namely, the differentiable channel pruning method guided via attention mechanism (DCP-A), where the adopted attention mechanism provides layer information to guide the optimization of the pruning policy. The training process is differentiable with Gumbel-softmax sampling, while parameters are optimized under a two-stage training procedure. The network block with the shortcut is dedicatedly designed, which helps prune the network not only in width but also in depth. Extensive experiments are performed to verify the applicability and superiority of the proposed method. Detailed analysis with visualization of the pruned model architecture shows that our proposed DCP-A learns explainable pruning policies.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

Deep neural networks (DNNs) have achieved remarkable accomplishments in a variety of applications such as pattern recognition [5, 18, 31, 41, 60], and have also shown sustained superiority in comparison with other methods. However, the large number of model parameters and the high performance demand on GPUs have brought about great challenges in storage and time costs. Therefore, much research attention has been devoted to the problem of running DNNs on computationally limited devices such as mobile equipment and embedded devices. As a rather popular approach, neural network pruning offers great prospects for facilitating the deployment of DNNs on computational-resource-limited devices. In general, the widely applied neural network pruning approaches can be divided into two categories, namely, weight pruning [8, 10, 20, 63, 64] and channel pruning [4, 6, 12, 21, 61, 62]. Since weight pruning cannot harvest obvious acceleration for modern networks due to its unstructured operation manner, we focus on channel pruning in this paper.
There are two types of channel pruning methods, i.e., criterion-based and NAS-based channel pruning methods. The main procedure of the criterion-based channel pruning techniques is to first determine the basic criterion and then prune filters hierarchically, which requires us to manually set the pruning ratio for each layer. In practice, the pruning ratio is usually set to be equal for each layer so as to simplify the entire process. Unfortunately, such a simplification could lead to poor performance due to the fact that different layers possess different redundancies. On the other hand, neural architecture search provides a powerful tool to automatically seek efficient neural architectures. So far, extensive studies have been conducted on the neural architecture search problem with the aim of exploring the optimal network structures in a large design space while taking into account the trade-off among model size, speed and accuracy. Note that, when utilizing traditional NAS-based methods, we usually confront a search space that is too large to be searched at an acceptable speed. Although some effort has been devoted to reducing the search space [12, 22, 24, 28, 38], the layer information, i.e., the information of filters in one layer, has seldom been taken into consideration when it comes to the optimization of the pruning policy. Basically, most criterion-based pruning methods fail to take the correlation among layers into consideration, while NAS-based methods usually ignore the information of individual filters in a layer.
Inspired by the above discussion, in this paper, we propose a new differentiable channel pruning framework guided via attention mechanism (DCP-A), shown in Fig. 1, where a certain policy is used to determine the pruning decision and the importance scores in a layer are used to guide the optimization of the policy logit. The importance scores can be obtained by any pruning criterion; in this paper, we choose the importance scores obtained by the \(l_1\) norm, the \(l_2\) norm and the attention mechanism, among which the policy logit guided by the attention mechanism shows the best experimental results. The attention mechanism is a concept derived from cognitive psychology that allows models to devote limited resources to more important channels [35]. A pruning-or-not policy is sampled from the policy logit, which is defined for each filter in the network. To obtain layer information, the attention score with an attention-guided loss is adopted to regulate the optimization of the policy logits. Hence, the attention score provides the correlation of filters in a layer and, meanwhile, the attention-guided loss limits the search space of the pruning policy. Moreover, a two-stage training procedure is proposed to ensure that the introduced attention modules are well-trained and easily removed (without increasing the FLOPs of the final pruned network).
The main contributions of this paper can be highlighted as follows:
(1) a new NAS-based differentiable channel pruning framework is proposed, where importance scores obtained by different mechanisms (including the attention mechanism) are adopted to provide layer information for the optimization of the pruning policy logits;
(2) a two-stage training procedure with designed training objectives is proposed to optimize the network parameters, the policy logits and the attention modules;
(3) for networks with a shortcut structure (e.g. ResNet), the proposed DCP-A algorithm is capable of pruning networks not only in width but also in depth;
(4) the proposed DCP-A can be easily extended to the multi-model case;
(5) via extensive experiments, the effectiveness and efficiency of the proposed DCP-A framework are demonstrated on different databases, and detailed analysis is provided through structure visualization to show that the pruning policies learned by DCP-A are explainable.
 
The remainder of this paper is organized as follows. In the "Related work" section, we introduce related work on model pruning, neural architecture search and the attention mechanism. In the "Methodology" section, we describe our DCP-A framework in detail. The experimental study and the corresponding analysis are presented in the "Experiments" section. The "Conclusion" section concludes this paper.
Related work

In terms of its objectives, model pruning can be generally classified into two categories, namely, weight pruning and channel pruning. On one hand, weight pruning directly removes connections in filters, which leads to unstructured sparsity and, furthermore, makes it difficult to accelerate the inference with general-purpose hardware. On the other hand, channel pruning prunes entire filters so as to exploit existing basic linear algebra subprograms (BLAS) libraries, thereby achieving better acceleration. Considering how the pruning policy is designed, we can roughly divide channel pruning methods into criterion-based pruning and NAS-based pruning.

Criterion-based pruning

Generally, criterion-based pruning methods assess the importance of filters by utilizing filter weights or filter activations. In [21], the importance of a filter has been calculated by the corresponding sum of absolute weights, according to which the unimportant filters have been pruned. Filters with small \(l_2\) norm have been softly pruned in [13]. In [14], filters near the geometric median, whose contribution is the most replaceable, have been pruned. In [4], three criteria have been utilized to find the important filters satisfying the least replacing loss, the diversity and the high entropy of weights. It is worth noticing that all the aforementioned criterion-based methods use manually set pruning ratios for the layers.
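As a concrete illustration, the sketch below ranks the filters of one convolutional layer by the sum of absolute weights used in [21]; the function name and the fixed pruning ratio are our own illustrative choices, not part of any cited method.

```python
import torch

def l1_rank_filters(weight: torch.Tensor, prune_ratio: float = 0.5) -> torch.Tensor:
    """Rank the filters of one conv layer by the l1 norm of their weights
    (the absolute weights sum of [21]) and return indices of filters to keep."""
    # weight has shape (C_out, C_in, K, K); one score per output filter
    scores = weight.abs().sum(dim=(1, 2, 3))
    n_keep = weight.size(0) - int(weight.size(0) * prune_ratio)
    return torch.topk(scores, n_keep).indices
```

Applying the same fixed prune_ratio to every layer is exactly the manual setting criticized above, since different layers possess different redundancies.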

NAS-based pruning

In early results concerning NAS, the optimal network structures have been found by resorting to reinforcement learning [68] or evolutionary algorithms [55], which consume substantial computational resources. Gradient-based NAS methods [29, 53, 54] have been exploited to reduce the cost by making the search mechanism differentiable, or approximately differentiable, so as to enhance the search efficiency. In [25], a partial order pruning method has been developed to automatically search the architectures with the best trade-off between speed and accuracy. In [28], the channel number in each layer has been searched based on the artificial bee colony algorithm. In [26], the designed hypernetwork has taken latent vectors as the input and generated the weight parameters of the backbone network. It should be pointed out, however, that the aforementioned methods only take global network information into account, and there is still a lack of layer information when conducting the search.

Attention mechanism

In [16, 48], attention modules have been proposed to help DNNs focus on important channels and achieve better performance. Recently, the attention mechanism has been considered in model pruning as an importance evaluation criterion for filters. In [52], an attention module has been embedded into the model to generate scaling factors for channels, which are considered as channel importance scores. In [6], a long short-term memory has been introduced to generate a strategy indicating the number of pruned filters for each layer; in this strategy, attention blocks have been embedded in the network, and filters with lower attention scores have been blocked in a feed-forward manner. In both methods mentioned above, the attention score has been used directly to rank the filters in a layer.

Methodology

Approach overview

For a network that needs to be pruned, our goal is to learn a pruning policy that determines the filters to be pruned with the least performance loss. An attention module with attention scores is utilized to evaluate the importance of filters in a layer. Note that the attention modules are not expected to directly influence the optimization of the network parameters because they will be removed from the pruned network to avoid increasing the FLOPs. Therefore, we define pruning policies for all filters in the whole network and use the attention modules as a guidance tool only.
In Fig. 2, an overview of our proposed DCP-A training approach is illustrated, which consists of two stages in the training epochs: (1) the stage of training parameters of the network, and (2) the stage of optimizing attention modules (Squeeze-and-Excitation block used in this paper) as well as policy logits. To be more specific, such a two-stage approach is explained as follows.
(a) Stage one: In the first stage, the policy logits and the parameters of the attention blocks are fixed, while the parameters of the network are free (to be optimized). It should be mentioned that the attention modules do not participate in the feed-forward pass at this stage; only the average attention score of each attention block is recorded.
(b) Stage two: In the second stage, the parameters of the network are fixed, while the parameters of the attention blocks and the policy logits are set to be free (to be optimized). Here, the attention modules are activated for updating their parameters. The attention scores obtained in the previous stage are utilized as guidance for optimizing the policy logits.
 
By repeating the two stages alternately during training, the optimal pruning pattern can be learned, resulting in a well-pruned network. The Gumbel-softmax trick is utilized to make the training process differentiable. The details of our approach are described in the following.
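To make the alternation concrete, the following minimal Python sketch freezes and unfreezes the three parameter groups per stage; the function names and the per-stage training callables are our illustrative assumptions, not the authors' implementation.

```python
def set_trainable(params, flag):
    """Freeze (flag=False) or unfreeze (flag=True) a group of torch parameters."""
    for p in params:
        p.requires_grad_(flag)

def alternate_training(net_params, se_params, policy_logits,
                       run_stage_one, run_stage_two, num_epochs):
    """run_stage_one / run_stage_two are assumed callables that consume one
    epoch of the corresponding sub-training data (see 'Training setting')."""
    for _ in range(num_epochs):
        # Stage one: only the network weights learn; attention blocks are
        # bypassed in the forward pass and their average scores are recorded.
        set_trainable(net_params, True)
        set_trainable(se_params, False)
        set_trainable(policy_logits, False)
        run_stage_one()

        # Stage two: network weights frozen; SE blocks and policy logits are
        # optimized, guided by the attention scores recorded in stage one.
        set_trainable(net_params, False)
        set_trainable(se_params, True)
        set_trainable(policy_logits, True)
        run_stage_two()
```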

Attention mechanism

In this paper, the Squeeze-and-Excitation (SE) block proposed in [16] is employed to obtain the attention scores. The SE module (also known as the channel attention module) is able to select the most useful feature among channels, thereby improving the effectiveness of the feature representations. Moreover, SE block is an effective attention block that can be flexibly embedded into most existing network structures and, consequently, the SE block has been widely used in computer vision applications [45]. An SE block contains two parts in its structure, namely, squeeze and excitation.
(a) Squeeze: In the squeeze part, the global information of each feature channel is obtained by an average pooling layer. Assume that the input of the lth SE block is \(X_l=[x_l^1,x_l^2,\ldots ,x_l^C] \in \mathbb {R}^{H\times W\times C}\), then the average global information of each channel is defined as
$$\begin{aligned} z^k_l =\mathcal {A}(x_l^k)=\frac{1}{H\times W}\sum ^H_{i=1}\sum ^W_{j=1}x_l^k(i,j) \end{aligned}$$
(1)
where \(\mathcal {A}(\cdot )\) is the global average pooling function, and \(x_l^k(i,j)\) represents the pixel value.
(b) Excitation: In the excitation part, the global information is fused as follows to obtain the attention score \(S_l\) of each channel:
$$\begin{aligned} S_l=\delta (W_2\sigma (W_1z_l)) \end{aligned}$$
(2)
where \(W_1 \in \mathbb {R}^{\frac{C}{r}\times C\times 1\times 1}\) and \(W_2 \in \mathbb {R}^{C\times \frac{C}{r}\times 1\times 1}\) encode the correlation among channels; r is the reduction ratio; \(\sigma \) represents the ReLU activation function; and \(\delta \) denotes the Sigmoid activation function.
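For reference, Eqs. (1) and (2) translate almost line-by-line into the following minimal PyTorch module; returning the scores alongside the rescaled features is our own addition, reflecting how DCP-A consumes them as guidance rather than part of the original SE formulation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block producing per-channel attention scores."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # Eq. (1): global average pooling
        self.excite = nn.Sequential(            # Eq. (2): delta(W2 * sigma(W1 * z))
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        s = self.excite(self.squeeze(x))  # attention scores S_l in (0, 1)
        return x * s, s                   # rescaled features and the raw scores
```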
 
In the literature, it has been shown that the SE block possesses the ability to generate importance scores for channels, thereby enhancing the network performance. As shown in Fig. 3, pruning filters in one layer can be performed based on the attention scores: for example, we can set the threshold to 0.5 and prune roughly half of the filters, namely those with attention scores less than 0.5. However, for the whole network, such a technique is not applicable anymore, as the attention score only reflects the relationship of filters in the same layer. In Fig. 4, it can be seen that attention scores of different layers are widely separated, while those in the same layer are relatively concentrated within a very small range. Obviously, the network pruning would fail if we were to directly set a single threshold on attention scores to prune the whole network.
To overcome the above-mentioned difficulty, we define a pruning-or-not policy for each filter in the network.

Network pruning policy

Assume that a neural network has L layers with weights \(W_l\in \mathbb {R}^{K\times K\times C_l^I\times C_l^O}\), where K is the kernel size, \(C_l^I\) and \(C_l^O\) represent the sizes of input and output channels, respectively.
For the kth filter \(f_{l,k}\) in the lth layer, we introduce a binary-valued variable \(u_{l,k}\) to determine pruning or not. It should be mentioned that the probability of pruning \(f_{l,k}\) is sampled from a discrete probability distribution, so back-propagation is not possible owing to the non-differentiability of the sampling. Hence, we employ the Gumbel-softmax trick [17] to substitute the original non-differentiable sample (from a discrete distribution) with a differentiable sample (from a corresponding Gumbel-softmax distribution) [12, 44].
We use \(\pi _{l,k}=[1-\alpha _{l,k},\alpha _{l,k}]\) to represent the distribution vector of \(u_{l,k}\), where the logit \(\alpha _{l,k}\) indicates the probability of pruning \(f_{l,k}\). Then, in Gumbel-softmax sampling, \(u_{l,k}\) is generated as
$$\begin{aligned} u_{l,k} =\mathop {\arg \max }_{j\in \{0,1\}}\{\log \pi _{l,k}(j)+G_{l,k}(j)\} \end{aligned}$$
(3)
where
$$\begin{aligned} G_{l,k}=-\log (-\log U_{l,k}) \end{aligned}$$
is a sample from the standard Gumbel distribution, with \(U_{l,k}\) sampled i.i.d. from the uniform distribution \(\mathcal {U}(0,1)\). Then, the one-hot vector of \(u_{l,k}\) is relaxed to the soft decision \(v_{l,k}\) with the reparameterization trick as follows:
$$\begin{aligned} v_{l,k}(j) =\frac{\exp ((\log \pi _{l,k}(j)+G_{l,k}(j))/\tau )}{\sum _{i\in \{0,1\}}\exp ((\log \pi _{l,k}(i)+G_{l,k}(i))/\tau )} \end{aligned}$$
(4)
where \(j\in \{0,1\}\) and \(\tau \) is the softmax temperature. When \(\tau \rightarrow \infty \), the Gumbel-softmax distribution is smooth and \(\alpha _{l,k}\) can be optimized with gradient descent. When \(\tau \rightarrow 0\), \(v_{l,k}\) becomes one-hot.
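As a concrete illustration, Eqs. (3) and (4) can be sketched in a few lines of PyTorch; the tensor alpha is assumed to hold the pruning probabilities \(\alpha _{l,k}\) of one layer.

```python
import torch
import torch.nn.functional as F

def sample_pruning_decision(alpha: torch.Tensor, tau: float) -> torch.Tensor:
    """Differentiable pruning-or-not sampling via the Gumbel-softmax trick.

    alpha: per-filter pruning probabilities alpha_{l,k}, values in (0, 1).
    tau:   softmax temperature, annealed from 5 towards 0 during training.
    Returns soft decisions v_{l,k} of Eq. (4); column 0 means keep, column 1 prune.
    """
    log_pi = torch.log(torch.stack([1.0 - alpha, alpha], dim=-1) + 1e-20)
    u = torch.rand_like(log_pi)                     # U_{l,k} ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)   # standard Gumbel noise G_{l,k}
    return F.softmax((log_pi + g) / tau, dim=-1)    # Eq. (4)
```

PyTorch also ships torch.nn.functional.gumbel_softmax, which implements the same reparameterization given (log-)probabilities; the explicit version above simply mirrors the equations.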

Training objectives

For the training objectives, the accuracy losses comprise \(\mathcal {L}(\theta _W)\), \(\mathcal {L}(\theta _{SE})\) and \(\mathcal {L}(\theta _{\pi })\), which represent the accuracy losses when optimizing the parameters of the network, the SE blocks and the policy logits, respectively.
In consideration of the pruning mission, the sparsity regularization \(\mathcal {L}_{sparsity}(\theta _{\pi })\) is adopted to encourage filters to be pruned, and is defined as
$$\begin{aligned} \mathcal {L}_{sparsity}(\theta _{\pi }) =\frac{1}{L} \sum _l \left( w_l \sum _i (1-\alpha _{l,i})\right) \end{aligned}$$
(5)
where \(w_l\) represents the influence that pruning filters in the lth layer imposes on the FLOPs.
In most existing techniques, only \(\mathcal {L}_{sparsity}(\theta _{\pi })\) and \(\mathcal {L}(\theta _{\pi })\) are used to optimize \(\theta _{\pi }\). Since the attention score is introduced to provide the layer information in this paper, the attention-score-guided loss should be taken into consideration as an objective, which is defined as follows:
$$\begin{aligned} \mathcal {L}_{guided}(\theta _{\pi })=\frac{1}{L} \sum _l dist(S_l,1-\alpha _l) \end{aligned}$$
(6)
where \(S_l \in \mathbb {R}^{1\times 1\times C}\) is the average attention score of SE block obtained in the first stage, and \(dist(\cdot )\) measures the cosine distance as follows:
$$\begin{aligned} dist(u,v) = 1-\frac{u\cdot v}{\Vert u\Vert \Vert v \Vert } \end{aligned}$$
(7)
The training objectives in stage two are illustrated in Fig. 5.
Finally, the total loss function is defined as
$$\begin{aligned} \begin{aligned} \mathcal {L}_{total} =&\mathcal {L}(\theta _W)+\mathcal {L}(\theta _{SE})+\mathcal {L}(\theta _{\pi })\\&+\lambda _1\mathcal {L}_{sparsity}(\theta _{\pi })+\lambda _2\mathcal {L}_{guided}(\theta _{\pi }) \end{aligned} \end{aligned}$$
(8)
where \(\lambda _1\) and \(\lambda _2\) control the weights of \(\mathcal {L}_{sparsity}\) and \(\mathcal {L}_{guided}\), respectively, and \(\theta _W\), \(\theta _{SE}\) and \(\theta _{\pi }\) will be optimized alternately during training.
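Under these definitions, Eqs. (5)–(8) admit a compact sketch; the per-layer lists of policy logits, FLOPs weights and recorded attention scores are assumed inputs, and the task losses are taken as given.

```python
import torch
import torch.nn.functional as F

def sparsity_loss(alphas, flops_weights):
    """Eq. (5): encourage pruning, weighted by each layer's impact w_l on FLOPs.
    alphas: list of per-layer tensors alpha_l; flops_weights: list of scalars w_l."""
    terms = [w * (1.0 - a).sum() for a, w in zip(alphas, flops_weights)]
    return sum(terms) / len(alphas)

def guided_loss(attention_scores, alphas):
    """Eqs. (6)-(7): mean cosine distance between S_l and 1 - alpha_l."""
    dists = [1.0 - F.cosine_similarity(s.flatten(), (1.0 - a).flatten(), dim=0)
             for s, a in zip(attention_scores, alphas)]
    return sum(dists) / len(alphas)

# Eq. (8), with the three accuracy losses assumed to be computed elsewhere:
# loss_total = (loss_net + loss_se + loss_pi
#               + lambda1 * sparsity_loss(alphas, flops_weights)
#               + lambda2 * guided_loss(attention_scores, alphas))
```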
Consequently, the whole DCP-A framework is summarized in Algorithm 1.

Architectural design

Since network blocks with shortcuts are widely used nowadays, two types of block architecture (the basic block and the bottleneck block) are considered with a special design in this paper.
As shown in Fig. 6, for a basic block consisting of two convolutional layers and a shortcut, we use the same policy logit for the layers in the same block. For a bottleneck block with three convolutional layers, we use the same policy for the input and middle layers, and a new policy for the output layer. The pruning ratios of layers in the same block are thus set to be the same. Note that the shortcut is protected in our method: due to the special architecture of the shortcut, the output equals the input if the policies are zero vectors, which is equivalent to skipping the whole block. Hence, protecting the shortcut helps DCP-A skip network blocks and change the depth of the network.
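A minimal sketch of this design is given below; the soft keep-mask \(1-\alpha \) stands in for the Gumbel-softmax sample of the previous section, and all module and variable names are our illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrunableBasicBlock(nn.Module):
    """Basic block whose two conv layers share one per-filter policy logit.

    If the keep-mask is all zeros, both convolutions are suppressed and the
    output falls back to the protected shortcut, i.e. the block is skipped."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # One shared pruning probability alpha per filter for the whole block.
        self.alpha = nn.Parameter(torch.full((channels,), 0.5))

    def forward(self, x):
        # Soft keep-mask 1 - alpha; DCP-A samples the hard decision from this
        # distribution with the Gumbel-softmax trick instead.
        m = (1.0 - self.alpha).clamp(0.0, 1.0).view(1, -1, 1, 1)
        out = torch.relu(self.conv1(x)) * m  # same mask for both layers
        out = self.conv2(out) * m
        return torch.relu(out + x)           # the shortcut is never masked
```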

Extension to multi-model pruning

As shown in Fig. 7, "Widen-Compression" is provided in DCP-A for the multi-model pruning case. Assuming that the original layer in one model has 4 filters, the policy logit will be widened to 8 (doubled). When both strategies A and B choose to reserve a filter in the same position (e.g. the 7th position in Fig. 7), this filter will be shared by the pruned models. Hence, DCP-A can help design the shared structure in multi-model pruning.
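For one layer, the widening step can be sketched as follows; the 0.5 keep-threshold and the function name are our illustrative assumptions.

```python
import torch

def widen_compression(alpha_a: torch.Tensor, alpha_b: torch.Tensor,
                      threshold: float = 0.5):
    """'Widen-Compression' sketch: double the policy logit of one layer and
    count positions reserved by both strategies, which become shared filters."""
    widened = torch.cat([alpha_a, alpha_b])  # e.g. 4 filters -> 8 logits (Fig. 7)
    keep_a = alpha_a < threshold             # strategy A reserves these filters
    keep_b = alpha_b < threshold             # strategy B reserves these filters
    shared = (keep_a & keep_b).sum().item()  # reserved at the same position
    return widened, shared
```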

Experiments

Our implementation is in PyTorch [37] and runs on an NVIDIA 2080Ti GPU. Experiments on different databases demonstrate the effectiveness of our method. We also visualize various details of the pruned models to further explore the rationality of DCP-A.

Experimental settings

Databases

We evaluate our established DCP-A framework on the following databases: (1) CIFAR-10 and CIFAR-100 [19], each of which contains 60,000 color images, with 50,000 training images and 10,000 testing images; (2) ILSVRC-2012 [39] (ImageNet), a large-scale dataset containing 1.28 million training images and 50,000 validation images of 1,000 classes; and (3) NYU-v2 [42], which comprises 1,449 labeled images from a variety of indoor scenes recorded by both RGB and depth cameras, and includes 795 images for training and 654 images for validation. We use the 40-class annotation for semantic segmentation. During training, we resize the input images to \(224 \times 224\) and test at the full resolution of \(256 \times 512\).

Performance metrics

To evaluate the network compression and testing performance, the following measures are applied:
Acc.: The testing accuracy on image classification. Acc. \(\downarrow \) (%) is the accuracy drop between the pruned and baseline models; the smaller, the better. For CIFAR-10, the top-1 accuracy is provided, while for ILSVRC-2012, both top-1 and top-5 accuracies are reported.
FLOPs: The number of floating point operations (FLOPs) is used as an indicator of computation cost. We use FLOPs \(\downarrow \) (%) to describe the percentage of FLOPs reduced.
Pixel Acc.: Pixel Accuracy (Pixel Acc) on semantic segmentation. The higher, the better. It is defined as follows:
$$\begin{aligned} PA = \frac{\sum ^k_{i=0}p_{ii}}{\sum ^k_{i=0}\sum ^k_{j=0}p_{ij}} \end{aligned}$$
(9)
where \(p_{ij}\) denotes the number of pixels belonging to the ith class but predicted to be in the jth class, and k is the number of classes.
mIoU: Mean Intersection over Union (mIoU) on semantic segmentation. The higher, the better. It is defined as follows:
$$\begin{aligned} mIoU = \frac{1}{k+1}\sum ^k_{i=0}\frac{p_{ii}}{\sum ^k_{j=0}p_{ij}+\sum ^k_{j=0}p_{ji}-p_{ii}} \end{aligned}$$
(10)
\(\Delta \mathcal {T}\): Following [36, 44], a single relative performance measure with respect to the baseline is defined over the set M of semantic segmentation metrics as follows:
$$\begin{aligned} \Delta \mathcal {T} = \frac{1}{|M|}\sum _{j=1}^{|M|}(M_{\mathcal {T},j}-M_{baseline,j})/M_{baseline,j} \end{aligned}$$
(11)
where |M| represents the number of metrics.
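Assuming the predictions have been accumulated into a class confusion matrix, the three segmentation measures can be sketched as follows; this is a generic NumPy rendering of Eqs. (9)–(11), not the authors' evaluation code.

```python
import numpy as np

def pixel_accuracy(conf: np.ndarray) -> float:
    """Eq. (9); conf[i, j] = number of pixels of class i predicted as class j."""
    return np.diag(conf).sum() / conf.sum()

def mean_iou(conf: np.ndarray) -> float:
    """Eq. (10): per-class intersection over union, averaged over classes."""
    inter = np.diag(conf)
    union = conf.sum(axis=1) + conf.sum(axis=0) - inter
    return float(np.nanmean(inter / union))

def delta_t(pruned_metrics, baseline_metrics) -> float:
    """Eq. (11): mean relative change of the |M| metrics w.r.t. the baseline."""
    p = np.asarray(pruned_metrics, dtype=float)
    b = np.asarray(baseline_metrics, dtype=float)
    return float(np.mean((p - b) / b))
```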

Network architecture

We mainly focus on pruning ResNet [11], which has less redundancy than VGG-net [43]. An illustration of the pruned MobileNet structure is also provided in the "Pruned result visualization" section.

Training setting

For image classification, we train the parameters of the network and the attention blocks with the stochastic gradient descent (SGD) optimizer, an initial learning rate of 0.1, momentum of 0.9, a batch size of 256 and a weight decay of 0.0005. Following [44, 53], Adam is used for optimizing the policy logits with a constant learning rate of 0.01. \(\tau \) is initialized as 5 and then decayed to near 0. The loss weights \(\lambda _1\) and \(\lambda _2\) are both set to 0.5. The network is trained for 50 epochs to learn the policy logits on CIFAR, and for 10 epochs on ImageNet. After training, we obtain the optimal policy logits of the network and prune the network according to the limit on FLOPs. The attention blocks are removed from the pruned network, so they do not increase its FLOPs. Pruned models are trained for 200 epochs on CIFAR. A pre-trained model is used on ImageNet, where the total number of epochs is 100. The baseline training schedule follows [14]. The learning rate is divided by 5 at epochs 60, 120 and 160.
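A hedged sketch of this optimizer configuration for the classification case is given below; the parameter-group handles are assumed, and the stage alternation itself is realized by freezing groups as sketched in "Approach overview".

```python
import torch

def build_optimizers(net_params, se_params, policy_logits):
    """Optimizers mirroring the settings above: SGD for network and attention
    parameters, Adam with a constant learning rate for the policy logits."""
    opt_net = torch.optim.SGD(list(net_params) + list(se_params),
                              lr=0.1, momentum=0.9, weight_decay=5e-4)
    opt_policy = torch.optim.Adam(list(policy_logits), lr=0.01)
    # "divided by 5 at epochs 60, 120 and 160" -> gamma = 0.2
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        opt_net, milestones=[60, 120, 160], gamma=0.2)
    return opt_net, opt_policy, scheduler
```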
For segmentation, the learning rate of the network parameters is set to a constant 0.001 with a weight decay of 0.0001 and a batch size of 8, using 50 epochs for warm-up and 50 epochs for optimization. \(\lambda _1\) and \(\lambda _2\) are set to 0.01 and 0.1, respectively. The total number of re-training epochs is 300. \(\tau \) is also initialized as 5.
At training time, we randomly split the original training database into two sub-training databases, one for each of the two stages.
We compare DCP-A with existing state-of-the-art pruning algorithms, namely, MIL [7], PFEC [21], SFP [13], FPGM [14], TAS [9], MetaPruning [30], ChannelSelection [15], LSTM-SEP [6] and MFIS [4]. These include both NAS-based methods and criterion-based hierarchical pruning methods.
Table 1 Different guidance on DCP

| Depth | Method | Acc. (%) | Acc. \(\downarrow \) (%) | FLOPs \(\downarrow \) (%) |
|-------|--------|----------|--------------------------|---------------------------|
| 56 | DCP-WOL | 93.12 | 0.47 | 53.2 |
| 56 | DCP-L2-norm | 93.28 | 0.31 | 53.4 |
| 56 | DCP-L1-norm | 93.34 | 0.25 | 53.1 |
| 56 | DCP-A | 93.56 | 0.03 | 53.9 |

Bold indicates the best results of compressed models under similar compression ratios
Table 2 Comparison of pruning ResNet on CIFAR-10

| Depth | Method | Acc. (%) | Acc. \(\downarrow \) (%) | FLOPs \(\downarrow \) (%) |
|-------|--------|----------|--------------------------|---------------------------|
| 32 | SFP [13] | 92.08 | 0.55 | 41.5 |
| 32 | MFIS [4] | 92.45 | 0.18 | 41.5 |
| 32 | Ours | 92.43 | 0.20 | 47.0 |
| 32 | TAS [9] | 93.16 | 0.73 | 49.4 |
| 32 | LFPC [12] | 92.12 | 0.51 | 52.6 |
| 32 | FPGM [14] | 92.82 | -0.19 | 53.2 |
| 32 | MFIS [4] | 92.14 | 0.49 | 53.2 |
| 32 | Ours | 92.37 | 0.26 | 55.4 |
| 56 | PFEC [21] | 91.31 | 1.75 | 27.6 |
| 56 | Ours | 93.58 | 0.01 | 44.7 |
| 56 | CS [15] | 93.31 | 0.40 | 50.0 |
| 56 | SFP [13] | 92.26 | 1.33 | 52.6 |
| 56 | FPGM [14] | 92.89 | 0.70 | 52.6 |
| 56 | MFIS [4] | 93.27 | 0.32 | 52.6 |
| 56 | TAS [9] | 93.69 | 0.77 | 52.7 |
| 56 | LFPC [12] | 93.34 | 0.25 | 52.9 |
| 56 | Ours | 93.56 | 0.03 | 53.9 |
| 110 | PFEC [21] | 92.94 | 0.61 | 38.6 |
| 110 | SFP [13] | 93.38 | 0.30 | 40.8 |
| 110 | Ours | 94.20 | -0.52 | 42.7 |
| 110 | FPGM [14] | 93.85 | -0.17 | 52.3 |
| 110 | MFIS [4] | 94.01 | -0.33 | 52.3 |
| 110 | TAS [9] | 94.33 | 0.64 | 53.0 |
| 110 | LFPC [12] | 93.79 | -0.11 | 60.3 |
| 110 | Ours | 94.24 | -0.56 | 55.5 |

Bold indicates the best results of compressed models under similar compression ratios
Table 3 Comparison of pruning ResNet on CIFAR-100

| Depth | Method | Acc. (%) | Acc. \(\downarrow \) (%) | FLOPs \(\downarrow \) (%) |
|-------|--------|----------|--------------------------|---------------------------|
| 56 | MIL [7] | 68.37 | 2.96 | 39.3 |
| 56 | Ours | 71.30 | 0.11 | 40.0 |
| 56 | TAS [9] | 72.25 | 0.93 | 51.3 |
| 56 | LFPC [12] | 70.83 | 0.58 | 51.6 |
| 56 | SFP [13] | 68.79 | 2.61 | 52.6 |
| 56 | FPGM [14] | 69.66 | 1.75 | 52.6 |
| 56 | Ours | 71.07 | 0.34 | 52.6 |

Bold indicates the best results of compressed models under similar compression ratios
Table 4 Comparison of pruning ResNet-18 on NYU-v2 (semantic segmentation; higher is better)

| Model | FLOPs \(\downarrow \) (%) | mIoU | Pixel Acc | \(\Delta \mathcal {T}\) (%) |
|-------|---------------------------|------|-----------|------------------------------|
| Uniform baseline | – | 26.6 | 57.9 | – |
| Uniform baseline | 49.7 | 25.0 | 57.0 | -3.8 |
| Ours | 50.7 | 27.1 | 58.6 | 1.5 |
| Uniform baseline | 60.2 | 26.0 | 59.0 | -0.2 |
| Ours | 60.2 | 26.6 | 58.4 | 0.4 |
| Uniform baseline | 69.7 | 25.7 | 57.3 | -2.2 |
| Ours | 70.0 | 26.2 | 58.7 | -0.1 |

Bold indicates the best results of compressed models under similar compression ratios ('–' denotes the unpruned baseline entries)
Table 5 Comparison of pruning ResNet-50 on ILSVRC-2012

| Depth | Method | Baseline Top-1 Acc. (%) | Baseline Top-5 Acc. (%) | Pruned Top-1 Acc. (%) | Pruned Top-5 Acc. (%) | Top-1 Acc. \(\downarrow \) (%) | Top-5 Acc. \(\downarrow \) (%) | FLOPs \(\downarrow \) (%) |
|-------|--------|-------------------------|-------------------------|-----------------------|-----------------------|--------------------------------|--------------------------------|---------------------------|
| 50 | SFP [13] | 76.15 | 92.87 | 62.14 | 84.60 | 14.01 | 8.27 | 41.8 |
| 50 | LSTM-SEP [6] | 76.12 | 93.00 | – | – | 0.90 | 0.27 | 43.0 |
| 50 | TAS [9] | 76.20 | 93.07 | – | – | 1.26 | 0.48 | 43.5 |
| 50 | Ours | 76.15 | 92.87 | 75.66 | 92.51 | 0.49 | 0.36 | 50.9 |
| 50 | MetaPruning [30] | 76.60 | – | 75.40 | – | 1.20 | – | 51.2 |
| 50 | CS [15] | 76.13 | – | 75.56 | – | 0.56 | 0.36 | 51.3 |
| 50 | FPGM [14] | 76.15 | 92.87 | 74.83 | 92.32 | 1.32 | 0.55 | 53.5 |
| 50 | MFIS [4] | 76.15 | 92.87 | 75.23 | 92.50 | 0.92 | 0.37 | 53.5 |
| 50 | LFPC [12] | 76.15 | 92.87 | 74.46 | 92.04 | 1.69 | 0.83 | 60.8 |
| 50 | Ours | 76.15 | 92.87 | 74.61 | 92.18 | 1.54 | 0.69 | 60.9 |

Bold indicates the best results of compressed models under similar compression ratios ('–' denotes a value not reported in the original)
Table 6 Comparison of pruning ResNet on CIFAR-10 and CIFAR-100

| Depth | Method | Acc. (%) CIFAR-10 | Acc. (%) CIFAR-100 | Acc. \(\downarrow \) (%) CIFAR-10 | Acc. \(\downarrow \) (%) CIFAR-100 | Acc. \(\downarrow \) (%) ALL | FLOPs \(\downarrow \) (%) CIFAR-10 | FLOPs \(\downarrow \) (%) CIFAR-100 |
|-------|--------|-------------------|--------------------|-----------------------------------|------------------------------------|------------------------------|------------------------------------|--------------------------------------|
| 32 | FPGM [14] | 92.75 | 69.44 | 0.43 | 0.84 | – | 41.5 | 41.5 |
| 32 | MFIS [4] | 92.96 | 70.12 | 0.22 | 0.16 | 0.19 | 41.5 | 41.5 |
| 32 | Ours | 91.02 | 75.47 | 1.61 | -5.19 | -1.79 | 42.6 | 42.1 |
| 56 | FPGM [14] | 93.55 | 70.51 | 0.21 | 1.28 | – | 52.6 | 52.6 |
| 56 | MFIS [4] | 93.60 | 71.16 | 0.16 | 0.63 | 0.40 | 52.6 | 52.6 |
| 56 | Ours | 91.67 | 76.78 | 1.92 | -4.52 | -1.30 | 54.4 | 54.6 |
| 110 | FPGM [14] | 94.22 | 72.80 | -0.17 | 1.08 | – | 52.3 | 52.3 |
| 110 | MFIS [4] | 94.22 | 73.04 | -0.17 | 0.84 | 0.34 | 52.3 | 52.3 |
| 110 | Ours | 92.50 | 76.48 | 1.18 | -2.60 | -0.71 | 55.9 | 51.5 |

Bold indicates the best results of compressed models under similar compression ratios ('–' is absent because FPGM is run on each database separately, so no 'ALL' value is given)

Different guidance

In DCP-A, the attention score provides layer information for optimizing the policy logits. For comparison, we test DCP with other sources of layer information as well as without layer information. As exhibited in Table 1, DCP-WOL denotes DCP without layer information, while DCP-L1-norm and DCP-L2-norm replace the attention score \(S_l\) with the \(l_1\) and \(l_2\) norms of the weights, respectively. The results show that layer information has a positive impact on network performance (DCP-L1-norm and DCP-L2-norm versus DCP-WOL). Moreover, DCP with attention (DCP-A) performs best because the attention mechanism learns better layer information.

Pruning on CIFAR-10

The ResNet variant designed for CIFAR consists of basic blocks, and, as mentioned, we use the same policy logit for layers in the same block.
We test DCP-A on ResNet with depths 32, 56 and 110 on CIFAR-10 and compare the results with state-of-the-art methods in Table 2, targeting FLOPs reductions of \(45\%\) and \(55\%\) for our method. The experimental results validate the effectiveness of the developed method: DCP-A achieves better performance with larger FLOPs reductions in almost all situations. Specifically, for depth 56, our method shows an accuracy drop of 0.03%, which is better than the 0.25% of LFPC, while the acceleration rate of our method is 1.0% higher than that of MFIS. Likewise, better results and higher FLOPs reductions are obtained at the other depths. For depth 32, MFIS performs slightly better (by 0.02%) while DCP-A earns a larger FLOPs reduction (by 6.5%). For depth 110, although LFPC reduces more FLOPs (by 4.8%), DCP-A improves the accuracy by 0.56%, compared with the 0.11% improvement of LFPC.

Pruning on CIFAR-100

We also provide similar experiments on CIFAR-100 with ResNet-56 and show the results in Table 3. It can be seen that DCP-A achieves better results than the other methods under both the 40% and 50% FLOPs-reduction constraints, which again validates the effectiveness of our method.

Pruning on semantic segmentation

We test DCP-A for the semantic segmentation application on the NYU-v2 database. Deeplab-ResNet [2] with atrous convolution is used as the baseline network. For comparison, we apply uniform pruning with different FLOPs limitations as uniform baselines. As shown in Table 4, DCP-A also outperforms the uniform baselines on semantic segmentation.

Pruning on ImageNet

The proposed framework is then tested on ILSVRC-2012 with ResNet-50, which has a standard bottleneck block; we use the same policy for the input and middle layers, and a new policy for the output layer. The results are reported in Table 5 and compared with state-of-the-art methods.

Pruning in multi-model

Finally, the proposed framework is tested in the multi-database case (CIFAR-10 and CIFAR-100) and compared with FPGM and MFIS. Note that FPGM is performed on CIFAR-10 and CIFAR-100 separately. MFIS is a multi-task pruning method, and its multi-task pruning results are adopted for comparison. As shown in Table 6, DCP-A performs better on CIFAR-100 while MFIS gives better results on CIFAR-10. Specifically, for depth 32, although MFIS shows the best accuracy drop of 0.22% on CIFAR-10, DCP-A improves the accuracy by 5.19% on CIFAR-100, which is much better than MFIS and FPGM. 'ALL' shows the average accuracy decline over both databases, and the proposed DCP-A achieves the best 'ALL' results at all depths.

Pruned result visualization

Our approach designs the pruned network automatically, and experiments on several databases have demonstrated the effectiveness of DCP-A. Next, we are interested in the learned pruning results, whose details are exhibited below to further explore our method.

Attention score guidance

Figure 8 shows the attention score and policy logit in one layer, where both are normalized into [0, 1] to exhibit the variation tendency clearly. We can see that the two lines maintain a similar trend across channels, as can be clearly observed in the dashed boxes. The pruned filters in the same layer are therefore similar whether the attention score or the policy logit is used as the pruning criterion. Hence, the decision of pruning filters is affected by the attention-score-guided loss, which means the attention score acts as guidance for optimizing the policy logits during DCP-A training.

Policy logits

Figure 4 illustrates why we do not directly use the attention score as a pruning criterion. For comparison, the policy logit distributions of the same layers are presented in Fig. 9. Clearly, the policy logits can be utilized for pruning: for example, if we set the threshold to 0.6, each layer prunes a proper number of filters under this constraint.

Pruned results with different limitations

In the following, we show that DCP-A does not require repeated optimization of the policy logits under different FLOPs constraints. As exhibited in Fig. 10, the top panel shows the accuracies of pruned networks before re-training under different FLOPs constraints. To verify that the pruned network structures lead to good performance, we re-train 5 pruned networks under FLOPs-reduction constraints from \(40\%\) to \(60\%\) and show the results in the bottom panel. All the pruned networks achieve acceptable performance under the different limitations.

Pruned network structure

The pruned network structures for ResNet-56 and MobileNet v2 [40] under different FLOPs limitations are exhibited in Figs. 11 and 12. We can observe that significant peaks exist in the pruned network where there is a down-sampling operation with a stride-2 depth-wise convolution. Such a phenomenon also occurs in MetaPruning [30] when pruning MobileNet, mainly because the network tries to make up for the loss of information caused by the reduced resolution of the feature maps. This shows that our DCP-A can learn an explainable policy for the network architecture.

Skip block

Skipped blocks emerge when pruning the network, as shown in Fig. 13. According to the learned pruning policy, all filters in the 3rd and 7th blocks are pruned and only the shortcuts are reserved, which is equivalent to skipping these two blocks. Hence, DCP-A can shrink the network structure not only in width but also in depth when pruning a network with shortcuts.

Multi-model pruning results

Figure 14 illustrates the multi-model pruning results of ResNet-56 on CIFAR-10 and CIFAR-100. The number of shared filters is counted as well as the number of filters in each single model. DCP-A can adaptively design the shared structure of the models; specifically, there is no filter sharing in the 2nd, 3rd and 11th modules.

Conclusion

In this paper, a new differentiable channel pruning framework guided via an attention mechanism has been proposed and verified with experiments. The attention mechanism has been adopted as guidance to provide layer information for policy optimization. The training process is differentiable with Gumbel-softmax sampling, and a two-stage training procedure has been proposed to optimize the network parameters, policy logits and attention modules alternately. A special design has been provided for network blocks with shortcuts, showing that protecting the shortcut helps DCP-A prune the network not only in width but also in depth. Detailed analysis has been given with pruned-model visualization. Limitations also exist in the proposed method: more guidance mechanisms can be considered in addition to attention guidance, and DCP-A can be extended to multi-task pruning. In the future, we will (1) consider more guidance mechanisms with layer information [32, 46, 47, 49, 56, 57, 66], (2) introduce control strategies to enhance the model robustness [3, 27, 33, 50, 51, 58], and (3) extend our approach to other complicated multi-task learning problems [1, 23, 34, 59, 65, 67].

Acknowledgements

The Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia has funded this project, under grant no. (RG-2-611-43). This work was supported in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant 820776 (INTEGRADDE), the Royal Society of the UK, and the Alexander von Humboldt Foundation of Germany.

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Bao G, Ma L, Yi X (2022) Recent advances on cooperative control of heterogeneous multi-agent systems subject to constraints: a survey. Syst Sci Control Eng 10(1):539–551
2. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848
3. Chen Y, Ma K, Dong R (2022) Dynamic anti-windup design for linear systems with time-varying state delay and input saturations. Int J Syst Sci 53(10):2165–2179
4. Cheng H, Wang Z, Wei Z, Ma L, Liu X (2021) Multi-task pruning via filter index sharing: a many-objective optimization approach. Cogn Comput 13:1070–1084
5. Cheng H, Wang Z, Wei Z, Ma L, Liu X (2022) On adaptive learning framework for deep weighted sparse autoencoder: a multiobjective evolutionary algorithm. IEEE Trans Cybern 52(5):3221–3231
6. Ding G, Zhang S, Jia Z, Zhong J, Han J (2021) Where to prune: using LSTM to guide data-dependent soft pruning. IEEE Trans Image Process 30:293–304
7. Dong X, Chen S, Pan SJ (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In: Advances in neural information processing systems (NIPS), pp 4857–4867
8. Dong X, Huang J, Yang Y, Yan S (2017) More is less: a more complicated network with less inference complexity. In: Conference on computer vision and pattern recognition (CVPR), Jul 2017
9. Dong X, Yang Y (2019) Network pruning via transformable architecture search. In: Advances in neural information processing systems (NIPS), pp 759–770
10. Guo Y, Yao A, Chen Y (2016) Dynamic network surgery for efficient DNNs. In: Advances in neural information processing systems (NIPS), pp 1379–1387
11. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Conference on computer vision and pattern recognition (CVPR), Jun 2016
12. He Y, Ding Y, Liu P, Zhu L, Zhang H, Yang Y (2020) Learning filter pruning criteria for deep convolutional neural networks acceleration. In: Conference on computer vision and pattern recognition (CVPR), Jun 2020
13. He Y, Kang G, Dong X, Fu Y, Yang Y (2018) Soft filter pruning for accelerating deep convolutional neural networks. In: Proceedings of international joint conference on artificial intelligence (IJCAI), Jul 2018
14. He Y, Liu P, Wang Z, Hu Z, Yang Y (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Conference on computer vision and pattern recognition (CVPR), Jun 2019
15. Herrmann C, Bowen RS, Zabih R (2020) Channel selection using Gumbel softmax. In: Computer vision–ECCV, pp 241–257
16. Hu J, Shen L, Albanie S, Sun G, Wu E (2020) Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 42(8):2011–2023
17. Jang E, Gu S, Poole B (2017) Categorical reparameterization with Gumbel-softmax. In: International conference on learning representations (ICLR)
18. Ji D, Wang C, Li J, Dong H (2021) A review: data driven-based fault diagnosis and RUL prediction of petroleum machinery and equipment. Syst Sci Control Eng 9(1):724–747
19. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images
20. Kusupati A, Ramanujan V, Somani R, Wortsman M, Jain P, Kakade SM, Farhadi A (2020) Soft threshold weight reparameterization for learnable sparsity. In: Proceedings of international conference on machine learning (ICML), vol 119, pp 5544–5555
21. Li H, Kadav A, Durdanovic I, Samet H, Graf HP (2017) Pruning filters for efficient convnets. In: International conference on learning representations (ICLR)
22. Li W, Niu Y, Cao Z (2022) Event-triggered sliding mode control for multi-agent systems subject to channel fading. Int J Syst Sci 53(6):1233–1244
23. Li X, Song Q, Liu Y, Alsaadi FE (2022) Nash equilibrium and bang-bang property for the non-zero-sum differential game of multi-player uncertain systems with Hurwicz criterion. Int J Syst Sci 53(10):2207–2218
24. Li X, Song Q, Zhao Z, Liu Y, Alsaadi FE (2022) Optimal control and zero-sum differential game for Hurwicz model considering singular systems with multifactor and uncertainty. Int J Syst Sci 53(7):1416–1435
25. Li X, Zhou Y, Pan Z, Feng J (2019) Partial order pruning: for best speed/accuracy trade-off in neural architecture search. In: Conference on computer vision and pattern recognition (CVPR), Jun 2019
26. Li Y, Gu S, Zhang K, Gool LV, Timofte R (2020) DHP: differentiable meta pruning via HyperNetworks. In: Computer vision–ECCV, pp 608–624
27. Li Z, Hu J, Li J (2021) Distributed filtering for delayed nonlinear system with random sensor saturation: a dynamic event-triggered approach. Syst Sci Control Eng 9(1):440–454
28. Lin M, Ji R, Zhang Y, Zhang B, Wu Y, Tian Y (2020) Channel pruning via automatic structure search. In: Proceedings of international joint conference on artificial intelligence (IJCAI), Jul 2020
29. Liu H, Simonyan K, Yang Y (2019) DARTS: differentiable architecture search. In: International conference on learning representations (ICLR)
30. Liu Z, Mu H, Zhang X, Guo Z, Yang X, Cheng K-T, Sun J (2019) MetaPruning: meta learning for automatic neural network channel pruning. In: International conference on computer vision (ICCV), Oct 2019
31. Lu P, Song B, Xu L (2021) Human face recognition based on convolutional neural network and augmented dataset. Syst Sci Control Eng 9(s2):29–37
32. Luo X, Wu H, Wang Z, Wang J, Meng D (2022) A novel approach to large-scale dynamically weighted directed network representation. IEEE Trans Pattern Anal Mach Intell 44(12):9756–9773
33. Luo X, Yuan Y, Chen S, Zeng N, Wang Z (2022) Position-transitional particle swarm optimization-incorporated latent factor analysis. IEEE Trans Knowl Data Eng 34(8):3958–3970
35. Lyu K, Li Y, Zhang Z (2020) Attention-aware multi-task convolutional neural networks. IEEE Trans Image Process 29:1867–1878
36. Maninis K-K, Radosavovic I, Kokkinos I (2019) Attentive single-tasking of multiple tasks. In: Conference on computer vision and pattern recognition (CVPR), Jun 2019
37. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch
38. Qu F, Zhao X, Wang X, Tian E (2022) Probabilistic-constrained distributed fusion filtering for a class of time-varying systems over sensor networks: a torus-event-triggering mechanism. Int J Syst Sci 53(6):1288–1297
39. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
40. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C (2018) MobileNetV2: inverted residuals and linear bottlenecks. In: Conference on computer vision and pattern recognition (CVPR), Jun 2018
41. Shakiba FM, Shojaee M, Azizi SM, Zhou M (2022) Real-time sensing and fault diagnosis for transmission lines. Int J Netw Dyn Intell 1(1):36–47
42. Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from RGBD images. In: Computer vision–ECCV, vol 7576, pp 746–760
43. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations (ICLR)
44. Sun X, Panda R, Feris R, Saenko K (2020) AdaShare: learning what to share for efficient deep multi-task learning. In: Advances in neural information processing systems (NIPS)
45. Szankin M, Kwasniewska A (2022) Can AI see bias in X-ray images? Int J Netw Dyn Intell 1(1):48–64
46. Su Y, Cai H, Huang J (2022) The cooperative output regulation by the distributed observer approach. Int J Netw Dyn Intell 1(1):20–35
47. Tao H, Tan H, Chen Q, Liu H, Hu J (2022) \(H_{\infty }\) state estimation for memristive neural networks with randomly occurring DoS attacks. Syst Sci Control Eng 10(1):154–165
48. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems (NIPS), pp 5998–6008
49. Wang L, Liu S, Zhang Y, Ding D, Yi X (2022) Non-fragile \(l_{2}\)-\(l_{\infty }\) state estimation for time-delayed artificial neural networks: an adaptive event-triggered approach. Int J Syst Sci 53(10):2247–2259
50. Wang M, Wang H, Zheng H (2022) A mini review of node centrality metrics in biological networks. Int J Netw Dyn Intell 1(1):99–110
51. Wang X, Sun Y, Ding D (2022) Adaptive dynamic programming for networked control systems under communication constraints: a survey of trends and techniques. Int J Netw Dyn Intell 1(1):85–98
52. Wang XJ, Yao W, Fu H (2019) A convolutional neural network pruning method based on attention mechanism. In: Proceedings of international conference on software engineering and knowledge engineering, Jul 2019
53. Wu B, Keutzer K, Dai X, Zhang P, Wang Y, Sun F, Wu Y, Tian Y, Vajda P, Jia Y (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In: Conference on computer vision and pattern recognition (CVPR), Jun 2019
54. Xie S, Zheng H, Liu C, Lin L (2019) SNAS: stochastic neural architecture search. In: International conference on learning representations (ICLR)
55. Xu L, Song B, Cao M (2021) A new approach to optimal smooth path planning of mobile robots with continuous-curvature constraint. Syst Sci Control Eng 9(1):138–149
56. Yang J, Ma L, Chen Y, Yi X (2022) \(L_{2}\)-\(L_{\infty }\) state estimation for continuous stochastic delayed neural networks via memory event-triggering strategy. Int J Syst Sci 53(13):2742–2757
57. Yao F, Ding Y, Hong S, Yang S-H (2022) A survey on evolved LoRa-based communication technologies for emerging internet of things applications. Int J Netw Dyn Intell 1(1):4–19
58. Yu H, Hu J, Song B, Liu H, Yi X (2022) Resilient energy-to-peak filtering for linear parameter-varying systems under random access protocol. Int J Syst Sci 53(11):2421–2436
59. Yu L, Cui Y, Liu Y, Alotaibi ND, Alsaadi FE (2022) Sampled-based consensus of multi-agent systems with bounded distributed time-delays and dynamic quantisation effects. Int J Syst Sci 53(11):2390–2406
60. Yu N, Yang R, Huang M (2022) Deep common spatial pattern based motor imagery classification with improved objective function. Int J Netw Dyn Intell 1(1):73–84
61. Yu R, Li A, Chen C-F, Lai J-H, Morariu VI, Han X, Gao M, Lin C-Y, Davis LS (2018) NISP: pruning networks using neuron importance score propagation. In: Conference on computer vision and pattern recognition (CVPR), Jun 2018
62. Yuan Y, Ma G, Cheng C, Zhou B, Zhao H, Zhang H-T, Ding H (2020) A general end-to-end diagnosis framework for manufacturing systems. Natl Sci Rev 7(2):418–429
63. Yuan Y, Tang X, Zhou W, Pan W, Li X, Zhang H-T, Ding H, Goncalves J (2019) Data driven discovery of cyber physical systems. Nat Commun 10(1):1–9
64. Yuan Y, Zhang H, Wu Y, Zhu T, Ding H (2016) Bayesian learning-based model-predictive vibration control for thin-walled workpiece machining processes. IEEE/ASME Trans Mechatron 22(1):509–520
65. Zhang Q, Zhou Y (2022) Recent advances in non-Gaussian stochastic systems control theory and its applications. Int J Netw Dyn Intell 1(1):111–119
66. Zhao G, Li Y, Xu Q (2022) From emotion AI to cognitive AI. Int J Netw Dyn Intell 1(1):65–72
67. Zhao Y, He X, Ma L, Liu H (2022) Unbiasedness-constrained least squares state estimation for time-varying systems with missing measurements under round-robin protocol. Int J Syst Sci 53(9):1925–1941
68. Zoph B, Le QV (2017) Neural architecture search with reinforcement learning. In: International conference on learning representations (ICLR)
Metadata
Title: Differentiable channel pruning guided via attention mechanism: a novel neural network pruning approach
Authors: Hanjing Cheng, Zidong Wang, Lifeng Ma, Zhihui Wei, Fawaz E. Alsaadi, Xiaohui Liu
Publication date: 27.03.2023
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems / Issue 5/2023
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-023-01022-6
