
Open Access 20.03.2024 | Original Article

Knowledge distillation based on projector integration and classifier sharing

Authors: Guanpeng Zuo, Chenlu Zhang, Zhe Zheng, Wu Zhang, Ruiqing Wang, Jingqi Lu, Xiu Jin, Zhaohui Jiang, Yuan Rao

Published in: Complex & Intelligent Systems


Abstract

Knowledge distillation transfers knowledge from a pre-trained teacher model to a student model and thus effectively accomplishes model compression. Previous studies have carefully crafted knowledge representations, loss function designs, and distillation locations, but few have examined the role of the classifier in distillation. Previous experience has shown that the final classifier of a model plays an essential role in inference, so this paper attempts to narrow the performance gap between models by having the student model directly use the teacher's classifier for the final inference, which requires an additional projector to match the features of the student encoder to the teacher's classifier. However, a single projector cannot fully align the features, and integrating multiple projectors may result in better performance. Considering the balance between projector size and performance, we obtain through experiments the projector sizes for different network combinations and propose a simple method for projector integration. In this way, the student model undergoes feature projection and then uses the teacher's classifier for inference, obtaining performance similar to that of the teacher model. Through extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets, we show that our approach applies to various teacher–student frameworks simply and effectively.

Introduction

Deep neural networks have achieved notable results in recent years in image classification [1–3], object detection [4–6], and natural language processing (NLP) [7–9], among other application areas. On the other hand, improving network performance often means using deeper and larger models with more parameters [11]. Because edge devices have limited computational power, such state-of-the-art models are hard to deploy efficiently on them [12]. Methods such as pruning [13], quantization [14], distillation [15], and low-rank decomposition [16] have been developed to address this issue, in the hope of making models smaller without loss of accuracy.
The key idea behind knowledge distillation is to transfer information from large models, or ensembles of models, to smaller student models. Current distillation methods are categorized as response-based, feature-based, and relation-based [17]. Response-based distillation improves student performance by emulating the teacher's final prediction, which is simple and efficient but ignores the information in the teacher's intermediate layers [15]. Feature-based approaches improve student performance by emulating the output of intermediate or final layers, which provides a clear direction for improvement [18]. Previous research has proposed various methods for extracting knowledge from intermediate layers. These methods carefully design the knowledge representation so that student performance is continually improved, but balancing multiple pieces of knowledge requires complex hyperparameter tuning, and the increasing complexity of the knowledge makes it difficult to explain student performance [18–23]. For the last layer, a recent study by Baruch et al. [25] showed that sharing the classifier between the teacher and student models can exploit the dark knowledge within it and achieve competitive results. However, the feature dimensions of the student network are not always the same as those of the teacher network, so sharing a classifier requires projector-based feature projection to resolve the dimension mismatch. Previous studies have largely neglected the critical role of the projector. Therefore, our study focuses on the feature-matching process that precedes classifier sharing, using a projector integration strategy.
In this paper, student inference is performed using the discriminative classifier of a pre-trained teacher model in anticipation of improved model performance. However, because the feature dimensions of the teacher and student models usually differ, a simple projector is added after the student feature encoder for dimensional matching. According to ensemble learning theory [25–27], ensembling plays a pivotal role in fostering model generalization, and successful applications in fault detection have demonstrated the effectiveness of integration strategies [28]. At the same time, projectors with different initializations produce different feature transformations. We therefore devise a simple projector integration method to further enhance the performance of the student model. Figure 1 illustrates the distillation framework of this paper. Not only does our proposed method lead to better performance, but it also eliminates the need for complex knowledge representations and hyperparameter tuning. Extensive experiments show that the performance of the student model is significantly improved by using the teacher classifier and that performance continues to improve as the number of integrated projectors increases.
The main contributions of the present paper are as follows.
1.
This paper proposes a generalized distillation method that has only a single loss term and does not require hyperparameter tuning to balance multiple losses;
 
2.
A simple and effective projector integration method is designed for feature alignment, and student performance can be consistently improved in conjunction with increasing the number of integrated projections. The size of the projector parameters and the size of the final classifier in different models are also investigated;
 
3.
We have conducted a large number of experiments on the CIFAR-100 and Tiny-ImageNet datasets, and the method proposed in this paper is always competitive in terms of both accuracy and convergence speed compared with the latest methods.
 
Related work

Knowledge distillation (KD) is an essential model compression method, since it allows smaller and faster models to retain accuracy close to that of larger models. Hinton et al. [15] were the first to transfer knowledge from large, complex models (teachers) to small, simple models (students). As shown in Fig. 2a, during training the soft predictions of the teacher model are used as an additional training objective to guide the training of the student model, improving the student's performance.
The Vanilla Knowledge Distillation method can be described as follows. We first define the notation used in this section: given a one-hot label \(y\) and training samples \(x\) from an \(N\)-class classification dataset, the encoded features of the second-to-last layer of the student model are represented as \(f^{s} = \{ s_{1} ,s_{2} ,s_{3} ,...,s_{i} ,...s_{b} \} \in R^{d \times b}\), where \(d\) and \(b\) are the student's feature dimension and the batch size. The feature \(f^{s}\) is passed to the classifier with weights \(W^{s} \in R^{N \times d}\) to obtain logits \(g^{s} = W^{s} f^{s} \in R^{N}\) and the classification prediction \(p^{s} = \sigma (g^{s} /T) \in R^{N}\), where σ(·) is the softmax function at temperature \(T\). The corresponding teacher features are \(f^{t} = \left\{ {t_{1} ,t_{2} ,t_{3} ,...,t_{i} ,...t_{b} } \right\} \in R^{m \times b}\), where \(m\) is the feature dimension of the teacher [15].
$$ L_{KD} = L_{CE} (y,p^{s} ) + T^{2} L_{KL} (p^{t} ,p^{s} ) $$
(1)
As shown in Eq. (1), the loss function used in vanilla knowledge distillation consists of two parts. The first part is the traditional cross-entropy loss \(L_{CE}\), which measures the discrepancy between the student model's output and the true labels and teaches the student its predictive capability; here the temperature \(T\) is 1. The second part is \(L_{KL}\), the additional supervisory signal contributed by the soft targets of the teacher model: these probability distributions over the teacher's outputs may provide richer information and help the student model learn more, and here the temperature \(T\) is typically greater than 1 [14, 35].
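For concreteness, a minimal PyTorch-style sketch of the vanilla KD loss in Eq. (1) is given below; the function name, tensor names, and temperature value are illustrative assumptions rather than the authors' code.

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0):
    """Eq. (1): cross-entropy on hard labels plus T^2-scaled KL divergence on softened predictions."""
    ce = F.cross_entropy(student_logits, labels)            # L_CE, computed at temperature 1
    p_t = F.softmax(teacher_logits / T, dim=1)              # softened teacher prediction p^t
    log_p_s = F.log_softmax(student_logits / T, dim=1)      # softened student prediction (log-space)
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean")      # L_KL(p^t, p^s)
    return ce + (T ** 2) * kl
```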
Hinton's success has led to various logit-based improvements. Because response-based knowledge relies only on the output of the last layer, feature-based knowledge from the intermediate layers is a natural extension of logit-based knowledge [17]. To further combat performance degradation in distillation and make KD more practical for model compression, more information from the intermediate layers has begun to be used.
FitNets [18] minimizes the L2 distance between the student and teacher feature maps generated by an intermediate layer of the network. The intermediate output of the teacher's feature extractor is used as a hint for the student, realizing feature distillation for the first time. Several studies have further optimized FitNets-style knowledge extraction [23, 29]. Subsequent studies have devised different knowledge representations to transfer knowledge, such as sample relations encoded by pairwise similarity matrices [21, 30], the maximization of mutual information [18], or representations modeled by contrastive learning [20]. To make full use of the intermediate features of teacher models, recent approaches focus on feature association [31, 32]. Figure 2b shows the structure of these methods; although they elaborate the knowledge representation, there is no unified account of how the introduced knowledge positively affects the student model, and the iterative hyperparameter tuning complicates the distillation process. Furthermore, these studies have shown that the last feature in the network is better suited for distillation [21, 32]. One possible explanation is that the last feature representation is directly connected to the model's classifier and therefore directly affects the model's performance [32].
Projection techniques have been introduced in some feature distillation methods, as shown in Fig. 2c, where a simple 1 × 1 convolutional kernel or fully connected layer transforms the features after the student encoder [21, 32]. In these methods, however, the projector is treated as an auxiliary component rather than a key contributor to their success. A few studies, such as Factor Transfer (FT) [34] and Overhaul of Feature Distillation (OFD) [35], attempted to improve the projector but failed to produce competitive results. Thus, the critical role of projection has been overlooked for some time. We also note that when a model must handle multiple tasks with different data distributions, the essential operation is to freeze the shallow, shared layers as an encoder and refine the final classifier. Therefore, the discriminative information contained in the teacher's classifier is just as important as the feature-extraction part of the encoder. Moreover, the success of the SH-KD method in HeadSharingKD [25] provides evidence that introducing the teacher classifier into the student model is effective. Drawing on HeadSharingKD and ensemble learning, this paper proposes a simple distillation framework [21, 25–28].
Figure 1 shows the distillation framework proposed in this paper, which improves upon the original distillation framework through two key operations: Teacher-Classifier-Share (abbreviated TCS) and projector integration. TCS replaces the classifier of the student model with the classifier of the more powerful teacher model; for the student to use this better-performing classifier, a projector for feature alignment must be added after the student's encoder. The more similar the student's features are to the teacher's, the smaller the gap between their performances will be, so projector quality is crucial to the method proposed in this paper. Because differently initialized projectors provide different feature transformations, we propose the projector integration method, which further improves the projection and thus the performance of the student model.

The proposed method

Improved distillation with projector integration

To enable the student to perform inference with the teacher's classifier, a projector \(Proj( \cdot )\) is needed to convert the student or teacher features. This is usually done on the student side: if the projector were instead applied to the teacher model to align it with the student, the teacher's original, richly discriminative features would be distorted. We therefore define the projector as \(Proj(s_{i} ) = \tau (Ws_{i} )\), where \(\tau ( \cdot )\) is the \({\text{ReLU}}\) function and \(W \in R^{m \times d}\) is a weight matrix. Consider first SRRL, where feature-based and logit-based losses are combined to improve distillation performance; however, hyperparameter tuning has an enormous impact on the distillation results, and the additional tuning of coefficients increases the computational cost. To reduce the complexity of distillation, we employ the simple Direction Alignment (DA) loss [23, 36, 37]:
$$ L_{DA} = \frac{1}{2b}\sum\limits_{i = 1}^{b} {\left\| {\frac{{{\text{Proj}}(s_{i} )}}{{\left\| {{\text{Proj}}(s_{i} )} \right\|_{2} }} - \frac{{t_{i} }}{{\left\| {t_{i} } \right\|_{2} }}} \right\|}_{2}^{2} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\frac{{\left\langle {{\text{Proj}}(s_{i} ),t_{i} } \right\rangle }}{{\left\| {{\text{Proj}}(s_{i} )} \right\|_{2} \left\| {t_{i} } \right\|_{2} }}} $$
(2)
where \(\left\| \cdot \right\|_{2}\) denotes the L2-norm and \(\left\langle { \cdot , \cdot } \right\rangle\) denotes the inner product of two vectors. However, differently initialized projectors provide different transformed features, and integrating multiple projectors should, in theory, provide a better transformation. Moreover, because the ReLU in the projector enables non-linear feature extraction, the projected student features may contain zeros, whereas teacher features produced by the average pooling commonly used in CNNs cannot be zero; integrating multiple projectors effectively avoids this problem. To verify the above statements, we integrate the projectors using the following method:
$$ {\text{Proj}}_{{\text{Int}}} (s_{i} ) = \frac{1}{q}\sum\limits_{K = 1}^{q} {\text{Proj}}_{K} (s_{i} ) $$
(3)
where \(q\) is the number of projectors and \({\text{Proj}}_{K} ( \cdot )\) denotes the transformed features of the \(K\)-th projector. We then measured the difference between student and teacher features when integrating different numbers of projectors as follows:
$$ M_{DA} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\frac{{\left\langle {s_{i} ,t_{i} } \right\rangle }}{{\left\| {s_{i} } \right\|_{2} \left\| {t_{i} } \right\|_{2} }}} $$
(4)
Figure 3a depicts the \(M_{DA}\) results obtained for student models without a projector and for student models integrating different numbers of projectors, using different random seeds for each run. The results show that the \(M_{DA}\) values for students without projectors are significantly lower than those for students equipped with projectors and that the \(M_{DA}\) values gradually increase as the number of projectors increases. On the other hand, we measured the cosine similarity between classes in the student feature space:
$$ M_{BC} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\sum\limits_{j = 1}^{{c_{i} }} {\frac{{\left\langle {s_{i} ,s_{j} } \right\rangle }}{{c_{i} \left\| {s_{i} } \right\|_{2} \left\| {s_{j} } \right\|_{2} }}} } $$
(5)
where \(s_{j}\) is the j-th sample belonging to a class different from that of \(s_{i}\), and \(c_{i}\) is the number of such samples \(s_{j}\) for \(s_{i}\). Figure 3b shows the \(M_{BC}\) results of three runs with different random seeds, integrating different numbers of projectors. The experimental results show that the students' ability to differentiate features gradually increases as the number of integrated projectors increases.
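As an illustration, the two diagnostics in Eqs. (4) and (5) can be computed roughly as in the sketch below, assuming features arrive as a batch tensor together with per-sample class labels; the function and variable names are ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def m_da(f_s, f_t):
    """Eq. (4): 1 minus the mean cosine similarity between student and teacher features.
    Assumes the two feature tensors have the same shape (b, m), e.g. after projection."""
    return 1.0 - F.cosine_similarity(f_s, f_t, dim=1).mean()

def m_bc(f_s, labels):
    """Eq. (5): 1 minus the mean cosine similarity between each sample and samples of other classes."""
    sim = F.cosine_similarity(f_s.unsqueeze(1), f_s.unsqueeze(0), dim=2)   # (b, b) pairwise cosine
    other = labels.unsqueeze(1) != labels.unsqueeze(0)                     # mask of cross-class pairs
    per_sample = (sim * other).sum(dim=1) / other.sum(dim=1).clamp(min=1)  # average over the c_i others
    return 1.0 - per_sample.mean()
```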
In summary, with only a single projector, there is still a gap in the distribution of features between teachers and students; therefore, this paper uses Projector Integration. By introducing projector Integration, the Modified Direction Alignment (MDA) loss is as follows:
$$ L_{{{\text{MDA}}}} = 1 - \frac{1}{b}\sum\limits_{i = 1}^{b} {\frac{{\left\langle {{\text{Proj}}_{{{\text{Int}}}} \left( {s_{i} } \right),t_{i} } \right\rangle }}{{\left\| {{\text{Proj}}_{{{\text{Int}}}} \left( {s_{i} } \right)} \right\|_{2} \left\| {t_{i} } \right\|_{2} }}} . $$
(6)
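A minimal sketch of the integrated projector of Eq. (3) and the MDA loss of Eq. (6), assuming q independently initialized linear + ReLU projectors averaged in the teacher's feature space; the class and argument names are our own, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntegratedProjector(nn.Module):
    """Horizontal integration of q linear + ReLU projectors, averaged as in Eq. (3)."""
    def __init__(self, d_student, d_teacher, q=3):
        super().__init__()
        # Each projector gets its own random initialization
        self.projectors = nn.ModuleList([nn.Linear(d_student, d_teacher) for _ in range(q)])

    def forward(self, s):                                   # s: (b, d_student)
        return torch.stack([F.relu(p(s)) for p in self.projectors], dim=0).mean(dim=0)

def mda_loss(proj_s, f_t):
    """Eq. (6): 1 minus the mean cosine similarity between projected student and teacher features."""
    return 1.0 - F.cosine_similarity(proj_s, f_t, dim=1).mean()
```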

Improved distillation with teacher-classifier-share

One of the most critical operations in our approach is Teacher-Classifier-Share (TCS), which directly borrows a pre-trained teacher classifier for student reasoning instead of training a new one. TH-KD has shown that linking the teacher's classifier to the student's backbone and freezing its parameters makes it easier to characterize the extraction process, leading to improvements [25]. In addition, when a model is asked to handle multiple tasks with different data distributions, a basic approach is to freeze or share some shallow layers as feature extractors. In contrast, the last layer is fine-tuned to learn task-specific information [3840].
In this multitask setting, existing work assumes that task-invariant information can be shared, while task-specific information must be learned separately, usually by the final classifier; similarly, we assume that most task-specific information is contained in the deep layers, and reusing these layers gives direct access to it. Thus, Fig. 1 shows the final distillation architecture, in which we directly borrow the pre-trained teacher classifier for student inference instead of training a new classifier. This removes the need for label information to compute a cross-entropy loss and leaves the feature alignment loss as the sole source of gradients, where \(\alpha\) is a hyperparameter. The details of our method are shown in Algorithm 1.
$$ L_{{{\text{total}}}} = \alpha L_{{{\text{MDA}}}} $$
(7)
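Putting the pieces together, one training step of the method could look roughly like the sketch below (reusing mda_loss from the earlier sketch): only \(L_{\text{MDA}}\) in Eq. (7) produces gradients, and at inference time the student reuses the frozen teacher classifier. The module and function names are illustrative assumptions, not the authors' Algorithm 1 verbatim.

```python
import torch

def train_step(student_encoder, projector, teacher_encoder, optimizer, images, alpha=400.0):
    """One optimization step: the single loss term L_total = alpha * L_MDA (Eq. 7)."""
    with torch.no_grad():
        f_t = teacher_encoder(images)                 # frozen teacher features f^t
    proj_s = projector(student_encoder(images))       # student features mapped into the teacher space
    loss = alpha * mda_loss(proj_s, f_t)              # no label or cross-entropy term is needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(student_encoder, projector, teacher_classifier, images):
    """Teacher-Classifier-Share (TCS): the student reuses the frozen teacher classifier for inference."""
    return teacher_classifier(projector(student_encoder(images)))
```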

Experiments

We performed many experiments to demonstrate our proposed method's efficacy and superiority. “Implementation details” section details the implementation of the experiments as well as the baseline model; in “Main results” section, an experimental comparison using several representative methods on a publicly available dataset demonstrates the superiority of our proposed method. We then experimentally verify the important role of projector integration and Teacher-Classifier-Sharing. “Hyperparameter effect” section discusses the effects of different loss terms and hyperparameters on the distillation results.

Implementation details

Datasets. Two representative benchmark datasets were chosen for performance evaluation: CIFAR-100 [41] and Tiny-ImageNet [42]. The CIFAR-100 dataset contains 50,000 training images and 10,000 test images across 100 classes, with 32 × 32 images; the Tiny-ImageNet dataset contains 100,000 training images and 10,000 validation images across 200 classes, with 64 × 64 images, which we resized to 32 × 32 in our experiments. Standard image augmentation is used for both datasets, and all images are normalized by channel mean and standard deviation.
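For reference, the Tiny-ImageNet preprocessing described above (resizing to 32 × 32 and per-channel normalization) can be expressed with torchvision transforms roughly as follows. The augmentation choices and the normalization statistics shown here are common defaults and are assumptions, not the authors' exact settings.

```python
from torchvision import transforms

# Illustrative normalization statistics; the paper normalizes by channel mean/std but does not list values here
MEAN, STD = (0.4802, 0.4481, 0.3975), (0.2770, 0.2691, 0.2821)

train_transform = transforms.Compose([
    transforms.Resize(32),                  # Tiny-ImageNet images are resized from 64x64 to 32x32
    transforms.RandomCrop(32, padding=4),   # standard augmentation (assumed)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

test_transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])
```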
Baselines. To demonstrate the performance of our proposed method, we chose representative advanced distillation methods of different types for comparison; their details are given below:
KD [15]: This logit-based approach uses the KL divergence between the softened softmax outputs of teacher and student to transfer knowledge.
FitNet [18]: It extracts a single intermediate representation learned by the teacher as knowledge and uses it to guide the student's learning at a single intermediate layer.
AT [24]: This approach uses the attention maps learned by the teacher network as distilled knowledge for the student network, improving student performance.
SP [22]: This approach argues that students need not mimic the teacher's representation space but should instead preserve the pairwise similarities in their own representation space.
VID [19]: This approach formulates knowledge transfer as maximizing the mutual information between the teacher and student networks.
RKD [25]: It uses relational information to transfer knowledge from teacher to student and thereby optimize student performance.
CRD [21]: It maps a linear projection of the student's features into the teacher's space and trains the student to capture more of the teacher's description of the data.
SRRL [23]: It decouples representation learning from classification and uses a convolutional kernel to transform the student features.
Training details. All training followed the settings of previous work [21], using the same hyperparameters as the respective distillation methods; the hyperparameter settings for SRRL were supplied by its authors [23]. A variety of representative networks were selected to form teacher–student pairs [10, 15, 43–46]. On the CIFAR-100 dataset, we use an SGD optimizer with Nesterov momentum 0.9 and batch size 64 and train for a total of 240 epochs; the initial learning rate is 0.01 for MobileNet and ShuffleNet and 0.05 for the other models, the learning rate is multiplied by 0.1 at the 150th, 180th, and 210th epochs, and the distillation temperature \(T\) is set to 4. On the Tiny-ImageNet dataset, an SGD optimizer with Nesterov momentum 0.9 is used; to obtain a pre-trained teacher model, the batch size is set to 64, a total of 120 epochs are trained, the initial learning rate is 0.05, and the learning rate is multiplied by 0.1 at the 30th, 60th, and 90th epochs; the other settings are the same as on CIFAR-100. In addition to the single-run results reported on Tiny-ImageNet, we conducted experiments on the CIFAR-100 dataset using different random seeds: our method was run three times, and the average accuracy is reported. The best results are bolded in the tables. The number of integrated projectors was set to 3, the hyperparameter \(\alpha\) was set to 400, and the hyperparameters of the different methods were fixed in all experiments for a fair comparison. All experiments were run on the same NVIDIA GeForce RTX 3080Ti GPU.
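A sketch of the CIFAR-100 optimization schedule described above (SGD with Nesterov momentum 0.9, batch size 64, 240 epochs, learning-rate decay by 0.1 at epochs 150, 180, and 210); the weight-decay value is a common setting we assume rather than one stated here.

```python
import torch

def build_optimizer_and_scheduler(student_parameters, lightweight=False):
    """CIFAR-100 schedule from the training details; `lightweight` selects MobileNet/ShuffleNet settings."""
    lr = 0.01 if lightweight else 0.05           # 0.01 for MobileNet/ShuffleNet, 0.05 otherwise
    optimizer = torch.optim.SGD(student_parameters, lr=lr, momentum=0.9,
                                nesterov=True, weight_decay=5e-4)   # weight decay assumed
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150, 180, 210], gamma=0.1)           # decay at the stated epochs
    return optimizer, scheduler
```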

Main results

Comparison of test accuracy

Tables 1 and 2 give the test accuracies of 12 different teacher–student pairs on the CIFAR-100 dataset. The teacher models in Table 1 are all ResNet-32 × 4, and the student models include both architecturally similar models and entirely different ones, including representative lightweight models. Experiments are performed on these network combinations with different distillation methods, and their accuracies are reported. The test accuracies in the tables show that our method consistently outperforms the competition on the CIFAR-100 dataset. On the two teacher–student combinations "ResNet-32 × 4&MobileNetV2" and "ResNet-32 × 4&ResNet-8 × 4", our method improves over the KD method by 2.97% and 2.79%, respectively; even compared with the best-performing SRRL method, our method still improves accuracy by 1.59% and 1.30%. Furthermore, on the network pair "VGG13&VGG8", the accuracy of the student model exceeds that of the teacher model after distillation with our method. In line with the self-distillation literature, a plausible explanation is that the student model becomes more robust through feature alignment and thus yields better results [47, 48].
Table 1
Top-1 test accuracy (%) of a variety of knowledge distillation methods on the CIFAR-100 dataset with ResNet-32 × 4 as the teacher (teacher accuracy 79.42 for every pair)

Student | ResNet-8 × 4 | ResNet-110 | VGG-8 | ShuffleNetV1 | ShuffleNetV2 | MobileNetV2
Student acc. | 72.88 ± 0.36 | 74.41 ± 0.16 | 70.44 ± 0.33 | 71.46 ± 0.13 | 72.54 ± 0.12 | 65.12 ± 0.15
KD | 74.42 ± 0.06 | 76.25 ± 0.32 | 72.74 ± 0.21 | 74.30 ± 0.15 | 75.51 ± 0.20 | 66.55 ± 0.44
FitNet | 74.43 ± 0.08 | 76.08 ± 0.11 | 72.81 ± 0.18 | 74.52 ± 0.09 | 75.90 ± 0.18 | 66.48 ± 0.26
AT | 75.07 ± 0.04 | 76.67 ± 0.25 | 72.03 ± 0.14 | 75.55 ± 0.19 | 75.46 ± 0.17 | 66.37 ± 0.21
SP | 74.29 ± 0.07 | 76.43 ± 0.33 | 73.08 ± 0.11 | 74.69 ± 0.33 | 75.71 ± 0.15 | 66.08 ± 0.25
VID | 74.55 ± 0.12 | 76.17 ± 0.24 | 73.48 ± 0.29 | 74.76 ± 0.24 | 75.29 ± 0.12 | 66.29 ± 0.37
RKD | 71.90 ± 0.11 | 74.25 ± 0.19 | 71.59 ± 0.21 | 72.15 ± 0.17 | 73.21 ± 0.13 | 64.45 ± 0.45
CRD | 75.51 ± 0.08 | 76.86 ± 0.10 | 73.66 ± 0.19 | 75.11 ± 0.24 | 75.65 ± 0.57 | 68.57 ± 0.24
SRRL | 75.62 ± 0.25 | 76.75 ± 0.20 | 73.26 ± 0.16 | 75.12 ± 0.36 | 76.20 ± 0.29 | 68.22 ± 0.47
Ours | 77.21 ± 0.18 | 76.82 ± 0.15 | 74.21 ± 0.13 | 76.01 ± 0.22 | 77.33 ± 0.30 | 69.52 ± 0.21
Table 2
Top-1 test accuracy (%) of different knowledge distillation methods on the CIFAR-100 dataset using different network pairs

Teacher | WRN-40-2 | ResNet-110 × 2 | WRN-40-2 | ResNet-32 × 4 | VGG13 | ResNet-110 × 2
Teacher acc. | 76.56 | 78.38 | 76.56 | 79.42 | 74.64 | 78.38
Student | WRN-16-2 | ResNet-110 | MobileNetV2 | WRN-40-2 | VGG8 | ShuffleNetV2
Student acc. | 73.31 ± 0.19 | 74.41 ± 0.16 | 65.12 ± 0.15 | 76.56 ± 0.18 | 70.36 ± 0.33 | 72.54 ± 0.12
KD | 75.10 ± 0.18 | 76.33 ± 0.28 | 68.47 ± 0.47 | 77.55 ± 0.13 | 72.98 ± 0.16 | 76.35 ± 0.32
FitNet | 73.52 ± 0.24 | 76.12 ± 0.19 | 68.33 ± 0.25 | 77.69 ± 0.23 | 71.12 ± 0.14 | 76.52 ± 0.23
AT | 74.67 ± 0.18 | 76.77 ± 0.25 | 67.84 ± 0.32 | 78.45 ± 0.21 | 72.66 ± 0.14 | 76.55 ± 0.20
SP | 73.71 ± 0.17 | 76.43 ± 0.36 | 68.82 ± 0.19 | 78.33 ± 0.09 | 72.82 ± 0.19 | 75.98 ± 0.22
VID | 73.94 ± 0.15 | 76.32 ± 0.28 | 68.77 ± 0.33 | 78.05 ± 0.37 | 72.67 ± 0.18 | 76.45 ± 0.27
RKD | 73.35 ± 0.15 | 75.25 ± 0.07 | 66.56 ± 0.42 | 76.16 ± 0.21 | 71.53 ± 0.08 | 73.51 ± 0.25
CRD | 75.54 ± 0.33 | 76.94 ± 0.09 | 69.12 ± 0.31 | 78.25 ± 0.16 | 73.98 ± 0.25 | 76.81 ± 0.27
SRRL | 75.69 ± 0.14 | 76.76 ± 0.26 | 69.25 ± 0.26 | 78.05 ± 0.18 | 73.52 ± 0.10 | 76.71 ± 0.30
Ours | 75.75 ± 0.25 | 77.42 ± 0.22 | 70.78 ± 0.39 | 78.60 ± 0.12 | 74.66 ± 0.13 | 77.23 ± 0.24
To further demonstrate the effectiveness of our proposed method, we conduct experiments on the larger and more complex Tiny-ImageNet dataset and evaluate our method on representative network pairs. Experimental results of different distillation methods on the Tiny-ImageNet dataset are presented in Table 3. As can be seen, some distillation methods perform poorly on Tiny-ImageNet, whereas CRD, SRRL, and our method still perform well. However, whereas CRD requires transforming the features of both the student and the teacher network, our method maps student features into the teacher space for feature matching, which is one reason it achieves better performance. The distilled ShuffleNetV2 (8.2M parameters) achieves nearly the same accuracy as ResNet32 × 4 (29.6M parameters) with only about one-quarter of its parameters.
Table 3
Test accuracy (%) of different knowledge distillation methods on the Tiny-ImageNet dataset. Pairs: (a) ResNet-32 × 4 & ResNet8 × 4, (b) ResNet-110 × 2 & VGG8, (c) ResNet-110 & MobileNetV2 × 2, (d) WRN-40-4 & ShuffleNetV2

Pair | Accuracy | Student | KD | FitNet | AT | SP | CRD | SRRL | Ours | Teacher
(a) | Top-1 | 52.41 | 54.07 | 53.81 | 52.61 | 53.66 | 54.44 | 54.50 | 56.27 | 62.01
(a) | Top-5 | 77.40 | 78.64 | 77.84 | 77.39 | 78.84 | 79.32 | 79.17 | 80.24 | 83.54
(b) | Top-1 | 51.16 | 52.93 | 52.89 | 52.79 | 53.31 | 53.45 | 53.07 | 54.32 | 61.21
(b) | Top-5 | 75.40 | 78.10 | 77.73 | 77.63 | 78.27 | 78.58 | 78.18 | 78.84 | 83.01
(c) | Top-1 | 51.18 | 53.55 | 53.90 | 52.80 | 53.85 | 54.21 | 54.54 | 55.38 | 55.68
(c) | Top-5 | 76.20 | 78.91 | 78.79 | 78.84 | 78.71 | 79.01 | 79.35 | 79.55 | 79.83
(d) | Top-1 | 58.44 | 60.71 | 60.68 | 59.47 | 61.29 | 61.63 | 61.35 | 61.73 | 62.84
(d) | Top-5 | 81.20 | 83.06 | 82.96 | 82.20 | 83.41 | 83.32 | 83.39 | 83.42 | 84.02
Figure 4a, b shows the test accuracy of the methods above over the course of training. Compared with the other methods, ours converges faster and consistently outperforms the other distillation methods at every epoch.

Analysis of projector integration

This section analyzes the performance of different projector types, compares the effect of horizontal and vertical integration on model performance, and finally discusses the effect of adding projectors on the number of model parameters. All experimental results are obtained on the CIFAR-100 dataset using the "ResNet-32 × 4&ResNet-8 × 4" combination. In some feature distillation methods, projectors are used to make student and teacher features as similar as possible; most of these projectors transform the features via simple \(1 \times 1\) convolution kernels or linear layers. Table 4 shows the test accuracy and final test loss when the student model performs the final inference with the teacher classifier after feature mapping through different types of single projectors. The results show that, compared with the convolutional projectors, the linear projector shows strong potential, outperforming them by 1.16% and 0.70% in accuracy, respectively. Since a single linear projector already outperforms the 1 × 1 Conv–1 × 1 Conv variant, we opt for the linear projector in our projector integration.
Table 4
Comparison of accuracy of different types of single projectors

Projector | Test loss (L_MDA) | Accuracy (%)
1 × 1 Conv | 3.346 | 75.25
1 × 1 Conv–1 × 1 Conv | 3.342 | 75.71
Linear | 2.219 | 76.41
Horizontal integration of projectors. We integrate linear projectors horizontally in the layer before the classifier and attach the teacher's classifier directly to the student model. Table 5 shows the Top-1 classification accuracy for different numbers of horizontally integrated projectors. Models using multiple horizontally integrated projectors are significantly better than those with a single projector, and student accuracy improves consistently as the number of projectors grows, with no further significant gain beyond 4 projectors. These results demonstrate that the horizontally integrated projector is simple and efficient and can significantly improve distillation performance.
Table 5
Top-1 classification accuracy on CIFAR-100 for different teacher–student pairs with the horizontally integrated projector. 1-Proj to 5-Proj denote the number of linear projectors integrated in one layer

Pair | Student | 1-Proj | 2-Proj | 3-Proj | 4-Proj | 5-Proj | Teacher
VGG13-VGG8 | 70.36 | 74.45 | 74.56 | 74.66 | 74.92 | 74.63 | 74.64
ResNet32 × 4-ResNet8 × 4 | 72.88 | 76.41 | 77.01 | 77.21 | 77.60 | 77.43 | 79.42
Vertical integration of projectors. Integration can also be achieved by cascading projectors in depth; we therefore attempt depth-wise integration to determine whether it yields better results. Table 6 shows how distillation performance changes as linear projection layers are stacked stepwise. In this table, 2-MLP, 3-MLP, 4-MLP, and 5-MLP are multilayer perceptrons in which each layer outputs m-dimensional features followed by ReLU activation. The results in Table 6 show that simply integrating projectors in the depth direction not only fails to improve accuracy but also reduces projection efficiency, degrading the performance of the student model. Therefore, we ultimately choose horizontal integration for our approach.
Table 6
Top-1 classification accuracy on CIFAR-100 for different teacher–student pairs with the vertically integrated projector. 1-MLP to 5-MLP indicate the number of linear layers stacked in the depth direction

Pair | Student | 1-MLP | 2-MLP | 3-MLP | 4-MLP | 5-MLP | Teacher
VGG13-VGG8 | 70.36 | 74.45 | 74.42 | 74.26 | 73.81 | 73.25 | 74.64
ResNet32 × 4-ResNet8 × 4 | 72.88 | 76.41 | 75.64 | 75.12 | 74.78 | 74.57 | 79.42
Adding a projector after the student encoder introduces additional parameters, and the projector size differs for different teacher–student combinations. The sizes of the projectors for various combinations are shown in Table 7. Although they differ in size, the overall increase in the number of parameters is manageable, allowing more advanced performance at low cost. Moreover, our integration strategy can flexibly change the number of integrated projectors to balance the additional parameters against performance. Because of this trade-off, we typically integrate 3 projectors rather than the 4 that gave the best results in the horizontal integration experiments. Even with only 3 integrated projectors, the results in Tables 1, 2, and 3 show that our method remains superior to the other methods in both performance and convergence speed.
Table 7
Projector and classifier size for different teacher–student combinations

Pair | Teacher classifier size (kB) | Projector size (kB) | Student classifier size (kB)
ResNet32 × 4 & ResNet8 × 4 | 101.9 | 258.6 | 101.9
ResNet32 × 4 & ResNet110 | 101.9 | 66.6 | 26.8
ResNet32 × 4 & VGG8 | 101.9 | 514.6 | 201.9
ResNet32 × 4 & ShuffleNetV1 | 101.9 | 962.6 | 376.6
ResNet32 × 4 & ShuffleNetV2 | 101.9 | 1026.6 | 401.9
ResNet32 × 4 & MobileNetV2 | 101.9 | 1282.6 | 502.1
WRN-40-2 & WRN-16-2 | 51.8 | 34.1 | 26.8
ResNet32 × 4 & WRN-40-2 | 101.9 | 130.6 | 51.8
VGG13 & VGG8 | 201.9 | 1027.6 | 201.9
ResNet110 & ShuffleNetV2 | 26.8 | 257.6 | 401.9
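As a rough back-of-the-envelope companion to Table 7, the storage of a linear projector can be estimated from the student and teacher feature dimensions; the dimensions in the example below are illustrative assumptions, and the estimate counts only float32 weights and biases, ignoring any checkpoint overhead.

```python
def linear_projector_size_kb(d_student, d_teacher, num_projectors=1, bytes_per_param=4):
    """Approximate storage (kB) of num_projectors linear layers (weights + biases) in float32."""
    params = num_projectors * (d_student * d_teacher + d_teacher)
    return params * bytes_per_param / 1024

# Example: a single 256 -> 256 linear projector is roughly 257 kB of float32 parameters.
print(round(linear_projector_size_kb(256, 256), 1))
```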

Analysis of teacher-classifier-share

The purpose of this section is to examine the critical role of the teacher classifier in distillation, for which we designed several experiments. Using the same setup as before, we integrate projectors horizontally in the layer before the classifier, and in the comparison without TCS we train a new student classifier to perform the final classification.
Tables 8 and 9 report the Top-1 accuracy for horizontally integrated projectors when inferring with the teacher classifier and when retraining a student classifier. With the same number of integrated projectors, students inferring with the teacher's classifier are significantly more accurate than those with a retrained student classifier. VGG8 reaches 74.45% accuracy with a single projector when using the teacher's classifier, which is 0.1% higher than the best result of 74.35% obtained by retraining the student classifier with three integrated projectors. For the same number of integrated projectors, the accuracy difference between models with and without the TCS operation is around 1%. Finally, we note that the models using the teacher classifier have greater potential to consistently improve student performance as the number of integrated projectors increases.
Table 8
Top-1 accuracy using the "VGG13&VGG8" network combination on the CIFAR-100 dataset

Setting | Student | 0-Proj | 1-Proj | 2-Proj | 3-Proj | 4-Proj | 5-Proj | Teacher
TCS | 70.36 | – | 74.45 | 74.56 | 74.66 | 74.92 | 74.63 | 74.64
Without-TCS | 70.36 | 73.76 | 73.84 | 74.21 | 74.35 | 74.18 | 74.01 | 74.64
Table 9
Top-1 accuracy using the "ResNet-32 × 4&ResNet-8 × 4" network combination on the CIFAR-100 dataset

Setting | Student | 0-Proj | 1-Proj | 2-Proj | 3-Proj | 4-Proj | 5-Proj | Teacher
TCS | 72.88 | – | 76.41 | 77.01 | 77.21 | 77.60 | 77.43 | 79.42
Without-TCS | 72.88 | 73.66 | 75.14 | 75.66 | 76.08 | 75.93 | 75.86 | 79.42

Hyperparameter effect

This section discusses the effect of hyperparameters on the performance of the student model. In our initial experiments, the loss did not consist of a single term: as in other distillation methods [14, 20, 22], we trained the student model with a joint objective. \(L_{\text{Initial}}\) is the initial loss, consisting of the vanilla distillation loss \(L_{KD}\) and the integrated-projector direction alignment loss \(L_{MDA}\), with \(\alpha\) and \(\beta\) as two hyperparameters balancing the two losses. The formula is as follows:
$$ L_{{\text{Initial}}} = \alpha L_{MDA} + \beta L_{KD} $$
(8)
We conducted experiments on the CIFAR-100 dataset with the two teacher–student pairs "ResNet-32 × 4&ResNet-8 × 4" and "VGG13&VGG8", fixing \(\alpha\) at 50 and training with different values of \(\beta\). The results are shown in Fig. 5b. Increasing \(\beta\) does not improve the model; instead, accuracy decreases continuously as \(\beta\) grows, which suggests that adding the vanilla KD loss on top of projector integration does not help distillation. When \(L_{KD}\) is absent, i.e., when \(\beta\) is 0, both teacher–student pairs obtain the best distillation results. We therefore drop the \(L_{KD}\) loss, so that the final loss function simplifies to Eq. (7).
We chose two teacher–student pairs, "ResNet-32 × 4&ResNet-8 × 4" and "WRN-40-2&MobileNetV2", to observe the effect of different values of \(\alpha\) on distillation on the CIFAR-100 dataset, with \(\alpha\) ranging from 25 to 600. The results are shown in Fig. 5a. They show that our method is not sensitive to the hyperparameter \(\alpha\): good distillation results are obtained for any value from 50 to 500. For the "ResNet-32 × 4&ResNet-8 × 4" combination, the best distillation result is obtained at \(\alpha = 400\), where the distilled student reaches 77.21% accuracy, and accuracy decreases once \(\alpha\) exceeds 500. However, \(\alpha\) cannot be too small; when its value is too small, the model collapses during training. As long as \(\alpha\) lies within this reasonable interval, the method proposed in this paper outperforms the other methods.

Conclusion and future work

In this paper, we present a simple and effective distillation method that exploits the value of the teacher classifier in distillation through simple parameter reuse. We also investigate the critical role of feature matching: because of mismatched feature dimensions, students in most cases cannot directly use the better-performing teacher classifier, so feature matching is crucial to our method, and the quality of the projection directly affects the final performance of the student. Building on the positive roles of the integrated projector and the teacher classifier in distillation, we conducted extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets, and our method performs competitively against other state-of-the-art methods across different combinations of teachers and students.
On the other hand, integrating projectors increases model complexity, and, although the overall number of parameters added by the integrated projectors is controllable, developing an efficient projection-free distillation scheme remains a challenging direction. Combining unstructured pruning with the projectors is a worthwhile way to reduce the additional parameters they introduce. In addition, the method proposed in this paper applies only to supervised distillation, for tasks such as image classification, machine translation, and data prediction; developing a distillation method for unsupervised learning scenarios is also a challenging research area.

Acknowledgements

This research was funded by the Key Research and Development Project of Anhui Province (202204c06020022, 202104a06020012, 201904a06020056), Independent Project of Anhui Key Laboratory of Smart Agricultural Technology and Equipment (APKLSATE2019X001).

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
22. Yang J, Martinez B, Bulat A et al (2021) Knowledge distillation via softmax regression representation learning. In: International conference on learning representations (ICLR)
32. Yang J, Martinez B, Bulat A et al (2020) Knowledge distillation via softmax regression representation learning. In: International conference on learning representations
41. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images
42. Le Y, Yang X (2015) Tiny imagenet visual recognition challenge. CS 231N 7(7):3
Metadata
Title: Knowledge distillation based on projector integration and classifier sharing
Authors: Guanpeng Zuo, Chenlu Zhang, Zhe Zheng, Wu Zhang, Ruiqing Wang, Jingqi Lu, Xiu Jin, Zhaohui Jiang, Yuan Rao
Publication date: 20.03.2024
Publisher: Springer International Publishing
Published in: Complex & Intelligent Systems
Print ISSN: 2199-4536
Electronic ISSN: 2198-6053
DOI: https://doi.org/10.1007/s40747-024-01394-3