Introduction
Related work
Network pruning
Knowledge distillation
Quantization
The proposed method
Progressive feature distillation
Output logits distillation learning
Total loss
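The loss terms named by the three subsections above are not reproduced in this extract. As a non-authoritative sketch, a standard composition of cross-entropy, temperature-scaled logits distillation (as in KD [25]), and feature-level distillation would look like the following; `alpha`, `beta`, and `T` are illustrative placeholders, not the paper's settings.

```python
# Hedged sketch of a combined distillation objective matching the
# subsection names above; weights and temperature are assumptions.
import torch
import torch.nn.functional as F

def total_loss(student_logits, teacher_logits, student_feats,
               teacher_feats, targets, alpha=1.0, beta=1.0, T=4.0):
    # Supervised cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # Output logits distillation: Hinton-style KL with temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Feature distillation: MSE over matched intermediate features
    # (the progressive weighting/scheduling is omitted here).
    feat = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    return ce + alpha * kd + beta * feat
```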
Experiments
Implementation details
Models | Appropriate pruning rate (APR)
---|---
VGGNet-16 | [0.95], [0.5]*6, [0.9]*4, [0.8]*2
GoogLeNet | [0.10], [0.80]*5, [0.85], [0.80]*3
ResNet-56 | [0.1], [0.60]*35, [0.0]*2, [0.6]*6, [0.4]*3, [0.1], [0.4], [0.1], [0.4], [0.1], [0.4]
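The rate lists read naturally as per-layer pruned fractions (the 13 VGGNet-16 entries match its 13 convolutional layers). As a hedged illustration, the sketch below applies such a list with L1-norm filter ranking, a common criterion; the paper's own selection rule may differ, and a full implementation would also slice the next layer's input channels and any following BatchNorm.

```python
# Minimal sketch: applying a per-layer pruning-rate list.
# Assumption: each rate is the fraction of output channels to REMOVE,
# and filters are ranked by L1 norm (a common stand-in criterion).
import torch
import torch.nn as nn

rates = [0.95] + [0.5] * 6 + [0.9] * 4 + [0.8] * 2  # VGGNet-16 row above

def kept_filter_indices(conv: nn.Conv2d, prune_rate: float) -> torch.Tensor:
    """Indices of output channels to keep, ranked by filter L1 norm."""
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per filter
    n_keep = max(1, int(conv.out_channels * (1.0 - prune_rate)))
    return scores.topk(n_keep).indices.sort().values

# Placeholder stack of 13 conv layers, not the paper's VGGNet-16.
model = nn.Sequential(*[nn.Conv2d(3 if i == 0 else 64, 64, 3, padding=1)
                        for i in range(len(rates))])
convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
for conv, rate in zip(convs, rates):
    keep = kept_filter_indices(conv, rate)
    print(f"keep {len(keep)}/{conv.out_channels} filters (rate={rate})")
```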
Model | Setting | Params | FLOPs
---|---|---|---
VGGNet-16 | Original | 14.98 M | 313.73 M
VGGNet-16 | 60% pruning rate | 5.88 M | 126.42 M
VGGNet-16 | 70% pruning rate | 4.48 M | 95.16 M
VGGNet-16 | APR | 2.79 M | 109.09 M
GoogLeNet | Original | 6.15 M | 1.52 B
GoogLeNet | 60% pruning rate | 2.83 M | 0.73 B
GoogLeNet | 70% pruning rate | 2.33 M | 0.59 B
GoogLeNet | APR | 1.77 M | 0.45 B
ResNet-56 | Original | 0.85 M | 125.49 M
ResNet-56 | 60% pruning rate | 0.33 M | 52.32 M
ResNet-56 | 70% pruning rate | 0.25 M | 39.21 M
ResNet-56 | APR | 0.47 M | 62.72 M
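To sanity-check Params/FLOPs figures like those above, one common recipe is counting parameters directly and using an off-the-shelf FLOP counter. The sketch below assumes 32×32 CIFAR-style inputs and a placeholder model; fvcore is an assumption, not necessarily the paper's tool, and counters disagree on whether a fused multiply-add counts as one or two FLOPs.

```python
# Hedged sketch for reproducing the Params/FLOPs columns.
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis

model = nn.Sequential(  # placeholder network, not the paper's VGGNet-16
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
params = sum(p.numel() for p in model.parameters())
flops = FlopCountAnalysis(model, torch.randn(1, 3, 32, 32)).total()
print(f"Params: {params / 1e6:.2f} M, FLOPs: {flops / 1e6:.2f} M")
```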
Main results
CIFAR-10/100
CIFAR-10 accuracy (%); gains in parentheses for "Ours" are over the corresponding baseline.

Pruning setting | Method | VGGNet-16 | GoogLeNet | ResNet-56
---|---|---|---|---
– | Teacher | 93.85 | 95.21 | 94.23
60% pruning rate | Baseline | 91.95 | 94.67 | 92.04
60% pruning rate | KD [25] | 92.34 | 94.71 | 92.04
60% pruning rate | FitNet [26] | 92.10 | 94.63 | 92.13
60% pruning rate | AT [27] | 92.38 | 94.89 | 92.15
60% pruning rate | SP [29] | 92.40 | 94.90 | 92.03
60% pruning rate | Ours | 92.62 (+0.67) | 95.17 (+0.50) | 92.32 (+0.28)
70% pruning rate | Baseline | 91.10 | 94.16 | 90.97
70% pruning rate | KD [25] | 90.95 | 94.45 | 91.09
70% pruning rate | FitNet [26] | 91.06 | 94.32 | 88.56
70% pruning rate | AT [27] | 91.07 | 94.49 | 91.30
70% pruning rate | SP [29] | 91.22 | 94.51 | 91.35
70% pruning rate | Ours | 91.55 (+0.45) | 95.01 (+0.85) | 91.49 (+0.52)
APR | Baseline | 92.28 | 93.87 | 92.46
APR | KD [25] | 92.46 | 93.87 | 92.50
APR | FitNet [26] | 92.37 | 94.07 | 92.56
APR | AT [27] | 92.21 | 94.27 | 92.75
APR | SP [29] | 92.45 | 94.21 | 92.65
APR | Ours | 92.91 (+0.63) | 94.80 (+0.93) | 93.10 (+0.64)
CIFAR-100 accuracy (%); gains again over the corresponding baseline.

Pruning setting | Method | VGGNet-16 | GoogLeNet | ResNet-56
---|---|---|---|---
– | Teacher | 73.95 | 80.49 | 73.34
60% pruning rate | Baseline | 69.22 | 78.42 | 67.43
60% pruning rate | KD [25] | 69.55 | 77.73 | 67.67
60% pruning rate | FitNet [26] | 69.54 | 78.09 | 67.45
60% pruning rate | AT [27] | 70.93 | 78.92 | 67.88
60% pruning rate | SP [29] | 70.10 | 78.91 | 67.59
60% pruning rate | Ours | 71.31 (+2.09) | 79.20 (+0.78) | 68.04 (+0.61)
70% pruning rate | Baseline | 66.93 | 77.41 | 64.69
70% pruning rate | KD [25] | 66.68 | 77.56 | 64.61
70% pruning rate | FitNet [26] | 66.77 | 77.65 | 64.82
70% pruning rate | AT [27] | 67.40 | 78.12 | 64.71
70% pruning rate | SP [29] | 67.13 | 78.11 | 64.82
70% pruning rate | Ours | 67.50 (+0.57) | 78.33 (+0.92) | 65.10 (+0.41)
APR | Baseline | 67.44 | 75.60 | 70.39
APR | KD [25] | 67.72 | 75.62 | 70.44
APR | FitNet [26] | 67.30 | 75.84 | 71.09
APR | AT [27] | 67.63 | 76.12 | 71.10
APR | SP [29] | 66.76 | 76.13 | 71.20
APR | Ours | 68.10 (+0.66) | 76.63 (+1.01) | 71.62 (+1.23)
Tiny-ImageNet
Pruning setting | Method | GoogLeNet
---|---|---
– | Teacher | 61.81
60% pruning rate | Baseline | 60.38
60% pruning rate | KD [25] | 60.25
60% pruning rate | FitNet [26] | 60.13
60% pruning rate | AT [27] | 60.34
60% pruning rate | SP [29] | 60.10
60% pruning rate | Ours | 60.78 (+0.40)
70% pruning rate | Baseline | 58.53
70% pruning rate | KD [25] | 58.38
70% pruning rate | FitNet [26] | 58.44
70% pruning rate | AT [27] | 58.54
70% pruning rate | SP [29] | 58.64
70% pruning rate | Ours | 59.05 (+0.52)
Improvement at each pruning rate
Ablation study
Components are added cumulatively from left to right; the 91.10 baseline and 91.55 final accuracy correspond to VGGNet-16 at the 70% pruning rate on CIFAR-10.

Metric | Baseline | + Feature distillation | + Progressive learning (“Progressive feature distillation”) | + Output distillation (“Output logits distillation learning”)
---|---|---|---|---
Accuracy (%) | 91.10 | 91.41 | 91.45 | 91.55
Diff (vs. baseline) | – | +0.31 | +0.35 | +0.45
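The +0.35 column separates the progressive schedule from plain feature distillation. The paper's schedule is not shown in this extract; the snippet below is a purely hypothetical illustration of one way to phase feature stages in over training.

```python
# Hypothetical sketch of progressive feature distillation: distill
# shallow stages first and bring deeper stages in as training
# advances. The linear schedule and stage granularity are
# illustrative assumptions, not the paper's design.
def active_stages(epoch: int, total_epochs: int, n_stages: int) -> int:
    """How many feature stages (counted from the input) to distill."""
    return min(n_stages, 1 + (epoch * n_stages) // total_epochs)

# e.g. with 4 stages over 200 epochs, stage 2 joins at epoch 50,
# stage 3 at epoch 100, and stage 4 at epoch 150:
assert active_stages(0, 200, 4) == 1
assert active_stages(50, 200, 4) == 2
assert active_stages(199, 200, 4) == 4
```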