Abstract

In this paper, we propose an online ensemble distillation (OED) method that automatically prunes blocks/layers of a target network by transferring knowledge from a strong teacher in an end-to-end manner. To accomplish this, we first introduce a soft mask that scales the output of each block in the target network and enforce sparsity of the masks through sparsity regularization. A strong teacher network is then constructed online by replicating the target network and ensembling the discriminative features from each replica as its own features. Cooperative learning between the multiple target networks and the teacher network is further conducted in a closed-loop form, which improves the performance of both. To solve the optimization problem in an end-to-end manner, we employ the fast iterative shrinkage-thresholding algorithm (FISTA), which quickly and reliably removes the redundant blocks whose soft masks reach zero. Compared with other structured pruning methods that rely on iterative fine-tuning, the proposed OED is trained more efficiently in a single training cycle. Extensive experiments demonstrate the effectiveness of OED, which not only simultaneously compresses and accelerates a variety of CNN architectures but also enhances the robustness of the pruned networks.
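To make the block-level soft mask concrete, the following PyTorch-style sketch gates each residual block's output with a learnable scalar mask and adds an L1 penalty on all masks to the training loss. It is a minimal illustration only; the module and function names (MaskedResidualBlock, sparsity_penalty) and the penalty weight are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedResidualBlock(nn.Module):
    """Residual block whose branch output is scaled by a learnable soft mask.

    Sketch under assumed names: when sparsity regularization drives the mask
    to zero, the block reduces to an identity mapping and can be removed.
    """
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block                       # original residual branch F(x)
        self.mask = nn.Parameter(torch.ones(1))  # soft mask, initialized to 1

    def forward(self, x):
        # y = x + m * F(x); a zero mask turns the block into an identity shortcut
        return x + self.mask * self.block(x)

def sparsity_penalty(model: nn.Module, weight: float = 1e-4) -> torch.Tensor:
    """L1 regularization over all soft masks, encouraging some masks to reach zero."""
    masks = [m.mask for m in model.modules() if isinstance(m, MaskedResidualBlock)]
    if not masks:
        return torch.zeros(())
    return weight * torch.stack([m.abs().sum() for m in masks]).sum()
```

In this reading, the total training loss would combine the supervised and distillation terms with sparsity_penalty(model), and blocks whose masks are exactly zero after optimization are dropped from the pruned network.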

Highlights

  • In recent years, convolutional neural networks (CNNs) have achieved remarkable success in many computer vision tasks, for instance image recognition [24], [39], [70], [72], object detection [13], [65], and semantic segmentation [69].

  • Different from these approaches, our method constructs a strong teacher network online, without pre-training, providing more knowledge to improve the performance of the target network (a sketch of this online ensembling follows this list).

  • Aiming to prune residual blocks, we add a soft mask after each block to determine its importance and propose online ensemble distillation to acquire more knowledge and improve the accuracy of the pruned network.
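
The highlight above on the online teacher can be illustrated with the hedged sketch below, assuming the teacher's features are obtained by averaging the features of several identically structured target networks and that distillation uses a KL divergence between temperature-softened logits. The names (OnlineEnsembleTeacher, extract_features, distillation_loss) and the averaging scheme are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineEnsembleTeacher(nn.Module):
    """Builds teacher logits online by ensembling features from several
    replicated target networks (a sketch; the ensembling rule is an assumption)."""
    def __init__(self, targets: nn.ModuleList, feat_dim: int, num_classes: int):
        super().__init__()
        self.targets = targets
        self.classifier = nn.Linear(feat_dim, num_classes)  # teacher head

    def forward(self, x):
        # assumes each target exposes an extract_features(x) method
        feats = [t.extract_features(x) for t in self.targets]
        teacher_feat = torch.stack(feats, dim=0).mean(dim=0)  # ensemble by averaging
        return self.classifier(teacher_feat)

def distillation_loss(student_logits, teacher_logits, temperature: float = 3.0):
    """KL divergence between softened teacher and student distributions.
    Whether teacher_logits are detached is a design choice; in a closed-loop
    scheme gradients may be allowed to flow into the teacher as well."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits / t, dim=1),
        reduction="batchmean",
    ) * (t * t)
```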


Summary

INTRODUCTION

Convolutional neural networks (CNNs) have achieved remarkable success in many computer vision tasks, for instance image recognition [24], [39], [70], [72], object detection [13], [65], and semantic segmentation [69]. To mitigate the drop in accuracy, learning-based structured pruning trains the networks from scratch with sparse constraints on the weights [2], [75] or on the scaling factors [31], [56], [78], using supervised class labels. Lin et al. [52] proposed a global and dynamic pruning scheme that reduces the number of redundant filters by greedy alternating updates. All these greedy-based pruning methods iteratively prune each filter or layer and retrain the remaining model in a multi-stage manner, which is prohibitively costly when compressing deeper networks. During training, the weights of layer l are updated by gradient descent as W_l ← W_l − η · ∂L_CE(W)/∂W_l, where the partial derivative of the cross-entropy loss L_CE(W) with respect to W_l is computed by back-propagation and η is the learning rate.
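
The update rule above, combined with the FISTA step mentioned in the abstract, suggests a training step in which the network weights follow plain gradient descent on the cross-entropy loss while the block masks receive a proximal (soft-thresholding) step that can drive them exactly to zero. The sketch below illustrates that reading; the step sizes, the penalty weight lambda_, the parameter split, and the function names (soft_threshold, train_step) are assumptions, and the paper's FISTA schedule additionally uses momentum-like extrapolation not shown here.

```python
import torch

def soft_threshold(x: torch.Tensor, thresh: float) -> torch.Tensor:
    """Proximal operator of the L1 norm: shrinks values toward zero and sets
    small values exactly to zero (the core step of ISTA/FISTA updates)."""
    return torch.sign(x) * torch.clamp(x.abs() - thresh, min=0.0)

def train_step(weights, masks, loss, lr=0.1, mask_lr=0.01, lambda_=1e-4):
    """One illustrative update on lists of parameters: SGD on the weights,
    gradient step plus L1 proximal shrinkage on the block masks."""
    grads = torch.autograd.grad(loss, list(weights) + list(masks))
    w_grads, m_grads = grads[:len(weights)], grads[len(weights):]
    with torch.no_grad():
        for w, g in zip(weights, w_grads):
            w -= lr * g                               # W_l <- W_l - eta * dL_CE/dW_l
        for m, g in zip(masks, m_grads):
            m.copy_(soft_threshold(m - mask_lr * g,   # gradient step on the data loss
                                   mask_lr * lambda_))  # then soft-thresholding shrinkage
    return loss
```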

ONLINE ENSEMBLE DISTILLATION FOR RESIDUAL BLOCK PRUNING
OPTIMIZATION
EXPERIMENTS
CONCLUSION
