Deep neural network (DNN) model compression is a popular and important optimization method for efficient and fast hardware acceleration. However, a compressed model is typically fixed, with no ability to tune its computational complexity (i.e., hardware latency) on the fly in response to dynamic latency requirements, workloads, and computing hardware resource allocation. To address this challenge, dynamic DNNs with run-time adaptable computing structures have been constructed by training with a joint cross-entropy objective over multiple subnets sampled from a supernet. Our investigation in this work shows that dynamic-inference performance relies heavily on the quality of subnet sampling. To construct a dynamic DNN with multiple high-quality subnets, we propose a progressive subnetwork searching framework, which incorporates several new techniques, including trainable noise ranking, channel-group sampling, selective fine-tuning, and subnet filtering. Our proposed framework achieves higher accuracy for all subnets of the target dynamic DNN than prior works on both the CIFAR-10 and ImageNet datasets. Specifically, compared with US-Net, our method achieves an average accuracy gain of 0.9% for AlexNet, 2.5% for ResNet18, 1.1% for VGG11, and 0.58% for MobileNetV1 on the ImageNet dataset. Moreover, to demonstrate run-time tuning of the computing latency of a dynamic DNN in real computing systems, we deploy the constructed dynamic networks on an Nvidia Titan graphics processing unit (GPU) and an Intel Xeon central processing unit (CPU), showing clear improvements over prior works. The code is available at https://github.com/ASU-ESIC-FAN-Lab/Dynamic-inference.
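To make the supernet/subnet training idea concrete, below is a minimal PyTorch sketch of training a width-adjustable supernet with a summed cross-entropy loss over the widest network and a few randomly sampled channel-group subnets. It is an illustrative simplification, not the authors' implementation: all names (SlimmableConv2d, TinySupernet, WIDTH_RATIOS, train_step) are hypothetical, zero-padding the pooled features of narrow subnets stands in for a properly slimmable classifier head, and the paper's specific components (trainable noise ranking, selective fine-tuning, subnet filtering) are not shown.

```python
# Illustrative sketch only (assumes PyTorch); not the paper's actual code.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

WIDTH_RATIOS = [0.25, 0.5, 0.75, 1.0]   # candidate channel-group fractions (assumed)

class SlimmableConv2d(nn.Conv2d):
    """Conv layer that uses only the first `ratio` fraction of its filters."""
    def forward(self, x, ratio=1.0):
        out_ch = max(1, int(self.out_channels * ratio))
        in_ch = x.shape[1]                     # match whatever width the previous layer produced
        weight = self.weight[:out_ch, :in_ch]  # slice output and input channel groups
        bias = self.bias[:out_ch] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding)

class TinySupernet(nn.Module):
    """Toy two-layer supernet; real models (AlexNet, ResNet, etc.) would be built the same way."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = SlimmableConv2d(3, 32, 3, padding=1)
        self.conv2 = SlimmableConv2d(32, 64, 3, padding=1)
        self.head = nn.Linear(64, num_classes)

    def forward(self, x, ratio=1.0):
        x = F.relu(self.conv1(x, ratio))
        x = F.relu(self.conv2(x, ratio))
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        # Zero-pad narrow subnets' features so a single shared head applies (a simplification).
        if x.shape[1] < self.head.in_features:
            x = F.pad(x, (0, self.head.in_features - x.shape[1]))
        return self.head(x)

def train_step(model, optimizer, images, labels, n_subnets=2):
    """One update: sum cross-entropy over the full-width net and sampled subnets."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images, ratio=1.0), labels)   # widest (super)net
    for ratio in random.sample(WIDTH_RATIOS[:-1], n_subnets):  # sampled channel-group subnets
        loss = loss + F.cross_entropy(model(images, ratio), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    model = TinySupernet()
    opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
    images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    print(train_step(model, opt, images, labels))
```

At deployment time, the same shared weights serve every width setting, so latency can be traded against accuracy at run time simply by choosing a different `ratio` for inference.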