Model quantization is widely used to realize the promise of ubiquitous embedded deep network inference. While mixed-precision quantization has shown promising performance, existing approaches often rely on a time-consuming search process to determine the optimal bit configuration. To address this, we introduce Hessian-based Mixed-Precision Quantization-Aware Training (HMQAT), which reduces the overhead of searching for a bit configuration. Using a sensitivity metric that jointly considers the average Hessian trace and the parameter size, HMQAT effectively guides the search process. The bit-configuration optimization problem is solved automatically with a Pareto frontier method, so our scheme achieves the lowest search overhead. Additionally, our approach incorporates quantization-transition-aware fine-tuning of the scale factors. This strategy consistently ensures optimal inference performance along the accuracy-size Pareto frontier across multiple models. We extensively evaluate our method on ImageNet and CIFAR-10. In particular, we show that compared to the baseline, HMQAT achieves a 10.34× reduction in model size while retaining 99.81% of the Top-1 accuracy of ResNet18 on ImageNet. Moreover, HMQAT surpasses state-of-the-art mixed-precision quantization methods, compressing neural networks at reduced search cost while achieving a favorable trade-off between size and accuracy. This study paves the way for deploying neural networks on lightweight devices.
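To make the two ingredients named in the abstract concrete, the sketch below illustrates one plausible reading of them: a per-layer sensitivity score built from the average Hessian trace and the quantization perturbation, and a Pareto frontier over candidate bit configurations trading model size against total sensitivity. It is not the authors' implementation; the layer statistics, the perturbation proxy, and the bit choices are hypothetical placeholders (real Hessian traces would typically come from a Hutchinson-style stochastic estimator on the training loss).

```python
# Illustrative sketch (not the paper's code): Pareto-frontier bit allocation
# guided by a Hessian-trace-and-size sensitivity metric.
from itertools import product

# Hypothetical per-layer statistics: (name, average Hessian trace, #parameters).
layers = [
    ("conv1",  5.2e-3,     9_408),
    ("layer1", 1.1e-3,   147_968),
    ("layer3", 4.0e-4, 2_099_712),
    ("fc",     8.5e-5,   512_000),
]
bit_choices = (2, 4, 8)  # assumed candidate bit-widths per layer

def perturbation(bits):
    # Simple proxy for the squared error of b-bit uniform quantization;
    # any function that decreases monotonically with bit-width would do.
    return 2.0 ** (-2 * bits)

def evaluate(config):
    # Sensitivity objective: sum over layers of Hessian-trace-weighted
    # quantization perturbation; model size follows from per-layer bit-widths.
    sens = sum(tr * perturbation(b) for (_, tr, _), b in zip(layers, config))
    size = sum(n * b for (_, _, n), b in zip(layers, config)) / 8  # bytes
    return size, sens

# Enumerate candidate bit configurations and keep only Pareto-optimal ones:
# configurations that no other configuration dominates in both size and sensitivity.
points = [(evaluate(cfg), cfg) for cfg in product(bit_choices, repeat=len(layers))]
frontier = [
    ((size, sens), cfg)
    for (size, sens), cfg in points
    if not any(s2 <= size and e2 <= sens and (s2, e2) != (size, sens)
               for (s2, e2), _ in points)
]
for (size, sens), cfg in sorted(frontier):
    print(f"size={size / 1e6:.2f} MB  sensitivity={sens:.3e}  bits={cfg}")
```

In this toy setting exhaustive enumeration is feasible; for a full network the same separable objective would be swept more cheaply, but the resulting accuracy-size frontier plays the same role as described above.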