Abstract
Uniform quantization is widely taken as an efficient compression method in practical applications. Despite its merit of having a low computational overhead, uniform quantization fails to preserve sensitive components in neural networks when applied with ultra-low bit precision, which could lead to a non-trivial accuracy degradation. Previous works have applied mixed-precision quantization to address this problem. However, finding the correct bit settings for different layers always demands significant time and resource consumption. Moreover, mixed-precision quantization is not well supported on current general-purpose machines such as GPUs and CPUs and, thus, will cause intolerable overheads in deployment. To leverage the efficiency of uniform quantization while maintaining accuracy, in this paper, we propose sensitivity-aware network adaptation (SANA), which automatically modifies the model architecture based on sensitivity analysis to make it more compatible with uniform quantization. Furthermore, we formulated four different channel initialization strategies to accelerate the quantization-aware fine-tuning process of SANA. Our experimental results showed that SANA can outperform standard uniform quantization and other state-of-the-art quantization methods in terms of accuracy, with comparable or even smaller memory consumption. Notably, ResNet-50-SANA (24.4 MB) with W4A8 quantization achieved 77.8% top-one accuracy on ImageNet, which even surpassed the 77.6% of the full-precision ResNet-50 (97.8 MB) baseline.
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have