Abstract

Neural network quantization is a highly desirable procedure to perform before running neural networks on mobile devices. Quantization without fine-tuning leads to a drop in model accuracy, whereas the commonly used training with quantization is done on the full labeled dataset and is therefore both time- and resource-consuming. Real-life applications require a simplified and accelerated quantization procedure that maintains the accuracy of the full-precision neural network, especially for modern mobile architectures such as MobileNet-v1, MobileNet-v2, and MNAS. Here we present two methods that significantly optimize the training-with-quantization procedure. The first introduces trained scale factors for the discretization thresholds, separate for each filter. The second is based on mutual rescaling of consecutive depthwise separable convolution and convolution layers. Using the proposed techniques, we quantize these modern mobile architectures with a training set of only 10% of the full ImageNet 2012 sample. Such a reduction of the training set size, combined with the small number of trainable parameters, allows the network to be fine-tuned within several hours while maintaining the high accuracy of the quantized model (the accuracy drop was less than 0.5%). Ready-for-use models and code are available at: https://github.com/agoncharenko1992/FAT-fast-adjustable-threshold.

Take-aways

The paper describes how to obtain an 8-bit quantized network. The main idea is that simple min/max quantization with calibration works poorly because outliers corrupt the quantization thresholds. These thresholds can be adjusted during fine-tuning using straight-through estimators. Together with a few tricks such as batch normalization folding and channel equalization (more details can be found in the paper), this yields a solution as good as training with quantization from scratch, but with less data and much faster.
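To make the threshold-adjustment idea concrete, here is a minimal PyTorch sketch of fake quantization with a trainable threshold scale trained through a straight-through estimator. The names (`FakeQuantize`, `alpha`, `init_threshold`) are illustrative assumptions rather than the paper's actual code; `init_threshold` would come from a min/max calibration pass.

```python
import torch
import torch.nn as nn

class RoundSTE(torch.autograd.Function):
    """Rounding with a straight-through estimator: identity gradient."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Pretend rounding is the identity so gradients reach the threshold.
        return grad_output

class FakeQuantize(nn.Module):
    """Simulated symmetric 8-bit quantization with a trainable threshold.

    Illustrative sketch, not the paper's code. The initial threshold comes
    from min/max calibration; the trained scale factor `alpha` lets
    fine-tuning shrink or grow it so outliers no longer dictate the range.
    """
    def __init__(self, init_threshold: float, num_bits: int = 8):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # trained scale factor
        self.register_buffer("t_init", torch.tensor(float(init_threshold)))
        self.levels = 2 ** num_bits - 1

    def forward(self, x):
        t = self.alpha * self.t_init               # adjusted threshold
        scale = self.levels / (2 * t)
        x = torch.clamp(x, -t, t)                  # saturate outliers at the threshold
        return RoundSTE.apply(x * scale) / scale   # quantize-dequantize
```

Because only scale factors like `alpha` (one per quantized tensor or per filter) receive gradients, a small calibration subset suffices for fine-tuning, which is consistent with the paper's use of roughly 10% of ImageNet.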
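Batch normalization folding merges the BN statistics into the preceding convolution so that quantization thresholds are calibrated on the weights that will actually run at inference time. Below is a sketch of the standard folding arithmetic for a Conv-BN pair with per-channel statistics; the function name is hypothetical.

```python
import torch

def fold_batch_norm(conv_w, conv_b, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Fold BatchNorm statistics into the preceding convolution.

    Standard folding arithmetic (function name is hypothetical). Returns
    the weight and bias of an equivalent convolution, so thresholds are
    calibrated on the merged layer.
    """
    std = torch.sqrt(bn_var + eps)
    scale = bn_gamma / std                           # per output channel
    folded_w = conv_w * scale.reshape(-1, 1, 1, 1)   # (C_out, C_in, kH, kW)
    if conv_b is None:
        conv_b = torch.zeros_like(bn_mean)
    folded_b = (conv_b - bn_mean) * scale + bn_beta
    return folded_w, folded_b
```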
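Channel equalization exploits the fact that, with a positively homogeneous activation such as ReLU in between, a depthwise convolution and the following 1x1 convolution can be mutually rescaled per channel without changing the network's output. The sketch below balances per-channel weight ranges as one plausible choice of scale; the paper's exact rescaling rule may differ, and the tensor layouts assume PyTorch conventions.

```python
import torch

def equalize_dw_pw(dw_w, dw_b, pw_w, eps=1e-12):
    """Mutually rescale a depthwise conv and the following 1x1 conv.

    One plausible balancing rule; the paper's exact rule may differ.
    Dividing channel c of the depthwise output by s_c and multiplying the
    matching input channel of the pointwise conv by s_c leaves the network
    function unchanged when the activation in between is positively
    homogeneous (e.g. ReLU).
    """
    r_dw = dw_w.abs().amax(dim=(1, 2, 3))   # (C,) depthwise: (C, 1, kH, kW)
    r_pw = pw_w.abs().amax(dim=(0, 2, 3))   # (C,) pointwise: (C_out, C, 1, 1)
    s = torch.sqrt(r_dw / r_pw.clamp(min=eps)).clamp(min=eps)
    dw_w = dw_w / s.reshape(-1, 1, 1, 1)
    if dw_b is not None:
        dw_b = dw_b / s
    pw_w = pw_w * s.reshape(1, -1, 1, 1)
    return dw_w, dw_b, pw_w
```

Balancing the ranges this way narrows the per-channel spread of weight magnitudes, so a single quantization threshold per layer wastes fewer levels on a handful of wide channels.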
