Abstract

Quantization of neural networks has been one of the most popular techniques to compress models for embedded (IoT) hardware platforms with highly constrained latency, storage, memory-bandwidth, and energy specifications. Limiting the number of bits per weight and activation has been the main focus in the literature. To avoid major degradation of accuracy, common quantization methods introduce additional scale factors to adapt the quantized values to the diverse data ranges present in full-precision (floating-point) neural networks. These scales are usually kept in high precision, requiring the target compute engine to support a few high-precision multiplications, which is undesirable due to the larger hardware cost. Little effort has yet been invested in avoiding high-precision multipliers altogether, especially in combination with 4 bit weights. This work proposes a new quantization scheme, based on power-of-two quantization scales, that performs on par with uniform per-channel quantization with full-precision 32 bit quantization scales when using only 4 bit weights. This is achieved through the addition of a low-precision lookup-table that translates stored 4 bit weights into nonuniformly distributed 8 bit weights for internal computation. All our quantized ImageNet CNNs achieved or even exceeded the Top-1 accuracy of their full-precision counterparts, with ResNet18 exceeding its full-precision model by 0.35%. Our MobileNetV2 model achieved state-of-the-art performance with only a slight drop in accuracy of 0.51%.
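
The sketch below illustrates the two ideas from the abstract: a power-of-two quantization scale that turns the requantization multiply into a bit shift, and a small lookup-table that expands stored 4 bit weight indices into nonuniform 8 bit weights for computation. It is only an illustration under assumed values, not the authors' implementation: the LUT entries and the shift amount are made up for demonstration.

    import numpy as np

    # Hypothetical 16-entry lookup table: each stored 4 bit weight index maps to a
    # nonuniformly spaced signed 8 bit value used for the actual multiply-accumulates.
    # The spacing below is illustrative only (denser near zero, coarser at the tails).
    LUT = np.array([-128, -96, -72, -52, -36, -24, -14, -6,
                       0,    6,  14,  24,  36,  52,  72, 104], dtype=np.int8)

    def dequantize_weights(w4: np.ndarray) -> np.ndarray:
        """Translate stored 4 bit weight indices (0..15) into 8 bit LUT values."""
        return LUT[w4]

    def requantize(acc: np.ndarray, shift: int) -> np.ndarray:
        """Rescale 32 bit accumulators with a power-of-two scale, i.e. an arithmetic
        right shift instead of a high-precision multiplication, then clamp to int8."""
        out = np.right_shift(acc, shift)              # divide by 2**shift
        return np.clip(out, -128, 127).astype(np.int8)

    # Toy example: one output channel of a convolutional or fully connected layer.
    w4 = np.array([3, 8, 12, 15], dtype=np.uint8)     # stored 4 bit weight indices
    x8 = np.array([17, -5, 42, 9], dtype=np.int8)     # 8 bit input activations
    acc = np.sum(dequantize_weights(w4).astype(np.int32) * x8.astype(np.int32))
    y8 = requantize(np.array([acc]), shift=7)         # per-channel scale of 2**-7
    print(y8)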

Highlights

  • Quantization of neural networks dates back to the 1990s [1,2], when the discretization of models was a necessity to make their implementation feasible on the available hardware

  • We present an extensive literature overview of uniform and nonuniform quantization for fixed-point inference

  • A novel modification to a neural network compute engine is introduced to improve the accuracy of models with 4 bit weights and 8 bit activations, in conjunction with bit-shift-based scaling, through the aid of a lookup-table

  • A quantization-aware training method is proposed to optimize the models that need to run on our proposed compute engine (an illustrative training sketch follows these highlights)

  • We are the first to make a fair empirical comparison between the performance of quantized models with full-precision and power-of-two scales with either per-layer or per-channel quantization using 4 bit weights

  • Our source code has been made publicly available at https://gitlab.com/EAVISE/lutmodel-quantization

  • Since Cross-Layer Equalization (CLE) is applied prior to quantization and requires no additional training, we applied it to all our other MobileNetV2 experiments
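
To complement the compute-engine highlight above, here is one way a LUT-based weight quantizer could be trained with quantization-aware training using a straight-through estimator. This is a hedged sketch only: the class name LUTWeightQuantizer, the LUT values, and the shift are assumptions for illustration and are not taken from the paper's released code.

    import torch

    class LUTWeightQuantizer(torch.nn.Module):
        """Hypothetical fake-quantizer for quantization-aware training: weights are
        snapped to the nearest entry of a fixed 16-value LUT (interpreted at a
        power-of-two scale) in the forward pass, while gradients flow straight through."""
        def __init__(self, lut_values, shift=7):
            super().__init__()
            # Nonuniform 8 bit levels (illustrative), interpreted at scale 2**-shift.
            self.register_buffer(
                "levels", torch.tensor(lut_values, dtype=torch.float32) / 2**shift)

        def forward(self, w):
            # Snap each weight to the nearest LUT level.
            idx = torch.argmin((w.unsqueeze(-1) - self.levels).abs(), dim=-1)
            w_q = self.levels[idx]
            # Straight-through estimator: quantized values in the forward pass,
            # but backpropagation behaves as if the quantizer were the identity.
            return w + (w_q - w).detach()

    # Usage: fake-quantize a layer's weights before the convolution or matmul.
    quantizer = LUTWeightQuantizer([-128, -96, -72, -52, -36, -24, -14, -6,
                                       0,    6,  14,  24,  36,  52,  72, 104])
    w = torch.randn(8, 4, requires_grad=True)
    loss = (quantizer(w) ** 2).sum()
    loss.backward()   # gradients reach w thanks to the straight-through estimator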

Summary

Introduction

Quantization of neural networks dates back to the 1990s [1,2], when the discretization of models was a necessity to make their implementation feasible on the available hardware. Neural networks became popular again because of the ImageNet challenge [3] and the availability of powerful GPU hardware. This breakthrough started a new area of research with hundreds of new potential applications. Among popular compression techniques such as model pruning [4] and network architecture search [5], model quantization [6] is one of the most effective ways to reduce latency, storage cost, memory bandwidth, energy consumption, and silicon area. The quantization of neural networks is a frequently visited research topic with numerous publications that mostly focus on reducing the number of bits per weight or activation as much as possible in order to achieve high compression rates [7,8,9,10,11].
