Abstract

As model predictions become more accurate and networks grow deeper, the memory consumed by neural networks becomes a problem, especially on mobile devices, where the tradeoff between computational cost and battery life is also difficult to balance and limits how much smarter these devices can become. Model quantization techniques offer a way to tackle this tradeoff by reducing memory bandwidth and storage requirements and improving system throughput and latency. This paper discusses and compares state-of-the-art neural network quantization methodologies, including Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ directly quantizes an already trained floating-point model; the implementation is simple and requires no quantization during the training phase. QAT inserts simulated quantization operations to model the effect of quantization during training, while the forward and backward passes are usually still performed in floating point. Finally, based on the experiments discussed in this paper, we conclude that as quantization techniques evolve, the accuracy gap between PTQ and QAT is shrinking.
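To make the two approaches concrete, the following is a minimal sketch of uniform affine quantization, assuming 8-bit integers and per-tensor scale and zero-point. The functions quantize, dequantize, calibrate, and fake_quantize are illustrative names, not from the paper: calibrate followed by quantize corresponds to the PTQ path of converting an already trained floating-point tensor, while fake_quantize (quantize then immediately dequantize) corresponds to the simulated quantization operation QAT inserts so that training still runs in floating point.

import numpy as np

def quantize(x, scale, zero_point, num_bits=8):
    # Map float values to integers: q = round(x / scale) + zero_point, clipped to the int range.
    qmin, qmax = 0, 2 ** num_bits - 1
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    # Map integers back to approximate float values.
    return scale * (q.astype(np.float32) - zero_point)

def calibrate(x, num_bits=8):
    # Derive scale/zero-point from the observed value range, as a simple PTQ calibration step might.
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def fake_quantize(x, num_bits=8):
    # Quantize-then-dequantize: the simulated quantization used in QAT so the
    # forward/backward passes stay in floating point while seeing quantization error.
    scale, zero_point = calibrate(x, num_bits)
    return dequantize(quantize(x, scale, zero_point, num_bits), scale, zero_point)

weights = np.random.randn(4, 4).astype(np.float32)
print("max abs quantization error:", np.abs(weights - fake_quantize(weights)).max())

In an actual QAT setup the rounding step is typically bypassed in the backward pass (a straight-through estimator), which this sketch does not model.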
