Abstract

The world today generates enormous amounts of multimedia data every second, and technologies are being developed to understand and exploit it. Deep learning is among the best-performing branches of artificial intelligence and is widely used to solve complex problems. As demand for deep learning grows across domains, and shifts from the cloud to edge devices, optimizing its implementation has become an active area of research, with compression of deep learning models being one such direction. In this paper we propose performance improvements for Deep Neural Networks (DNNs) through Quantization Aware Training (QAT) in an 8-bit low-precision setting. We further introduce our implementation of fake quantization during both training and inference of a deep neural network in the 8-bit setting, along with its performance improvements over contemporary quantization techniques. We demonstrate reduced quantization loss, inference time, and memory footprint for LeNet on the MNIST and CIFAR-10 datasets and for the MobileNet architecture on the ImageNet dataset. Our method achieves inference-time improvement of up to 11%, accuracy improvement of up to 44.75%, and memory footprint reduction of up to 0.5% over the post-training quantization (calibration) technique. The paper also offers a roadmap for designing better deep learning systems that reduce memory footprint and latency when deploying DNNs on resource-constrained (edge) devices using QAT.
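To make the fake-quantization idea mentioned above concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes an affine (asymmetric) uint8 scheme with a per-tensor scale and zero-point, and simply rounds values through the 8-bit grid and back so that the float tensor carries the quantization error a real int8 kernel would introduce.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize x with an affine (asymmetric) scheme.

    The tensor stays in float, but its values are snapped to the grid
    an 8-bit integer kernel would see, so training 'feels' the rounding
    error -- the essence of fake quantization in QAT. (Illustrative
    sketch; scale/zero-point handling here is an assumption, not the
    paper's exact scheme.)
    """
    qmin, qmax = 0, 2 ** num_bits - 1               # uint8 range [0, 255]
    scale = (x.max() - x.min()) / (qmax - qmin)
    scale = max(scale, 1e-8)                        # guard against zero range
    zero_point = int(np.clip(round(qmin - x.min() / scale), qmin, qmax))

    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)  # quantize
    return (q - zero_point) * scale                            # dequantize

# The round-trip error below is the quantization loss QAT trains against.
w = np.random.randn(4, 4).astype(np.float32)
w_fq = fake_quantize(w)
print("max abs quantization error:", np.abs(w - w_fq).max())
```

In a QAT training loop, an operation like this would typically wrap weights and activations in the forward pass, with the rounding treated as identity in the backward pass (a straight-through estimator) so gradients still flow.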
