Abstract

Deep neural networks are machine learning models that are increasingly used in a wide variety of applications. However, their significantly high memory and computation demands often limit their deployment on embedded systems. Many recent works have addressed this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of the network or incur a significant loss in output accuracy. In this paper, we propose a novel and scalable technique with two different modes for quantizing the parameters of pre-trained neural networks. In the first mode, referred to as log_2_lead, we use a single template for the quantization of all parameters. In the second mode, denoted as ALigN, we analyze the trained parameters of each layer and adaptively adjust the quantization template to achieve even higher accuracy. Our technique largely preserves the accuracy of the parameters and does not require retraining of the networks. Moreover, it supports quantization to an arbitrary bit-size. For example, compared to a single-precision floating-point implementation, our proposed 8-bit quantization incurs only $\sim 0.2\%$ and $\sim 0.1\%$ loss in the Top-1 and Top-5 accuracies, respectively, for the VGG-16 network on the ImageNet dataset. We observe similarly minimal losses in the Top-1 and Top-5 accuracies for AlexNet and ResNet-18 with the proposed 8-bit quantization scheme. The proposed technique also provides a higher mean intersection over union for semantic segmentation when compared with state-of-the-art quantization techniques. Because the technique represents parameters as powers of 2, it eliminates the need for resource- and computation-intensive multiplier units in hardware accelerators for neural networks. We also present a design that implements the multiplication operation using bit-shifts and addition under the proposed quantization.
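To make the shift-and-add multiplication idea concrete, the sketch below shows how a product with a power-of-two-coded weight could be computed using only shifts and additions. This is a minimal illustration, not the paper's hardware design: the two-term encoding (a sign plus a short list of exponents) and the function name `shift_add_multiply` are assumptions introduced here.

```python
# Minimal sketch (assumption, not the paper's exact design): if a quantized
# weight is stored as a short list of signed power-of-two terms, e.g.
#   w ~= sign * (2**e0 + 2**e1)  with e0 > e1,
# then multiplying an activation by w reduces to bit-shifts and additions.

def shift_add_multiply(activation: int, sign: int, exponents: list[int]) -> int:
    """Multiply an integer activation by a power-of-two-coded weight
    using only shifts and additions (no multiplier unit)."""
    acc = 0
    for e in exponents:
        if e >= 0:
            acc += activation << e      # positive exponent: left shift
        else:
            acc += activation >> (-e)   # negative exponent: right shift
    return sign * acc

# Example: weight ~= +(2**-1 + 2**-3) = 0.625, activation = 32
# shift_add_multiply(32, +1, [-1, -3]) -> 16 + 4 = 20 == round(32 * 0.625)
```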

Highlights

  • Deep neural networks (DNNs) are machine learning models that have achieved promising classification accuracies on recognition problems such as image, speech, and natural language processing [1]–[3]

  • Log_2_lead Quantization Scheme: Based on our analysis, we present a novel and highly accurate quantization technique, log_2_lead (L2L), to quantize the parameters of pre-trained DNNs

  • Quantized DNNs: We present a brief overview of DNNs, followed by a description of commonly employed techniques for the quantization of pre-trained DNNs


Summary

INTRODUCTION

Deep neural networks (DNNs) are machine learning models that have achieved promising classification accuracies on recognition problems such as image, speech, and natural language processing [1]–[3]. Reduced-precision formats such as BFloat, a subset of the single-precision Float, utilize only 7 bits for storing the fraction (significand) [31]. Most existing quantization techniques represent the parameters of a trained network in low-precision fixed-point number systems by utilizing different types of quantization schemes. Log_2_lead Quantization Scheme: Based on our analysis, we present a novel and highly accurate quantization technique, log_2_lead (L2L), to quantize the parameters of pre-trained DNNs. Our technique uses a unique template to store the most significant fractional bits. ALigN Quantization Scheme: We propose an adaptive layer-wise variation of our L2L quantization scheme, referred to as ALigN, for pre-trained DNNs. In this technique, we align the available quantization bit-width according to the occurrences of the leading 1's in the trained parameters of each layer.
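As a rough illustration of the leading-one idea, the sketch below quantizes a value by keeping the position of its leading 1 exactly and retaining only a few bits immediately after it. The function name, the 4-bit fraction budget, and the layer-adaptive comment are assumptions for illustration and may differ from the paper's exact L2L and ALigN templates.

```python
import math

# Minimal sketch of leading-one based quantization (assumption, not the exact
# L2L template): the leading-1 position of |x| is kept exactly (the log2 part)
# and only `frac_bits` bits immediately after the leading 1 are retained.

def leading_one_quantize(x: float, frac_bits: int = 4) -> float:
    """Keep the leading-1 position of |x| plus `frac_bits` following bits."""
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    mag = abs(x)
    lead = math.floor(math.log2(mag))        # position of the leading 1
    step = 2.0 ** (lead - frac_bits)         # resolution after the leading 1
    return sign * math.floor(mag / step) * step

# A layer-adaptive variant in the spirit of ALigN could first histogram the
# leading-1 positions of a layer's weights and centre the retained bit range
# on the most frequent positions (assumption; the paper's rule may differ).

# Example: leading_one_quantize(0.3, frac_bits=4) -> 0.296875
# (0.3 has its leading 1 at 2**-2; 4 further bits give a step of 2**-6)
```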

RELATED WORK
OVERVIEW OF DNNs
COMMONLY USED QUANTIZATION TECHNIQUES
PROPOSED QUANTIZATION TECHNIQUE-BASED MULTIPLIER
EXPERIMENTAL SETUP AND RESULTS
Findings
CONCLUSION
