Abstract

Convolutional neural networks (CNNs) have attracted significant attention for real-world artificial intelligence (AI) applications such as image classification and object detection. On the other hand, for better accuracy, the size of CNN parameters (weights) has been increasing, which in turn makes it difficult to run on-device CNN inference on resource-constrained edge devices. Although weight pruning and 5-bit quantization methods have shown promising results, it is still challenging to deploy large CNN models on edge devices. In this paper, we propose an encoding and hardware-based decoding technique that can be applied to 5-bit quantized weight data for on-device CNN inference on resource-constrained edge devices. Given 5-bit quantized weight data, we employ arithmetic coding with range scaling for lossless weight compression, which is performed offline. When executing on-device inference with the underlying CNN accelerator, our hardware decoder enables fast in-situ weight decompression with a small latency overhead. According to our evaluation results with five widely used CNN models, our arithmetic coding-based encoding method applied to 5-bit quantized weights improves the compression ratio by 9.6× while also reducing memory data transfer energy consumption by 89.2%, on average, compared to uncompressed 32-bit floating-point weights. When applying our technique to pruned weights, we obtain 57.5×–112.2× better compression ratios while reducing energy consumption by 98.3%–99.1% compared to 32-bit floating-point weights. In addition, by pipelining the weight decoding and transfer with the CNN execution, the latency overhead of our weight decoding with 16 decoding units (DUs) is only 0.16%–5.48% and 0.16%–0.91% for non-pruned and pruned weights, respectively. Moreover, our proposed technique with a 4-DU decoder reduces system-level energy consumption by 1.1%–9.3%.
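
The offline encoding step described above can be pictured with a small integer arithmetic coder that narrows an interval per symbol and renormalizes it (range scaling) so the arithmetic stays within fixed precision. The Python sketch below is an illustrative model under our own assumptions (a simple frequency-count model, 32-bit coder precision, a bit-pending flush); it is not the paper's exact encoder.

```python
# Minimal sketch of arithmetic coding with range scaling for 5-bit
# quantized weights (symbols 0..31). Illustrative only; the constants,
# frequency model, and flushing details are assumptions, not the paper's design.

PRECISION = 32
WHOLE   = 1 << PRECISION      # size of the integer coding interval
HALF    = WHOLE >> 1
QUARTER = WHOLE >> 2

def build_model(weights, alphabet=32):
    """Cumulative frequency table over the 5-bit alphabet."""
    freq = [1] * alphabet                       # +1 keeps every symbol representable
    for w in weights:
        freq[w] += 1
    cum = [0]
    for f in freq:
        cum.append(cum[-1] + f)                 # cum[s]..cum[s+1] is symbol s's slot
    return cum                                  # note: cum[-1] must stay below QUARTER

def encode(weights, cum):
    total = cum[-1]
    low, high = 0, WHOLE - 1
    pending, bits = 0, []

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)        # resolve deferred straddling bits
        pending = 0

    for s in weights:
        span = high - low + 1                   # narrow the interval to symbol s
        high = low + span * cum[s + 1] // total - 1
        low  = low + span * cum[s] // total
        while True:                             # range scaling (renormalization)
            if high < HALF:                     # interval in lower half: emit 0
                emit(0)
            elif low >= HALF:                   # interval in upper half: emit 1
                emit(1)
                low -= HALF; high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                pending += 1                    # straddles the middle: defer the bit
                low -= QUARTER; high -= QUARTER
            else:
                break
            low, high = low * 2, high * 2 + 1   # expand the interval
    pending += 1
    emit(0 if low < QUARTER else 1)             # flush: pin down the final interval
    return bits
```

As a usage example, bits = encode(q_weights, build_model(q_weights)) produces a bitstream whose length approaches the entropy of the quantized weights; the frequency table and the weight count must also be available to the decoder.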

Highlights

  • Convolutional neural networks (CNNs) have been widely deployed in many artificial intelligence (AI) applications

  • For in-situ weight decompression on edge devices that contain a convolutional neural network (CNN) accelerator or NPU, we propose a hardware decoder that can decompress the compressed weights with a small latency overhead (a behavioral sketch of this decoding step follows this list)

  • Since the main focus of our technique is resource-constrained edge devices, this small latency overhead is acceptable: the benefits from the reduced memory and storage requirements and the lower memory energy consumption far outweigh the latency overhead
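
As referenced in the second highlight, the step each decoding unit (DU) performs can be modeled in software as the mirror image of the encoder sketched after the abstract. The Python model below is a behavioral sketch under the same assumed constants and frequency table, not the hardware decoder itself; bits read past the end of the stream are treated as '0', which appears to serve the same purpose as the paper's step of appending '0' bits to the end of the encoded bitstream.

```python
# Behavioral sketch of the decoder side (what one decoding unit, DU, would do),
# mirroring the encoder sketch above. Assumed constants and frequency table,
# not the actual hardware decoder.

PRECISION = 32
WHOLE   = 1 << PRECISION
HALF    = WHOLE >> 1
QUARTER = WHOLE >> 2

def decode(bits, cum, n_weights):
    """Recover n_weights 5-bit symbols from the encoded bitstream."""
    total = cum[-1]
    low, high = 0, WHOLE - 1
    pos = 0

    def next_bit():
        nonlocal pos
        b = bits[pos] if pos < len(bits) else 0   # exhausted stream reads as '0'
        pos += 1
        return b

    value = 0
    for _ in range(PRECISION):                    # prime the code register
        value = (value << 1) | next_bit()

    out = []
    for _ in range(n_weights):
        span = high - low + 1
        scaled = ((value - low + 1) * total - 1) // span
        s = 0
        while cum[s + 1] <= scaled:               # locate the symbol's slot
            s += 1
        out.append(s)
        high = low + span * cum[s + 1] // total - 1
        low  = low + span * cum[s] // total
        while True:                               # same range scaling as encoding
            if high < HALF:
                pass
            elif low >= HALF:
                low -= HALF; high -= HALF; value -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                low -= QUARTER; high -= QUARTER; value -= QUARTER
            else:
                break
            low, high = low * 2, high * 2 + 1
            value = value * 2 + next_bit()
    return out
```

A plausible reading of the 16-DU configuration in the abstract is that several such decoders run in parallel on independent chunks of the compressed stream while decoding is pipelined with CNN execution, which is how the reported latency overhead stays small.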


Summary

INTRODUCTION

Convolutional neural networks (CNNs) have been widely deployed in many artificial intelligence (AI) applications. As a more aggressive solution, several works have proposed using 5-bit weight elements for deploying CNN models in resource-constrained systems [5], [6]. Although these works have shown successful results in reducing the weight data size, the weight size can be reduced further by applying data encoding schemes such as Huffman coding or arithmetic coding. By storing only the encoded (reduced-size) weight data in the device's memory and/or storage, CNN models can be deployed more cost-efficiently on resource-constrained devices. We therefore introduce an arithmetic coding-based 5-bit quantized weight compression technique for on-device CNN inference on resource-constrained edge devices.
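
To see why encoding can push below 5 bits per weight, one can compare the fixed 5-bit cost with the Shannon entropy of the quantized weight distribution, which for trained CNNs is typically far from uniform. The sketch below is an illustrative calculation on a synthetic, peaked weight distribution (an assumption for demonstration, not data from the paper).

```python
# Rough estimate of how far below 5 bits/weight an entropy coder
# (Huffman or arithmetic) could compress a 5-bit quantized weight tensor.
# Illustrative sketch; the weight tensor here is synthetic, not a real model.
import math
import random
from collections import Counter

random.seed(0)
# Synthetic 5-bit quantized weights with a peaked (non-uniform) distribution,
# loosely mimicking the near-zero concentration of trained CNN weights.
weights = [min(31, max(0, int(random.gauss(16, 3)))) for _ in range(100_000)]

counts = Counter(weights)
n = len(weights)
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

print("quantized size : 5.00 bits/weight")
print(f"entropy bound  : {entropy:.2f} bits/weight")
print(f"extra reduction: {5.0 / entropy:.2f}x over the 5-bit format")
```

Arithmetic coding approaches this entropy bound more closely than Huffman coding because it is not limited to an integer number of bits per symbol, which helps explain the choice of arithmetic coding over Huffman coding here.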

RELATED WORK
BACKGROUND
ENTROPY-BASED CODING
ARITHMETIC CODING-BASED WEIGHT ENCODING AND DECODING WITH RANGE SCALING
EVALUATION RESULTS
LATENCY OVERHEAD
LATENCY VERSUS RESOURCE USAGE TRADE-OFF
CONCLUSIONS