Abstract

In collaborative intelligence applications, part of a deep neural network (DNN) is deployed on a lightweight device such as a mobile phone or edge device, and the remaining portion of the DNN is processed where more computing resources are available, such as in the cloud. This paper presents a novel lightweight compression technique designed specifically to quantize and compress the features output by the intermediate layer of a split DNN, without requiring any retraining of the network weights. Mathematical models for estimating the clipping and quantization error of ReLU and leaky-ReLU activations at this intermediate layer are developed and used to compute optimal clipping ranges for coarse quantization. We also present a modified entropy-constrained design algorithm for quantizing clipped activations. When applied to popular object-detection and classification DNNs, we were able to compress the 32-bit floating-point intermediate activations down to 0.6 to 0.8 bits per element, while keeping the loss in accuracy below 1%. Compared to HEVC, the lightweight codec consistently provided better inference accuracy, by up to 1.3%. The performance and simplicity of this lightweight compression technique make it an attractive option for coding an intermediate layer of a split neural network in edge/cloud applications.
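
For intuition on how a 32-bit activation can cost well under one bit on average: after clipping and very coarse quantization, the quantized symbols are highly skewed toward a few values, so an entropy coder spends far less than one bit per element. The NumPy sketch below illustrates this rate estimate on synthetic data; the Laplacian toy distribution, the clipping range, and all function names are our own assumptions for illustration, not the paper's codec.

```python
import numpy as np

def empirical_entropy_bits(symbols):
    """Shannon entropy in bits per element of a symbol array --
    a rough proxy for the rate an entropy coder could achieve."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def clip_quantize_indices(x, c_min, c_max, levels):
    """Clip to [c_min, c_max], then uniform scalar quantization;
    returns the integer bin index of each element."""
    step = (c_max - c_min) / (levels - 1)
    xc = np.clip(x, c_min, c_max)
    return np.round((xc - c_min) / step).astype(np.int64)

# Toy stand-in for intermediate feature values: heavily skewed
# toward small magnitudes, as DNN activations tend to be.
rng = np.random.default_rng(0)
x = rng.laplace(loc=0.0, scale=0.5, size=100_000)

# Even on a 1-bit (2-level) grid, the skewed symbol frequencies
# push the entropy -- and hence the achievable rate -- below 1 bit.
q = clip_quantize_indices(x, c_min=-0.1, c_max=1.5, levels=2)
print(f"rate estimate: {empirical_entropy_bits(q):.2f} bits/element")
```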

Highlights

  • With the increasing ubiquity of intelligent applications in our daily lives, machine learning and artificial neural networks are rapidly finding their way into a wide range of systems and devices, from large-scale cloud computing systems all the way down to handheld and even miniature implanted devices.

  • A subset of the deep neural network's (DNN's) layers is computed on the edge device, and the output of the last on-device layer is signaled to the cloud, to be used as input to the remaining layers of the DNN.

  • The purpose of this paper is to extend the work of [10] by developing a mathematical model of the feature tensors output by a leaky rectified linear unit (ReLU) activation function whose input is asymmetric; using this model to estimate the clipping and quantization error of the activations; determining how these error estimates behave under extremely coarse quantization; and applying these models to determine optimal clipping ranges for quantization (an empirical sketch of this clipping trade-off follows the list).

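The paper derives analytic error models to choose the clipping range; as a purely empirical stand-in for that derivation, the sketch below sweeps the upper clipping point of a toy leaky-ReLU-like distribution and picks the value minimizing the combined clipping-plus-quantization mean squared error. The synthetic data, the 2-bit grid, and all names are assumptions for illustration.

```python
import numpy as np

def clip_quantize(x, c_min, c_max, levels):
    """Clip to [c_min, c_max], then reconstruct from a uniform
    scalar quantizer with the given number of levels."""
    step = (c_max - c_min) / (levels - 1)
    xc = np.clip(x, c_min, c_max)
    return c_min + np.round((xc - c_min) / step) * step

# Toy leaky-ReLU-like activations (negative slope 0.1 applied to
# a zero-mean Laplacian input); not the paper's analytic model.
rng = np.random.default_rng(1)
z = rng.laplace(scale=1.0, size=200_000)
x = np.where(z > 0, z, 0.1 * z)

# Sweep the upper clipping point: a tight range shrinks the
# quantization step (less quantization error) but cuts off more
# of the tail (more clipping error); the optimum balances both.
candidates = np.linspace(0.5, 6.0, 56)
mse = [np.mean((x - clip_quantize(x, x.min(), c, levels=4)) ** 2)
       for c in candidates]
best = candidates[int(np.argmin(mse))]
print(f"empirically optimal c_max ~ {best:.2f}")
```

In the paper this trade-off is resolved in closed form from the activation model rather than by a numerical sweep as above.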

Introduction

With the increasing ubiquity of intelligent applications in our daily lives, machine learning and artificial neural networks are rapidly finding their way into a wide range of systems and devices, from large-scale cloud computing systems all the way down to handheld and even miniature implanted devices. For devices located at the edge of a network, lightweight and mobile-friendly architectures [1]–[3] can facilitate the implementation of these DNNs. When a DNN that performs real-time inference or other compute-intensive operations is too complex to realize fully on an edge device, a collaborative intelligence [4], [5] paradigm can be used to split the DNN so that the bulk of the computations can be performed in the cloud. In this case, a subset of the DNN's layers is computed on the edge device, and the output of the last on-device layer is signaled to the cloud, to be used as input to the remaining layers of the DNN. The ideal location to split the DNN can be determined by examining both the available computational resources on the edge device and the size of the data, such as feature tensors, that must be signaled [4].
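
A minimal PyTorch sketch of such a split is shown below; the toy architecture, the split point, and the variable names are illustrative assumptions rather than the networks studied in the paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for a DNN split between an edge device and the cloud.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # <- split after here
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
edge_part, cloud_part = backbone[:4], backbone[4:]

image = torch.randn(1, 3, 224, 224)
features = edge_part(image)      # computed on the edge device
# ... here the feature tensor would be clipped, quantized, and
# entropy-coded for transmission, then reconstructed in the cloud ...
logits = cloud_part(features)    # computed in the cloud
print(features.shape, logits.shape)
```

In the collaborative-intelligence setting, the `features` tensor at the split point is exactly what the lightweight codec clips, quantizes, and entropy-codes before transmission.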
