Abstract

Deep Neural Networks (DNNs) have achieved remarkable success in a variety of Artificial Intelligence applications. Quantization is a critical step in DNN compression and acceleration for deployment. To further boost DNN execution efficiency, many works leverage input-dependent redundancy by applying dynamic quantization to different regions. However, the sensitive regions in the feature map are irregularly distributed, which limits the real speedup achievable on existing accelerators. To this end, we propose an algorithm-architecture co-design, named Structured Dynamic Precision (SDP). Specifically, we propose a quantization scheme in which the high-order and low-order bit parts of each value can be masked independently, and a fixed number of term parts are dynamically selected for computation based on the importance of each term within its group. We also present a hardware design that executes the algorithm efficiently with small overhead, whose inference time scales roughly in proportion to the precision. Evaluation on a broad set of networks demonstrates that, compared to the state-of-the-art dynamic quantization accelerator DRQ, SDP achieves a 29% performance gain and a 51% energy reduction at the same level of model accuracy.
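
To make the term-masking idea concrete, below is a minimal NumPy sketch. The term width (2 bits), group size (4 values), per-group budget, and the magnitude-based importance metric are illustrative assumptions, not the paper's exact SDP configuration; the helper names `split_terms` and `sdp_mask` are hypothetical.

```python
import numpy as np

# Minimal sketch of structured dynamic precision term masking.
# Assumptions (illustrative, not the paper's exact configuration):
#   - unsigned 8-bit values split into four 2-bit "terms" (bit parts)
#   - a group of 4 values shares one selection budget of KEPT_TERMS terms
#   - term importance is approximated by its numeric contribution

TERM_BITS = 2                                    # width of each bit part ("term")
NUM_TERMS = 4                                    # 8-bit value -> four 2-bit terms
GROUP_SIZE = 4                                   # values sharing one term budget
KEPT_TERMS = 4                                   # fixed number of terms kept per group
SHIFTS = np.arange(NUM_TERMS - 1, -1, -1) * TERM_BITS   # [6, 4, 2, 0]

def split_terms(x):
    """Decompose unsigned 8-bit values into NUM_TERMS terms, high order first."""
    return (x[..., None].astype(np.int64) >> SHIFTS) & ((1 << TERM_BITS) - 1)

def sdp_mask(group):
    """Keep the KEPT_TERMS most important terms in a group; zero the rest."""
    terms = split_terms(group)                   # shape (GROUP_SIZE, NUM_TERMS)
    contribution = terms << SHIFTS               # per-term numeric magnitude
    keep = np.zeros(terms.size, dtype=bool)
    keep[np.argsort(contribution.ravel())[-KEPT_TERMS:]] = True   # top-k terms
    masked = np.where(keep.reshape(terms.shape), terms, 0)        # mask the rest
    return (masked << SHIFTS).sum(axis=-1).astype(np.uint8)      # reassemble

group = np.array([200, 3, 77, 130], dtype=np.uint8)
print(sdp_mask(group))                           # [192   0  76 128]: lossy approximation
```

Because every group keeps exactly the same fixed number of terms, the work per group is regular and predictable, which is what allows the accelerator's runtime to scale with the selected precision rather than with the irregular distribution of sensitive values.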
