Abstract

In speech and video applications, consecutive inputs exhibit a high degree of similarity, so some results from previous executions can be reused. Quantization efficiently increases the similarity of consecutive inputs. However, with quantization, the fewer the quantization bits, the higher the similarity, since the inputs are constrained to a smaller set of values, but also the larger the accuracy loss, since quantization error grows. We observe that existing reuse schemes apply a single quantization bit-width across the entire network. If the bit-width is too large, similarity between consecutive inputs drops directly, which reduces the reuse ratio. It is therefore important to exploit the tradeoff among quantization bit-width, reuse rate, and accuracy. There is an opportunity to significantly improve the performance and efficiency of DNN execution by using multiple quantization bit-widths simultaneously, guided by neuron criticality analysis. To this end, we propose Mquans, a novel reuse scheme based on neuron criticality analysis that incurs no accuracy loss. Evaluation results show that our design achieves a 2.7x speedup and 38% energy saving on average over the baseline.
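To make the reuse idea concrete, the following is a minimal sketch (not the paper's actual Mquans implementation; all names are hypothetical) of quantization-based computation reuse in a single fully connected layer: inputs are uniformly quantized to a chosen bit-width, and a neuron's contribution is recomputed only for input elements whose quantized value changed since the previous execution.

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantization to the given bit-width over [-1, 1)."""
    levels = 1 << bits
    step = 2.0 / levels
    return np.round(np.clip(x, -1.0, 1.0 - step) / step) * step

class ReuseLayer:
    """Caches the previous quantized input and output; on the next input,
    updates only the contributions of elements whose quantized value changed."""
    def __init__(self, weights, bits):
        self.w = weights          # shape (n_out, n_in), hypothetical layer weights
        self.bits = bits          # per-layer bit-width (smaller -> more reuse)
        self.prev_qx = None
        self.prev_out = None

    def forward(self, x):
        qx = quantize(x, self.bits)
        if self.prev_qx is None:
            out = self.w @ qx     # first execution: full compute
        else:
            changed = qx != self.prev_qx
            # incremental update: only columns of w for changed inputs are touched
            delta = qx[changed] - self.prev_qx[changed]
            out = self.prev_out + self.w[:, changed] @ delta
        self.prev_qx, self.prev_out = qx, out
        return out
```

With a small bit-width, a slightly perturbed input quantizes to mostly the same values, so only a few columns of the weight matrix are revisited; this is the similarity/accuracy tradeoff the abstract describes, and Mquans additionally varies `bits` per neuron according to criticality.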
