Abstract

Convolutional neural networks (CNNs) have achieved great success in numerous AI applications. To improve the inference efficiency of CNNs, researchers have proposed various pruning techniques that reduce both computation intensity and storage overhead. These pruning techniques introduce multi-level sparsity irregularities into CNNs. Together with the sparsity in activation matrices induced by the ReLU activation function, these irregularities cause serious under-utilization of computation resources in sparse CNN accelerators. To mitigate this problem, we propose a load-balancing method based on a workload stealing technique. We demonstrate that this method can be applied to the two major inference data-flows, which cover all state-of-the-art sparse CNN accelerators. Based on this method, we present an accelerator, called Crane, which addresses all kinds of sparsity irregularities in CNNs. We perform a fair comparison between Crane and state-of-the-art prior approaches. Experimental results show that Crane improves performance by $27\%\sim 88\%$ and reduces energy consumption by $16\%\sim 48\%$, respectively, compared to the counterparts.
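
To give intuition for the workload-stealing idea mentioned above, the following is a minimal software sketch, not the Crane hardware design: it assumes a hypothetical set of processing elements (PEs), each with its own queue of sparse-tile workloads whose per-tile costs are uneven due to sparsity irregularity, and lets an idle PE steal a tile from the most loaded queue.

    # Illustrative sketch only: a software analogy of work stealing among
    # processing elements (PEs). The PE count, queue structure, and tile
    # granularity are hypothetical and not taken from the paper.
    from collections import deque
    from typing import List

    def run_with_stealing(workloads: List[List[int]], num_pes: int = 4) -> List[int]:
        """Return busy cycles per PE when idle PEs steal tiles from busy ones.

        workloads[i] holds per-tile costs (e.g., nonzero counts) initially
        assigned to PE i; sparsity irregularity makes these lengths uneven.
        """
        queues = [deque(w) for w in workloads]
        busy_cycles = [0] * num_pes
        remaining = sum(len(q) for q in queues)

        while remaining:
            for pe in range(num_pes):
                if not queues[pe]:
                    # Idle PE: steal one tile from the most loaded queue, if any.
                    victim = max(range(num_pes), key=lambda v: len(queues[v]))
                    if queues[victim]:
                        queues[pe].append(queues[victim].pop())
                if queues[pe]:
                    busy_cycles[pe] += queues[pe].popleft()
                    remaining -= 1
        return busy_cycles

    if __name__ == "__main__":
        # Uneven sparse workloads: without stealing, PE 0 would dominate runtime.
        print(run_with_stealing([[5, 7, 6, 8], [2], [1, 1], [3]], num_pes=4))

In this toy setting, stealing evens out the busy cycles across the four PEs instead of leaving three of them idle while PE 0 finishes its long queue, which is the under-utilization problem the abstract describes.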
