Abstract

Designing hardware accelerators for convolutional neural networks (CNNs) has recently attracted tremendous attention. Many existing accelerators are built for dense CNNs or structured sparse CNNs. By contrast, unstructured sparse CNNs can achieve a higher compression ratio with equivalent accuracy. However, their hardware implementations generally suffer from load imbalance and conflicting accesses to on-chip buffers, which results in underutilization of processing elements (PEs). To tackle these issues, we propose a power-efficient and highly flexible hardware architecture that supports both unstructured and structured sparse CNNs with various configurations. Firstly, we propose an efficient weight reordering algorithm that preprocesses compressed weights to balance the workload across PEs. Secondly, an adaptive on-chip dataflow, namely the hybrid parallel (HP) dataflow, is introduced to promote weight reuse. Thirdly, the partial fusion scheme, first introduced in one of our prior works, is incorporated as the off-chip dataflow. Benefiting from these dataflow optimizations, repetitive data exchanges between on-chip buffers and external memory are significantly reduced. We implement the design on the Intel Arria10 SX660 platform and evaluate it with MobileNet-v2, ResNet-50, and ResNet-18 on the ImageNet dataset. Compared to existing sparse accelerators on FPGAs, the proposed accelerator achieves a $1.35\sim 1.81\times$ improvement in power efficiency at the same sparsity. Compared to prior dense accelerators, it achieves a $1.92\sim 5.84\times$ improvement in DSP efficiency.
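The abstract does not detail the weight reordering algorithm, but the load-balancing goal it describes can be illustrated with a minimal sketch: assign the rows of a compressed (unstructured sparse) weight matrix to PEs so that each PE handles roughly the same number of nonzero weights. The greedy longest-processing-time heuristic below is an assumption for illustration only, not the paper's actual algorithm, and all names (`reorder_rows_for_pes`, `rows`) are hypothetical.

```python
import heapq

def reorder_rows_for_pes(rows, num_pes):
    """Greedily assign sparse weight rows to PEs so that per-PE
    nonzero counts are balanced (longest-processing-time first).

    rows    : list of rows, each a list of nonzero weights
    num_pes : number of processing elements
    returns : list of row-index lists, one per PE
    """
    # Process rows from most to fewest nonzeros; always give the next
    # row to the currently least-loaded PE (tracked with a min-heap).
    order = sorted(range(len(rows)), key=lambda i: -len(rows[i]))
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pes)]
    for i in order:
        load, pe = heapq.heappop(heap)
        assignment[pe].append(i)
        heapq.heappush(heap, (load + len(rows[i]), pe))
    return assignment

# Toy example: six rows with nonzero counts 7, 2, 5, 3, 4, 1
rows = [[1] * n for n in (7, 2, 5, 3, 4, 1)]
assignment = reorder_rows_for_pes(rows, 2)
```

With two PEs, the 22 nonzeros split into two groups of 11, so neither PE idles while the other drains its queue; a real implementation would also need to reorder the matching activations and restore output order.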
