Abstract

As deep learning has succeeded in a broad range of applications in recent years, there is an increasing trend towards deploying deep neural networks (DNNs) on edge devices such as FPGAs and mobile phones. However, there remains a significant gap between the extraordinary accuracy of state-of-the-art DNNs and efficient implementations on these low-power, resource-constrained devices, due to the high computation and memory intensity of DNNs. To accelerate inference while maintaining accuracy, this dissertation investigates efficient DNN implementations through algorithm/hardware co-design frameworks that combine hardware-friendly DNN compression algorithms with hardware design optimizations.

First, DNN compression algorithms based on quantization and pruning are explored. For quantization, intra-layer mixed-precision/mixed-scheme weight quantization is proposed to boost the utilization of heterogeneous FPGA resources and thereby improve computation throughput: multiple precisions and/or schemes are assigned at the filter level within each layer, while the same ratio of filters is maintained across all layers for each type of quantization assignment. For weight pruning, novel structured and fine-grained sparsity schemes are proposed, obtained with a reweighted regularization pruning algorithm, and then incorporated into FPGA acceleration frameworks so that the speedup of sparse models approaches the pruning ratio in the number of operations.

Second, hardware implementations are studied, and an automatic DNN acceleration framework is proposed to generate accelerators that satisfy a target frame rate (FPS). Unlike previous approaches that start from model compression and then optimize the FPS of the hardware implementation, this framework estimates the FPS with FPGA resource utilization analysis and performance analysis modules, and reduces the model quantization precisions until the target FPS is met. The mixing ratio between different quantization precisions is determined automatically to guide both the quantization and the hardware accelerator implementation. A novel DNN computing engine is designed with various optimization techniques that support DNN compression, improving computation parallelism and resource utilization efficiency. Resource allocation analysis is performed by developing a resource utilization model that overcomes the difficulty of estimating LUT consumption.
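The intra-layer mixed-precision/mixed-scheme idea can be pictured with a small sketch. The Python below is an illustration only, not the dissertation's implementation: the 4-bit width, the 50/50 mixing ratio, and the range-based filter selection rule are assumptions. The point it shows is that every layer quantizes the same fraction of its filters with a fixed-point (DSP-friendly) scheme and the remaining filters with a power-of-two (shift/LUT-friendly) scheme.

```python
import numpy as np

# Illustrative sketch (assumed details, not the dissertation's code):
# filter-level mixed-scheme weight quantization with the same mixing ratio
# in every layer, so heterogeneous FPGA resources see a uniform workload mix.

def quantize_fixed_point(w, bits=4):
    """Uniform (fixed-point) quantization of one filter tensor."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def quantize_power_of_two(w, bits=4):
    """Power-of-two quantization: weights snap to signed powers of two."""
    sign = np.sign(w)
    mag = np.abs(w) + 1e-12
    exp = np.clip(np.round(np.log2(mag)), -(2 ** (bits - 1)), 0)
    return sign * (2.0 ** exp)

def quantize_layer(weights, pot_ratio=0.5, bits=4):
    """weights: (num_filters, ...) array; pot_ratio is kept identical
    across all layers (assumed selection rule: smallest dynamic range
    goes to the power-of-two scheme, the rest stay fixed-point)."""
    num_filters = weights.shape[0]
    num_pot = int(round(pot_ratio * num_filters))
    order = np.argsort(np.ptp(weights.reshape(num_filters, -1), axis=1))
    pot_idx = set(order[:num_pot])
    quantized = np.empty_like(weights)
    for f in range(num_filters):
        q = quantize_power_of_two if f in pot_idx else quantize_fixed_point
        quantized[f] = q(weights[f], bits=bits)
    return quantized

# Usage: a toy layer with 8 filters of shape 16x3x3
layer_w = np.random.randn(8, 16, 3, 3).astype(np.float32)
layer_q = quantize_layer(layer_w, pot_ratio=0.5, bits=4)
```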

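The reweighted regularization pruning mentioned in the abstract follows the general reweighted-l1 idea. The sketch below is a hypothetical illustration under my own assumptions (the block size, epsilon, and the group_norms/update_alpha helpers are not from the dissertation): each group of weights carries a penalty coefficient that is periodically reset to the inverse of the group's current norm, so already-small groups are pushed harder toward zero while important groups are penalized less, yielding the structured/fine-grained sparsity patterns the hardware expects.

```python
import torch

# Illustrative sketch (assumed details): reweighted group-regularization
# pruning on a 2-D weight matrix partitioned into block x block tiles.

def group_norms(weight, block=4):
    """Split a 2-D weight into (block x block) tiles and return their L2 norms."""
    rows, cols = weight.shape
    tiles = weight[: rows - rows % block, : cols - cols % block]
    tiles = tiles.reshape(rows // block, block, cols // block, block)
    return tiles.permute(0, 2, 1, 3).reshape(-1, block * block).norm(dim=1)

def reweighted_penalty(weight, alpha, block=4):
    """Penalty term added to the task loss: sum_g alpha_g * ||W_g||_2."""
    return (alpha * group_norms(weight, block)).sum()

def update_alpha(weight, block=4, eps=1e-3):
    """Reweighting step: alpha_g <- 1 / (||W_g||_2 + eps)."""
    with torch.no_grad():
        return 1.0 / (group_norms(weight, block) + eps)

# Inside a training loop (sketch):
#   loss = task_loss + lam * reweighted_penalty(w, alpha)
# every few epochs:
#   alpha = update_alpha(w)
# after training, zero out tiles whose norm falls below a threshold.
```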