Abstract

Large-scale deep neural networks (DNNs) are both compute- and memory-intensive. As the size of DNNs continues to grow, it is critical to improve their energy efficiency and performance while maintaining accuracy. For DNNs, model size is an essential factor affecting performance, scalability, and energy efficiency. Non-structured weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which hurts performance and throughput; 2) increased training complexity; and 3) the lack of a rigorous guarantee on compression ratio and inference accuracy. To overcome these limitations, this work presents CIRCNN, a principled approach that represents weights and processes neural networks using block circulant matrices, a structured representation. CIRCNN uses Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (in both inference and training) from O(n²) to O(n log n) and the storage complexity from O(n²) to O(n), with negligible accuracy loss. Compared to other approaches, CIRCNN is distinguished by its mathematical rigor: DNNs based on CIRCNN can converge to the same "effectiveness" as uncompressed DNNs. We present the CIRCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with a configurable network architecture (layer type, size, scales, etc.). In the CIRCNN architecture, the recursive property of FFT allows it to serve as the central computing kernel, which ensures universal, small-footprint implementations. The compressed but regular network structure avoids the pitfalls of network pruning and facilitates high performance and throughput with a highly pipelined and parallel design. To demonstrate performance and energy efficiency, we test CIRCNN on field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and embedded processors. Our results show that the CIRCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on FPGA implementation and ASIC synthesis results, CIRCNN achieves a 6-102× improvement in energy efficiency compared with state-of-the-art results.

To further accelerate computation and facilitate hardware implementation, we present CIRCNN-DR, a block circulant matrix-based DNN architecture with decoupled and reconfigurable computing modules. CIRCNN-DR advances the original CIRCNN architecture with three improvements. First, exploiting the linearity of FFT, the FFT and inverse FFT (IFFT) are decoupled: IFFTs are performed only after accumulating the element-wise multiplication results (i.e., FFT(w_ij) ∘ FFT(x_j)). FFT-IFFT decoupling reduces the amount of computation and, more importantly, allows the same unified computing modules to be reconfigured to perform either FFTs/IFFTs or multiply-accumulate operations (MACs) in three consecutive phases (i.e., FFT-MAC-IFFT), which is the key to implementation with limited hardware resources. Second, CIRCNN-DR leverages batch processing to reduce pipeline bubbles and timing overhead in the deep pipeline. Third, we present an algorithm-hardware co-optimization framework. CIRCNN-DR is realized with end-to-end FPGA implementations and ASIC synthesis. At the same accuracy level, our FPGA implementation achieves 31× and 70× energy efficiency improvements compared to state-of-the-art FPGA implementations and the IBM TrueNorth processor, respectively.
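To make the block circulant computation concrete, the following is a minimal NumPy sketch of the decoupled FFT-MAC-IFFT scheme described above: each k × k circulant block W_ij is stored as a single length-k defining vector w_ij, the products FFT(w_ij) ∘ FFT(x_j) are accumulated in the frequency domain, and a single IFFT per output block recovers the result. The array layout and the first-column block convention are illustrative assumptions, not the paper's hardware design.

import numpy as np

def block_circulant_matvec(w, x):
    """Block circulant matrix-vector product via FFT.

    w : array of shape (p, q, k); w[i, j] is the defining vector
        (first column, by one common convention) of the k x k
        circulant block W_ij. Storage is O(pqk) instead of O(pqk^2).
    x : array of shape (q * k,); the input, split into q chunks of k.

    Computes y_i = IFFT( sum_j FFT(w_ij) * FFT(x_j) ): one IFFT per
    output block rather than one per block pair (FFT-IFFT decoupling).
    """
    p, q, k = w.shape
    xf = np.fft.fft(x.reshape(q, k), axis=1)    # FFT phase: one FFT per input chunk
    wf = np.fft.fft(w, axis=2)                  # block spectra (could be precomputed offline)
    acc = (wf * xf[None, :, :]).sum(axis=1)     # MAC phase: element-wise multiply, accumulate
    y = np.fft.ifft(acc, axis=1).real           # IFFT phase: one IFFT per output chunk
    return y.reshape(p * k)

For a single block (p = q = 1) this reduces to the classical identity Cx = IFFT(FFT(c) ∘ FFT(x)) for a circulant matrix C with defining vector c, which gives an easy way to sanity-check the sketch against a dense matrix-vector product.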
Compared with CIRCNN, the CIRCNN-DR framework achieves at least a 50% improvement in performance and energy efficiency. For ASIC implementations, CIRCNN-DR exhibits significant advantages in power, throughput, and energy efficiency. Experimental results indicate that the method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.

Next, we present design automation on top of the block circulant matrix-based weight representation: a design automation framework and implementation optimizations for real-time sound recognition. We adopt the block circulant matrix method from our prior work and overcome its limitations by developing an automated high-level synthesis (HLS) framework that translates high-level languages such as C/C++ into the hardware description language Verilog, handles complicated data dependencies and skewed computations, and accelerates the development cycle. To achieve real-time, highly efficient implementations on a single FPGA, we present hardware optimization techniques at the processing element (PE) level. Overall, compared with the state-of-the-art LSTM implementation, the C-LSTM designs generated by our framework achieve up to 18.8× and 33.5× gains in performance and energy efficiency, respectively, with small accuracy degradation. Experimental results show that our design automation and optimization techniques can significantly compress LSTM models while introducing negligible accuracy degradation.

To further bridge the gap between algorithms and hardware platforms, we present REQ, a resource-aware, efficient weight quantization framework for DNNs that maximally exploits both software- and hardware-level optimization opportunities on FPGAs. We present a heterogeneous weight quantization method, including both equal-distance and mixed powers-of-two schemes, that takes hardware resources into account. Compared to Titan X-YOLO, our GPU implementation achieves up to a 2.9× improvement in energy efficiency. Compared to the most energy-efficient GPU-based YOLO implementation (our TX2-YOLO), our two FPGA YOLO implementations achieve 3.5× and 5.8× improvements in energy efficiency.
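As an illustration of the two quantizer styles named above, here is a minimal NumPy sketch of equal-distance and powers-of-two weight quantization. The function names, the exponent clipping range, and the rounding choices are illustrative assumptions; REQ's resource-aware selection of per-layer quantizers is not shown.

import numpy as np

def quantize_equal_distance(w, bits):
    # Uniform (equal-distance) quantization: 2**bits evenly spaced
    # levels spanning the weight range; each weight snaps to the
    # nearest level.
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / (2 ** bits - 1) or 1.0   # guard against constant w
    return lo + np.round((w - lo) / step) * step

def quantize_powers_of_two(w, bits):
    # Powers-of-two quantization: each nonzero weight becomes
    # sign(w) * 2**e with integer e, so a hardware multiply reduces
    # to a bit shift; 'bits' bounds the exponent range.
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.round(np.log2(np.maximum(mag, 1e-12)))  # nearest exponent (log domain)
    exp = np.clip(exp, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return np.where(mag > 0, sign * np.exp2(exp), 0.0)

One common form of the mixed powers-of-two scheme represents a weight as a sum of two such power-of-two terms (two shifts and an add in hardware), trading a small amount of logic for finer resolution; the choice between schemes is made against the available FPGA resources.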
