Abstract

Convolutional neural networks (CNNs) deliver state-of-the-art performance on computer-vision tasks. Many scenarios, such as edge environments, require high-speed, low-power, and high-accuracy hardware for CNN inference. However, the number of weights is so large that embedded systems cannot store them in their limited on-chip memory. An alternative approach shrinks the input image to achieve real-time processing, but this causes a considerable drop in accuracy. Although pruned sparse CNNs and dedicated accelerators have been proposed, the random access they require incurs a large number of wide multiplexers for a high degree of parallelism, which complicates the design and makes it unsuitable for FPGA implementation. To address this problem, we propose filter-wise pruning with distillation and a block-RAM (BRAM)-based zero-weight-skipping accelerator. The pruning eliminates weights such that each filter retains the same number of nonzero weights and then retrains the network with distillation, preserving comparable accuracy. Furthermore, filter-wise pruning enables the accelerator to exploit inter-filter parallelism, in which a processing block for a layer executes its filters concurrently, with a straightforward architecture. We also propose an overlapped tiling algorithm, in which tiles are extracted with overlap to prevent both accuracy degradation and high utilization of the BRAMs that store high-resolution images. Our evaluation on semantic-segmentation tasks showed that our FPGA design achieves a 1.8-fold speedup and 18.0-fold higher power efficiency than a desktop GPU. Compared with a conventional FPGA implementation, the speedup and accuracy improvement were 1.09-fold and 6.6 points, respectively. Therefore, our approach is well suited to FPGA implementation and delivers considerable accuracy for embedded-system applications.
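
As a rough illustration of the filter-wise pruning described above, the sketch below keeps only the largest-magnitude weights in each convolutional filter so that every filter ends up with the same number of nonzero weights. The function name, the NumPy representation, and the choice of how many weights to keep are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def filter_wise_prune(weights, nonzeros_per_filter):
        # weights: (num_filters, in_channels, k_h, k_w); every filter keeps
        # exactly `nonzeros_per_filter` largest-magnitude weights and the
        # rest are set to zero (illustrative sketch, not the authors' code).
        pruned = np.zeros_like(weights)
        for f in range(weights.shape[0]):
            flat = weights[f].reshape(-1)
            keep = np.argsort(np.abs(flat))[-nonzeros_per_filter:]
            mask = np.zeros_like(flat)
            mask[keep] = 1.0
            pruned[f] = (flat * mask).reshape(weights[f].shape)
        return pruned

    # Example: 32 filters of shape 16x3x3, each keeping 16 nonzero weights
    w = np.random.randn(32, 16, 3, 3).astype(np.float32)
    w_sparse = filter_wise_prune(w, nonzeros_per_filter=16)

Because every filter retains the same nonzero count, a hardware processing block can assign one filter per processing element without load imbalance, which is what makes the inter-filter parallelism mentioned above straightforward to implement.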

Highlights

  • Pruning [13] is a compression technique that eliminates unnecessary weights below a threshold, converting dense weight matrices to unstructured sparse matrices

  • Convolutional neural networks (CNNs) [27] deliver state-of-the-art performance in computer-vision tasks such as object classification [25], object detection [30], and semantic segmentation [41]

  • We propose an overlapped tiling algorithm to reduce the utilization of on-chip memory on FPGAs for high-resolution images (Section 6)
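
As a rough sketch of the overlapped tiling idea in the last highlight, the code below splits a high-resolution image into fixed-size tiles whose borders overlap by a configurable margin, so each tile carries the surrounding context needed to avoid accuracy loss at tile boundaries while only one tile at a time must reside in on-chip memory. The function name, tile size, and overlap value are illustrative assumptions.

    import numpy as np

    def overlapped_tiles(image, tile_size, overlap):
        # Extract tile_size x tile_size tiles that share `overlap` pixels
        # with their neighbours (illustrative sketch; border tiles may be
        # smaller than tile_size in this simplified version).
        h, w = image.shape[:2]
        stride = tile_size - overlap
        for y in range(0, max(h - overlap, 1), stride):
            for x in range(0, max(w - overlap, 1), stride):
                yield image[y:y + tile_size, x:x + tile_size]

    # Example: a 1024x2048 image split into 512x512 tiles with 32-pixel overlap
    img = np.zeros((1024, 2048, 3), dtype=np.uint8)
    tiles = list(overlapped_tiles(img, tile_size=512, overlap=32))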

Summary

Introduction

Pruning [13] is a compression technique that eliminates unnecessary weights below a threshold, converting dense weight matrices to unstructured sparse matrices. This approach can lead to more than a 10-fold reduction in the number of parameters with comparable accuracy [13]. This study proposes a new algorithm/hardware co-design approach: filter-wise pruning with distillation together with a dedicated inter-layer pipelined accelerator for FPGA implementation. We apply our filter-wise pruning with distillation to a lightweight MobileNetV1-based network model and compare it with the state-of-the-art FPGA implementation. The previous FPGA-based accelerator presented at ARC 2019 is extended to use inter-filter parallelism (Section 5.1).
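
The retraining-with-distillation step referred to above can be sketched, under the usual knowledge-distillation formulation, as the pruned network (student) being retrained against the original dense network (teacher). The loss below, the temperature T, and the weighting alpha are generic assumptions rather than the paper's specific distillation scheme.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Hard-label term: ordinary cross-entropy against the ground truth.
        hard = F.cross_entropy(student_logits, labels)
        # Soft-label term: KL divergence between the softened teacher and
        # student output distributions, scaled by T^2 as usual.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        return alpha * hard + (1.0 - alpha) * soft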

Unstructured Nonzero Weight Matrices
Convolutional Neural Networks
Separable CONV
Sparse CONV
Batch Normalization Folding
Semantic Segmentation
Filter-Wise Pruning with Distillation
Distillation Scheme for Retraining Weights
Hardware Implementation
Convolutional Block
Overlapped Tiling Algorithm
Experimental Results
MobileNetV1-Based PSPNet
Accuracy Comparison for Sparseness Ratio and Quantization
Comparison with a Desktop GPU
Comparison with Other FPGA Implementation
Comparison with Other Pruning Method
Sparseness Approach for Weight Memory Reduction
FPGA Implementation for CNN-Based Semantic Segmentation
Sparse Convolutional Network Architecture
Zero-Weight Skipping Architecture
Zero-Weight and -Activation Skipping Architecture
Conclusion