Convolutional Neural Networks (CNNs) have demonstrated high accuracy in applications such as object detection, classification, and image processing. However, convolutional layers account for the majority of the computation within CNNs. Typically, these layers are executed on GPUs, resulting in high power consumption and hindering lightweight deployment. This paper presents a design that deploys convolutional layers on FPGAs with adjustable parameters. In this FPGA deployment, a 4 × 4 3D sliding window is used to traverse the data, reducing bandwidth requirements and facilitating seamless integration with subsequent processing stages. A three-dimensional plane-buffer design is proposed to enable data reuse. Compared to feeding the feature map directly into the computation, it reduces the on-chip memory bandwidth requirement by 75%. Additionally, a new addressing strategy is introduced to map 3D feature maps to RAM addresses, eliminating addressing time. Because high-level synthesis (HLS) is resource-intensive, the convolutional layers are implemented in HDL instead. The design achieves an inference throughput of 121.36 GOPS at 16-bit precision, a 39.10-fold speedup over CPU implementations.
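The 75% figure is consistent with row reuse in a 4 × 4 sliding window: as the window advances by one row, three of its four rows are already buffered on-chip, so only one new row must be fetched. The sketch below illustrates this reuse ratio and one plausible 3D-to-linear address mapping; the channel-major layout and function names are illustrative assumptions, not the paper's exact scheme.

```python
def flat_addr(c, r, col, H, W):
    """Map a 3D feature-map coordinate (channel, row, column) to a
    linear RAM address. Channel-major layout is an assumption here;
    the paper's actual mapping may differ."""
    return (c * H + r) * W + col


def window_row_reuse(k):
    """Fraction of a k x k window's rows that are reused (already
    buffered) when the window slides down by one row."""
    return (k - 1) / k


# A 4 x 4 window reuses 3 of its 4 rows per step: a 75% reduction
# in rows fetched from memory, matching the bandwidth claim above.
print(window_row_reuse(4))  # -> 0.75
```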