Efficient Hardware Research Articles

In recent years, convolutional neural networks (CNNs) have achieved significant advances in various fields. However, the computation and storage overheads of CNNs are overwhelming for Internet-of-Things devices. Both network pruning algorithms and hardware accelerators have been introduced to enable CNN inference at the edge. Network pruning algorithms reduce the size and computational cost of CNNs by regularizing unimportant weights to zero. However, existing works lack intrakernel structured types to trade off sparsity against hardware efficiency, and the index storage for irregularly pruned networks is significant. Hardware accelerators leverage the sparsity of pruned CNNs to improve energy efficiency. However, their processing element (PE) utilization rate is low because of uneven sparsity among input convolutional kernels. To overcome these problems, we propose PACA: a Pattern pruning Algorithm and Channel-fused high PE utilization Accelerator for CNNs. It includes three parts: a pattern pruning algorithm to explore the intrakernel sparsity type and reduce the index storage, a channel-fused hardware architecture to reduce the PEs’ idle rate and improve performance, and a heuristic and taboo search-based smart fusion scheduler to analyze the idle PE problem and schedule the channel fusion in hardware. To demonstrate the effectiveness of PACA, we have implemented the software parts in Python and the hardware architecture in RTL. Experimental results on various datasets show that, compared with an existing work, PACA can reduce the index storage overhead by 3.47×–5.63× with 3.85–9.12 average patterns, and it can improve the hardware performance by 2.01×–5.53× because of the reduction in the PEs’ idle rate.
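To make the idea of pattern-based pruning and its effect on index storage more concrete, here is a minimal Python sketch. It is a generic illustration and not the PACA algorithm itself: the pattern library, the kernel values, the selection rule (keep the pattern that retains the most weight magnitude), and all function names are assumptions made for this example.

```python
import math

# Hypothetical pattern library (an assumption for illustration): each pattern
# marks which of the 9 positions of a flattened 3x3 kernel are kept (1) or pruned (0).
PATTERNS = [
    (1, 1, 0, 1, 1, 0, 0, 0, 0),
    (0, 1, 1, 0, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 0, 1, 1, 0),
    (0, 0, 0, 0, 1, 1, 0, 1, 1),
]


def best_pattern(kernel, patterns=PATTERNS):
    """Return the index of the pattern that retains the most weight magnitude."""
    def retained(pattern):
        return sum(abs(w) for w, keep in zip(kernel, pattern) if keep)
    return max(range(len(patterns)), key=lambda i: retained(patterns[i]))


def apply_pattern(kernel, pattern):
    """Zero out every weight whose position is not kept by the pattern."""
    return [w if keep else 0.0 for w, keep in zip(kernel, pattern)]


def index_bits_per_kernel(num_patterns):
    """Bits needed to store one pattern ID per kernel."""
    return max(1, math.ceil(math.log2(num_patterns)))


# Made-up 3x3 kernel, flattened row by row.
kernel = [0.1, 0.8, 0.6, 0.0, 0.9, 0.7, 0.2, 0.1, 0.0]
pid = best_pattern(kernel)
pruned = apply_pattern(kernel, PATTERNS[pid])
print(pid, pruned)
print(index_bits_per_kernel(len(PATTERNS)), "bits per kernel for a pattern ID")
```

In this sketch each kernel stores only a short pattern ID instead of one position flag per weight (9 bits for a 3x3 bitmap, taken here as one possible baseline), which is the general reason a small pattern library can shrink index storage.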

The Transformer has become an indispensable staple in deep learning. However, for real-life applications, it is very challenging to deploy Transformers efficiently because of the immense number of parameters and operations in these models. Exploiting sparsity is an effective approach to relieve this burden and accelerate Transformers. Newly emerging Ampere graphics processing units (GPUs) leverage a 2:4 sparsity pattern to achieve model acceleration, but this fixed pattern can hardly meet the diverse algorithm and hardware constraints that arise when deploying models. By contrast, we propose an algorithm–hardware co-optimized framework to flexibly and efficiently accelerate Transformers by utilizing general N:M sparsity patterns. First, from an algorithm perspective, we propose a sparsity inheritance mechanism along with inherited dynamic pruning (IDP) to obtain a series of N:M sparse candidate Transformers rapidly. A model compression scheme is further proposed to significantly reduce the storage requirement for deployment. Second, from a hardware perspective, we present a flexible and efficient hardware architecture, namely, STA, to achieve significant speedup when deploying N:M sparse Transformers. STA features not only a computing engine unifying both sparse–dense and dense–dense matrix multiplications with high computational efficiency but also a scalable softmax module eliminating the latency from intermediate off-chip data communication. Experimental results show that, compared to other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency. Moreover, STA can achieve 14.47× and 11.33× speedups compared to Intel i9-9900X and NVIDIA RTX 2080 Ti, respectively, and performs 2.00–19.47× faster inference than the state-of-the-art field-programmable gate array (FPGA)-based accelerators for Transformers.
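For readers unfamiliar with N:M sparsity, the constraint itself can be sketched in a few lines of Python: in every group of M consecutive weights, at most N are nonzero (2:4 is the special case N = 2, M = 4). The snippet below uses simple one-shot magnitude selection purely for illustration; it is not the IDP procedure described in the abstract, and the function name and sample values are assumptions.

```python
def nm_prune(weights, n, m):
    """Keep the n largest-magnitude values in every group of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest magnitudes in this group (ties broken by position).
        keep = set(sorted(range(m), key=lambda i: (-abs(group[i]), i))[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Made-up weight row, pruned to the 2:4 pattern.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
print(nm_prune(row, 2, 4))
```

Calling the same function with other (n, m) pairs, such as (1, 4) or (2, 8), yields the more general patterns that the abstract refers to as N:M sparsity.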

Articles published on Efficient Hardware

A Study on the Design Procedure of Re-Configurable Convolutional Neural Network Engine for FPGA-Based Applications

Efficient Hardware Accelerator Design of Non-Linear Optimization Correlative Scan Matching Algorithm in 2D LiDAR SLAM for Mobile Robots.

Replicated Simulated Annealing with a Global-Best Reference for Efficient Hardware Implementation

Efficient Cryptographic Hardware for Safety Message Verification in Internet of Connected Vehicles

VHDL implementation of circular shifting‐partial transmit sequence in MIMO OFDM systems

Ultrafast Near-Ideal Phase-Change Memristive Physical Unclonable Functions Driven by Amorphous State Variations.

B[formula omitted]N[formula omitted]: Resource efficient Bayesian neural network accelerator using Bernoulli sampler on FPGA

SCA: Search-Based Computing Hardware Architecture with Precision Scalable and Computation Reconfigurable Scheme.

Thermal Sensor Placement for Multicore Systems Based on Low-Complex Compressive Sensing Theory

PACA: A Pattern Pruning Algorithm and Channel-Fused High PE Utilization Accelerator for CNNs

Efficient Neuromorphic Hardware Through Spiking Temporal Online Local Learning

Memristive Fast-Canny Operation for Edge Detection

An Algorithm–Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers

ESSA: Design of a Programmable Efficient Sparse Spiking Neural Network Accelerator

EFA-Trans: An Efficient and Flexible Acceleration Architecture for Transformers

Efficient hardware implementations of lightweight Simeck Cipher for resource-constrained applications

High-speed photonic neuromorphic computing using recurrent optical spectrum slicing neural networks

Artificial Tactile Recognition Enabled by Flexible Low-Voltage Organic Transistors and Low-Power Synaptic Electronics.

Efficient deep steering control method for self-driving cars through feature density metric

A CSMA/CA based MAC protocol for hybrid Power-line/Visible-light communication networks: Design and analysis
