Systolic Array Architecture Research Articles

Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that perform multiply-accumulate (MAC) operations. Traditionally, a systolic array processes, in each cycle, an amount of tensor data that matches the size of the array. However, the hyper-parameters of a DNN model differ from layer to layer, so tensor sizes vary across layers. Mapping these irregular tensors onto the systolic array while fully utilizing all of its PEs is challenging. Furthermore, modern DNN systolic accelerators typically employ a single dataflow, which is not optimal for every DNN model. This work proposes ReSA, a reconfigurable dataflow architecture that aims to minimize the execution time of a DNN model by mapping tiny tensors onto a spatially partitioned systolic array. Unlike conventional systolic array architectures, the ReSA data path controller enables the PEs to execute input-stationary, weight-stationary, and output-stationary dataflows. ReSA also decomposes the coarse-grained systolic array into multiple small sub-arrays to reduce fragmentation in tensor mapping. The sub-array units rely on our data arbiter to dispatch tensors to one another through a simple interconnection network. Furthermore, ReSA reorders memory accesses so that the memory load and execution stages overlap, hiding memory latency when handling tiny tensors. Finally, on the host side, ReSA splits the tensors of each layer into multiple small ones and searches for the best dataflow for each tensor. ReSA then encodes the predefined dataflow in our proposed instruction to notify the systolic array to switch to the correct dataflow. As a result, our optimization of the systolic array architecture achieves a geometric mean speedup of 1.87× over the weight-stationary systolic array architecture across 9 different DNN models.
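As a rough illustration of the fragmentation issue described above, the following Python sketch estimates PE utilization when a small tensor tile is mapped onto one large array versus several small sub-arrays. The function name `pe_utilization`, the simple tiling model, and the example sizes are assumptions made for illustration; they are not taken from the ReSA paper.

```python
from math import ceil


def pe_utilization(rows: int, cols: int, array_rows: int, array_cols: int) -> float:
    """Fraction of occupied PE slots that do useful work.

    Assumed model: a rows x cols tile is cut into passes of size
    array_rows x array_cols, and every pass occupies a full (sub-)array,
    even when the tile only partly fills it.
    """
    passes = ceil(rows / array_rows) * ceil(cols / array_cols)
    return (rows * cols) / (passes * array_rows * array_cols)


# Hypothetical example: a 20 x 20 tile on one 32 x 32 array,
# and the same tile on 8 x 8 sub-arrays.
monolithic = pe_utilization(20, 20, 32, 32)   # 400 / 1024
partitioned = pe_utilization(20, 20, 8, 8)    # 400 / (9 * 64)
print(f"32x32 array: {monolithic:.1%}, 8x8 sub-arrays: {partitioned:.1%}")
```

Under this simplified model, the 20 × 20 tile uses about 39% of a 32 × 32 array, but about 69% of the PEs in the nine 8 × 8 sub-arrays it occupies. If the 32 × 32 array were split into sixteen such sub-arrays, the remaining seven would stay available for other tensors.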

Systolic Array (SA) architectures are well-suited for accelerating matrix multiplications through the use of a pipelined array of Processing Elements (PEs) communicating with local connections and pre-orchestrated data movements. Even though most of the dynamic power consumption in SAs is due to multiplications and additions, pipelined data movement within the SA constitutes an additional important contributor. The goal of this work is to reduce the dynamic power consumption associated with the feeding of data to the SA, by employing both dynamic (run-time) and static (offline) techniques. At the hardware level, the proposed architecture synergistically applies bus-invert coding and zero-value clock gating. By exploiting salient attributes of state-of-the-art CNNs, such as the value distribution of the weights, the proposed SA applies appropriate encoding only to the data that exhibits high switching activity. Similarly, when one of the inputs is zero, unnecessary operations are entirely skipped. In addition to this duet of run-time techniques, the proposed methodology also leverages the inherent property of the weight matrix to remain unchanged throughout the inference phase. As such, the weight matrix is appropriately reordered offline to minimize the switching activity between consecutive values, as the matrix is repeatedly loaded into the array. The weight reordering process is formulated as a Traveling Salesman Problem (TSP) and its solution is translated into a switching-activity-aware row permutation of the weight matrix. The symbiotic combination of selectively targeted, application-aware dynamic encoding and offline weight reordering is demonstrated to reduce the switching activity by 38%, on average. This translates to an overall dynamic power reduction of 17.1%–23% when executing state-of-the-art CNN layers on an SA of size 32 × 32. These power savings scale with the array size; for an array of size 64 × 64, the proposed design consumes 29.7%–35.4% less power.
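To make the two ideas in this abstract more concrete, here is a minimal Python sketch of bus-invert coding and of a switching-aware row ordering. It is only an illustration under stated assumptions: the 8-bit word width, the toggle-counting model, and the nearest-neighbour heuristic (a simple stand-in for a real TSP solver) are choices made here, and all function names are hypothetical rather than taken from the paper.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def raw_toggles(words):
    """Bit toggles on a bus that sends each word unencoded (bus starts at 0)."""
    prev, toggles = 0, 0
    for w in words:
        toggles += hamming(prev, w)
        prev = w
    return toggles


def bus_invert_toggles(words, width=8):
    """Toggles under bus-invert coding, counting the extra invert line.

    If more than half of the data lines would flip, the inverted word is
    sent instead and the invert line is raised.
    """
    mask = (1 << width) - 1
    prev_bus, prev_inv, toggles = 0, 0, 0
    for w in words:
        inv = 1 if hamming(prev_bus, w) > width // 2 else 0
        bus = (w ^ mask) if inv else w
        toggles += hamming(prev_bus, bus) + (inv ^ prev_inv)
        prev_bus, prev_inv = bus, inv
    return toggles


def row_distance(r, s):
    """Switching cost of loading row s right after row r."""
    return sum(hamming(a, b) for a, b in zip(r, s))


def order_cost(rows, order):
    """Total switching cost of loading rows in the given order."""
    return sum(row_distance(rows[i], rows[j]) for i, j in zip(order, order[1:]))


def greedy_row_order(rows):
    """Nearest-neighbour heuristic (assumed here, not an exact TSP solution)."""
    order, remaining = [0], list(range(1, len(rows)))
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: row_distance(last, rows[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical 8-bit data chosen to show the effect clearly.
stream = [0x00, 0xFF, 0x00, 0xFF]
print(raw_toggles(stream), bus_invert_toggles(stream))   # 24 3

weights = [[0x00], [0xFF], [0x01], [0xFE]]
order = greedy_row_order(weights)
print(order, order_cost(weights, [0, 1, 2, 3]), order_cost(weights, order))
# [0, 2, 1, 3] 23 9
```

In a real accelerator, any row permutation of the weight matrix would also have to be matched by a consistent reordering of the data that interacts with those rows; that bookkeeping is omitted from this sketch.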

Articles published on Systolic Array Architecture

Enhancing Computation-Efficiency of Deep Neural Network Processing on Edge Devices through Serial/Parallel Systolic Computing

FPGA-Based Acceleration of Polar-Format Algorithm for Video Synthetic-Aperture Radar Imaging

VerSA: Versatile Systolic Array Architecture for Sparse and Dense Matrix Multiplications

Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models

ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors

Exploiting data encoding and reordering for low-power streaming in systolic arrays

A Survey of Design and Optimization for Systolic Array-based DNN Accelerators

SaARSP: An Architecture for Systolic-Array Acceleration of Recurrent Spiking Neural Networks

Hybrid Accumulator Factored Systolic Array for Machine Learning Acceleration

Power-based Attacks on Spatial DNN Accelerators

Systematic realization of a fully connected deep and convolutional neural network architecture on a field programmable gate array

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

A New Approach for a Unified Architecture for Type IV DCT/DST with an Efficient Incorporation of Obfuscation Technique

Low-space bit-serial systolic array architecture for interleaved multiplication over GF(2^m)

Toward Functional Safety of Systolic Array-Based Deep Learning Hardware Accelerators

Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators

Near-Precise Parameter Approximation for Multiple Multiplications on A Single DSP Block

Low Latency YOLOv3-Tiny Accelerator for Low-Cost FPGA Using General Matrix Multiplication Principle

2-D Systolic Array architecture of CBNS based Discrete Hilbert Transform Processor

Systolic architecture for adaptive block FIR filter for throughput using distributed arithmetic
