To satisfy the massive computational requirements of Convolutional Neural Networks (CNNs), various accelerators based on Domain-Specific Architectures have been deployed in large-scale systems. While they improve performance significantly, the high integration density of these accelerators makes them much more susceptible to soft errors, which propagate and are amplified layer by layer during CNN execution, ultimately disturbing the CNN's decision and leading to catastrophic consequences. As CNNs are increasingly deployed in security-critical areas, their reliable execution demands more attention. Although classical fault-tolerant approaches are effective against errors, they introduce non-negligible performance and energy overheads, which runs counter to the CNN accelerator design philosophy. In this article, we leverage CNNs' intrinsic tolerance of minor errors and the similarity of filters within a layer to explore Approximate Fault Tolerance opportunities for reducing the fault tolerance overhead of CNN accelerators. By clustering the filters into several check groups and performing an inexact check that still mitigates serious errors, our approximate fault tolerance design reduces fault tolerance overhead significantly. Furthermore, we remap the filters to match the checking process to the dataflow of the systolic array, which satisfies the real-time checking demands of CNNs. Experimental results show that our approach eliminates 73.39% of the performance degradation of baseline DMR.
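The core idea of the abstract, grouping similar filters into check groups and flagging only serious deviations, can be illustrated with a minimal sketch. The function names (`cluster_filters`, `approximate_check`), the use of 1-D k-means over filter checksums, and the tolerance threshold are all illustrative assumptions, not the paper's actual algorithm or dataflow remapping.

```python
# Hypothetical sketch of approximate fault tolerance via filter clustering.
# Instead of an exact per-filter duplicate-and-compare (as in DMR), similar
# filters share one check group, and each filter is checked loosely against
# its group's representative checksum: only large (soft-error-sized)
# deviations are flagged, while minor errors are tolerated by design.

def cluster_filters(filters, num_groups, iters=10):
    """Naive 1-D k-means on filter checksums (sum of weights).

    Deterministic initialization (first `num_groups` checksums) keeps the
    sketch reproducible; a real implementation would cluster offline on the
    full weight vectors.
    """
    sums = [sum(f) for f in filters]
    centroids = sums[:num_groups]
    for _ in range(iters):
        groups = [[] for _ in range(num_groups)]
        for i, s in enumerate(sums):
            k = min(range(num_groups), key=lambda g: abs(s - centroids[g]))
            groups[k].append(i)
        for k, g in enumerate(groups):
            if g:  # keep old centroid if a group empties out
                centroids[k] = sum(sums[i] for i in g) / len(g)
    return groups, centroids

def approximate_check(filters, groups, centroids, tol):
    """Flag filter indices whose checksum deviates from the group
    centroid by more than `tol` (a serious, likely soft-error, fault)."""
    flagged = []
    for k, g in enumerate(groups):
        for i in g:
            if abs(sum(filters[i]) - centroids[k]) > tol:
                flagged.append(i)
    return flagged

if __name__ == "__main__":
    # Four 3x3 filters (flattened); two near 0.1, two near 0.5.
    filters = [[0.10] * 9, [0.11] * 9, [0.50] * 9, [0.52] * 9]
    groups, cents = cluster_filters(filters, num_groups=2)
    filters[2][0] += 10.0  # inject a large soft-error-like corruption
    print(approximate_check(filters, groups, cents, tol=0.5))  # → [2]
```

The inexactness is deliberate: small weight perturbations stay below `tol` and incur no recovery cost, which is where the overhead reduction over exact duplication comes from.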