Pipelined Array Research Articles

Systolic Array (SA) architectures are well-suited for accelerating matrix multiplications through the use of a pipelined array of Processing Elements (PEs) communicating with local connections and pre-orchestrated data movements. Even though most of the dynamic power consumption in SAs is due to multiplications and additions, pipelined data movement within the SA constitutes an additional important contributor. The goal of this work is to reduce the dynamic power consumption associated with the feeding of data to the SA, by employing both dynamic (run-time) and static (offline) techniques. At the hardware level, the proposed architecture synergistically applies bus-invert coding and zero-value clock gating. By exploiting salient attributes of state-of-the-art CNNs, such as the value distribution of the weights, the proposed SA applies appropriate encoding only to the data that exhibits high switching activity. Similarly, when one of the inputs is zero, unnecessary operations are entirely skipped. In addition to this duet of run-time techniques, the proposed methodology also leverages the inherent property of the weight matrix to remain unchanged throughout the inference phase. As such, the weight matrix is appropriately reordered offline to minimize the switching activity between consecutive values, as the matrix is repeatedly loaded into the array. The weight reordering process is formulated as a Traveling Salesman Problem (TSP) and its solution is translated into a switching-activity-aware row permutation of the weight matrix. The symbiotic combination of selectively targeted, application-aware dynamic encoding and offline weight reordering is demonstrated to reduce the switching activity by 38%, on average. This translates to an overall dynamic power reduction of 17.1%–23% when executing state-of-the-art CNN layers on an SA of size 32 × 32. 
These power savings scale with the array size; for an array of size 64 × 64, the proposed design consumes 29.7%–35.4% less power.
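The two ideas in this abstract that lend themselves to a small illustration are bus-invert coding and switching-aware row reordering. The Python sketch below is not the authors' implementation: the 8-bit word width, the all-zero initial bus state, the function names, and the greedy nearest-neighbour heuristic (standing in for a proper TSP solver) are all assumptions made for illustration. It also ignores zero-value clock gating and the fact that a row permutation has to be applied consistently to the other operand or to the outputs.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def switching_activity(words):
    """Total bit toggles when the words are sent back to back on one bus."""
    return sum(hamming(x, y) for x, y in zip(words, words[1:]))


def bus_invert(words, width=8):
    """Bus-invert coding: send the complement whenever more than half
    of the bus lines would otherwise toggle. Returns (sent, invert_flag)
    pairs. The bus is assumed to start at all zeros."""
    mask = (1 << width) - 1
    prev = 0
    encoded = []
    for w in words:
        w &= mask
        if hamming(prev, w) > width // 2:
            sent, inv = ~w & mask, 1
        else:
            sent, inv = w, 0
        encoded.append((sent, inv))
        prev = sent  # the next comparison is against what was actually sent
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (sent, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(s ^ mask) if inv else s for s, inv in encoded]


def row_cost(r1, r2):
    """Toggles caused by loading row r2 right after row r1."""
    return sum(hamming(a, b) for a, b in zip(r1, r2))


def greedy_row_order(rows):
    """Nearest-neighbour heuristic for the reordering problem: start at
    row 0 and repeatedly append the unvisited row that is cheapest to
    load next. A stand-in for an exact or better TSP solver."""
    order = [0]
    remaining = list(range(1, len(rows)))
    while remaining:
        last = rows[order[-1]]
        nxt = min(remaining, key=lambda i: row_cost(last, rows[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def order_cost(rows, order):
    """Total toggles when rows are loaded in the given order."""
    return sum(row_cost(rows[a], rows[b]) for a, b in zip(order, order[1:]))
```

For instance, calling `bus_invert([0x00, 0xFF, 0x00, 0xFF])` keeps the data lines constant and moves all the toggling onto the single invert line, and `greedy_row_order` on a small matrix returns a load order whose `order_cost` can be compared directly against the original order.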

SVD and QR Decomposition have attracted much attention recently for solving linear algebra and digital signal processing problems. The Householder QR Decomposition requires fewer operations than the Givens QR Decomposition and is therefore strongly recommended for sequential computer implementations (as found in nearly all well-known libraries, such as NAG, LINPACK, and EISPACK). For parallel implementation, however, it has not been widely used, because the Givens Rotations possess inherent parallelism and the Householder Reflections do not. This is true if one analyzes the original algorithm: four independent loops and three bottlenecks between the loops constrain the pipelining of the computations. This leads to an inefficient solution, i.e. 4n time steps for each iteration, and since there are n iterations, the total time to execute the algorithm is O(4n²). Compared to 3n − 2 time steps for the Givens QR Decomposition, it is uneconomic. In this paper we have used some recent results for the elimination of computational and data broadcast and of data synchronization to derive a fully localized form of the Householder QR Decomposition algorithm. We have succeeded in reorganizing and transforming the algorithm from three bottlenecks to one and from four loops to one. A linear and a double pipeline array of n+1 processors are presented to solve the problem in O(n²/2) time steps. It is also shown that the bottlenecks for the bidiagonalization and SVD computation cannot be eliminated.
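For orientation, the sequential Householder QR Decomposition that this abstract starts from can be sketched as follows. This is the textbook form, written in plain Python under the assumption of a dense real matrix given as a list of rows; it is not the localized, single-loop systolic formulation the paper derives, and the function name is hypothetical.

```python
import math


def householder_qr(A):
    """Return (Q, R) with A = Q R, using Householder reflections.
    A is a list of rows; Q is m x m orthogonal, R is m x n upper triangular."""
    m, n = len(A), len(A[0])
    R = [[float(v) for v in row] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Column k below (and including) the diagonal.
        x = [R[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Choose the sign that avoids cancellation in v[0].
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # R <- H R, with H = I - 2 v v^T / (v^T v) acting on rows k..m-1.
        for j in range(n):
            s = sum(v[i] * R[k + i][j] for i in range(len(v)))
            f = 2.0 * s / vtv
            for i in range(len(v)):
                R[k + i][j] -= f * v[i]
        # Q <- Q H, so that Q accumulates H_1 H_2 ... H_k.
        for r in range(m):
            s = sum(Q[r][k + i] * v[i] for i in range(len(v)))
            f = 2.0 * s / vtv
            for i in range(len(v)):
                Q[r][k + i] -= f * v[i]
    return Q, R
```

Each pass of the outer loop computes a norm, builds the reflection vector, and then applies it to every remaining column; those dependent stages are the kind of loop structure that makes the original algorithm awkward to pipeline.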

Articles published on Pipelined Array

Exploiting data encoding and reordering for low-power streaming in systolic arrays

Low power DNA protein sequence alignment using FSM state transition controller

Design of Generalized Pipeline Cellular Array in Quantum-Dot Cellular Automata

A 181 GOPS AKAZE Accelerator Employing Discrete-Time Cellular Neural Networks for Real-Time Feature Extraction.

A 130.7-mm² 2-Layer 32-Gb ReRAM Memory Device in 24-nm Technology

C-slow retimed parallel histogram architectures for consumer imaging devices

A Generalized Frame-Level FSBM FLSA Architecture

Real-Time Computation of Local Neighborhood Functions in Application-Specific Instruction-Set Processors

High-performance IP Lookup Engine with Compact Clustered Trie Search

Parallel pipelined array architectures for real-time histogram computation in consumer devices

Research on the Application of Micro-Program Controller in Parallel Neural Networks

Binary multipliers on quantum-dot cellular automata

A New Pipelined Systolic Array-Based Architecture for Matrix Inversion in FPGAs with Kalman Filter Case Study

A pipelined array architecture for Euclidean distance transformation and its FPGA implementation

Improving power-awareness of pipelined array multipliers using two-dimensional pipeline gating and its application on FIR design

Real-time area correlation tracker implementation based on absolute difference algorithm

Hardware implementation of soft color image morphological operations

Systolic SVD and QR Decomposition by Householder Reflections

Area-time-power tradeoff in cellular arrays VLSI implementations

A pipelined architecture for image segmentation by adaptive progressive thresholding

