Throughput Architecture Research Articles

Distributed arithmetic (DA) is an efficient look-up table (LUT) based approach, but the throughput of a DA-based implementation is limited by the LUT size. This paper presents two high-throughput architectures (Type I and Type II) of non-pipelined DA-based least-mean-square (LMS) adaptive filters (ADFs) using two’s complement (TC) and offset-binary coding (OBC), respectively. We formulate the LMS algorithm using the steepest descent approach, with a possible extension to its power-normalized LMS version, and then discuss its convergence properties. The coefficient update equation of the LMS algorithm is then transformed via TC DA and OBC DA to design and develop non-pipelined ADF architectures. The proposed structures employ the LUT pre-decomposition technique to increase throughput; it enables the same mapping scheme to be used for concurrent update of the decomposed LUTs. An efficient fixed-point quantization model for evaluating the proposed structures from a realistic point of view is also presented. The Type II structure provides higher throughput than the Type I structure at the expense of a slower convergence rate, with almost the same steady-state mean square error. Unlike existing non-pipelined LMS ADFs, the proposed structures offer very high throughput, especially with large-order DA base units. Furthermore, they perform fewer additions in every filter cycle. Simulation results show that, for a 256th-order filter with an 8th-order DA base unit, the Type I structure provides 9.41× higher throughput and the Type II structure provides 16.68× higher throughput than the best existing design. Synthesis results show that, for a 32nd-order filter with an 8th-order DA base unit, the Type I structure achieves a 38.76% shorter minimum sampling period (MSP), occupies 28.62% more area, consumes 67.18% more power, and utilizes 49.06% more slice LUTs and 3.31% more flip-flops (FFs), whereas the Type II structure achieves a 51.25% shorter MSP, occupies 21.42% more area, consumes 47.84% more power, and utilizes 29.10% more slice LUTs and 1.47% fewer FFs than the best existing design.
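To make the LUT-based idea in the abstract above concrete, the following is a minimal software sketch of textbook two's complement distributed arithmetic for a fixed-coefficient inner product. It is not the architecture proposed in the paper: the function names (`build_lut`, `da_inner_product`, `da_decomposed`) and the integer-only setting are assumptions made for illustration, and `da_decomposed` shows only generic partitioning of one large LUT into smaller base units, not the authors' pre-decomposition or update scheme.

```python
def build_lut(coeffs):
    """Entry k holds the sum of the coefficients whose bit is set in address k."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (k >> i) & 1)
            for k in range(1 << n)]


def da_inner_product(coeffs, samples, bits):
    """Inner product of integer coeffs and signed samples, one bit plane at a time.

    Each sample must fit in `bits`-bit two's complement.
    """
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into one LUT address.
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> b) & 1) << i
        term = lut[addr] << b
        # The sign bit of a two's complement word carries negative weight.
        acc = acc - term if b == bits - 1 else acc + term
    return acc


def da_decomposed(coeffs, samples, bits, base):
    """Same result using several small LUTs of 2**base entries each."""
    return sum(da_inner_product(coeffs[i:i + base], samples[i:i + base], bits)
               for i in range(0, len(coeffs), base))


coeffs = [3, -2, 5, 7]
samples = [1, -4, 6, -8]
direct = sum(c * x for c, x in zip(coeffs, samples))
print(direct, da_inner_product(coeffs, samples, 8), da_decomposed(coeffs, samples, 8, 2))
```

In this sketch, partitioning an N-tap filter into base units of P taps replaces one LUT of 2^N entries with N/P LUTs of 2^P entries each, at the cost of extra additions to combine the partial results.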

Throughput architectures such as GPUs require substantial hardware resources to hold the state of a massive number of simultaneously executing threads. While GPU register files are already enormous, reaching capacities of 256KB per streaming multiprocessor (SM), we find that nearly half of the real-world applications we examined are register-bound and would benefit from a larger register file to enable more concurrent threads. This article seeks to increase thread occupancy and improve the performance of these register-bound applications by making more efficient use of the existing register file capacity. Our first technique eagerly deallocates register resources during execution. We show that releasing register resources based on value liveness, as proposed in prior state-of-the-art work, leads to unreliable performance and undue design complexity. To address these deficiencies, our article presents a novel compiler-driven approach that identifies and exploits the last use of a register name (instead of the value contained within) to eagerly release register resources. Furthermore, while previous works have leveraged “scalar” and “narrow” operand properties of a program for various optimizations, their impact on thread occupancy has been relatively unexplored. Our article evaluates the effectiveness of these techniques in improving thread occupancy and demonstrates that, while any one approach may free only a few registers, together they synergistically free enough registers to launch additional parallel work. An in-depth evaluation on a large suite of applications shows that our early register release technique alone outperforms previous work on dynamic register allocation, and that together these approaches provide, on average, a 12% performance speedup (23% higher thread occupancy) on register-bound applications not already saturating other GPU resources.
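The distinction between the last use of a register name and the liveness of a value can be illustrated with a small sketch. The code below is a simplified illustration, not the compiler pass described in the article: it assumes straight-line code with no branches, and the `Instr` type and the functions `last_use_of_names` and `release_points` are hypothetical names introduced here.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dst: Optional[str]       # register name written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # register names read


def last_use_of_names(program):
    """Index of the final instruction that mentions each register name."""
    last = {}
    for idx, ins in enumerate(program):
        names = list(ins.srcs)
        if ins.dst is not None:
            names.append(ins.dst)
        for name in names:
            last[name] = idx  # later mentions overwrite earlier ones
    return last


def release_points(program):
    """Map instruction index -> register names that can be released after it."""
    releases = {}
    for name, idx in last_use_of_names(program).items():
        releases.setdefault(idx, []).append(name)
    return {idx: sorted(names) for idx, names in releases.items()}


program = [
    Instr("r0", ()),            # 0: r0 = load
    Instr("r1", ()),            # 1: r1 = load
    Instr("r2", ("r0", "r1")),  # 2: r2 = r0 + r1
    Instr("r0", ("r2", "r2")),  # 3: r0 = r2 * r2  (name r0 is reused)
    Instr(None, ("r0",)),       # 4: store r0
]
print(release_points(program))
```

In this example the first value held in `r0` is dead after instruction 2, but the name `r0` is written again at instruction 3, so a name-based scheme holds the register until instruction 4. It gives up some early releases in exchange for a single release point per name.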

Articles published on Throughput Architecture

Active Queue Management in L4S with Asynchronous Advantage Actor-Critic: A FreeBSD Networking Stack Perspective

Design and performance analysis of manchester coder-based body channel communication using FPGA

High-Throughput Hardware Design for Linear Equation System Solving of VVC Affine Prediction

Heterogeneous microstructures tuned in a high throughput architecture

Secure image encryption using high throughput architectures of PRINT cipher for radio frequency identification applications

High throughput architecture for multiscale variational optical flow

Heterogeneous Microstructures Tuned in a High Throughput Architecture

Two Distributed Arithmetic Based High Throughput Architectures of Non-Pipelined LMS Adaptive Filters

Enhanced parallel CFAR architecture with sharing resources using FPGA

GPU-based power converter transient simulation with matrix exponential integration and memory management

Validating the Sharing Behavior and Latency Characteristics of the L4S Architecture

Performance Analysis of High Throughput MAP Decoder for Turbo Codes and Self Concatenated Convolutional Codes

Software-Directed Techniques for Improved GPU Register File Utilization

An In-Memory VLSI Architecture for Convolutional Neural Networks

GPU NTC Process Variation Compensation With Voltage Stacking

High throughput FPGA Implementation of Advanced Encryption Standard Algorithm

High throughput resource shared 2D integer transform computation for H.264/MPEG-4 AVC

VLSI Implementation of a Rate Decoder for Structural LDPC Channel Codes

A high throughput architecture for a low complexity soft-output demapping algorithm

Network-Level FPGA Acceleration of Low Latency Market Data Feed Arbitration
