Dataflow Architecture Research Articles

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This article investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on field-programmable gate arrays (FPGAs). Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. This model can be extended to multi-FPGA settings for distributed inference. Through our analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open source. To validate the effectiveness of both our analytical model and HLS library, we have implemented Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT2) on an AMD Xilinx Alveo U280 FPGA device. Experimental results demonstrate that our approach can achieve up to 13.4× speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
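To make the idea of an analytical performance model concrete, the following is a minimal sketch of a generic roofline-style latency bound. It is not the model from the article: the function names, parameters, and all numbers are hypothetical and chosen purely for illustration. It only shows the basic reasoning that a stage takes at least as long as the slower of its compute time and its memory-transfer time.

```python
def estimate_latency_s(flops, bytes_moved, peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Lower-bound latency of one stage under a simple roofline assumption.

    All parameters are hypothetical inputs, not values from the article:
    flops                     -- arithmetic operations the stage performs
    bytes_moved               -- bytes read from or written to off-chip memory
    peak_flops_per_s          -- peak compute throughput of the device
    mem_bandwidth_bytes_per_s -- peak off-chip memory bandwidth
    """
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / mem_bandwidth_bytes_per_s
    # The stage cannot finish before both its compute and its data movement do.
    return max(compute_time, memory_time)


def is_memory_bound(flops, bytes_moved, peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """True when data movement, not arithmetic, dominates the estimate."""
    return (bytes_moved / mem_bandwidth_bytes_per_s) > (flops / peak_flops_per_s)


if __name__ == "__main__":
    # Made-up workload: 2 GFLOP of work that moves 1 GB of data,
    # on a made-up device with 1 TFLOP/s compute and 100 GB/s bandwidth.
    latency = estimate_latency_s(2e9, 1e9, 1e12, 1e11)
    print(f"estimated latency: {latency * 1e3:.1f} ms")
    print("memory bound:", is_memory_bound(2e9, 1e9, 1e12, 1e11))
```

In this made-up configuration the memory term dominates, which mirrors the abstract's point that off-chip memory access is a main obstacle to low latency; reducing `bytes_moved` is what a dataflow design that keeps intermediate results on chip aims for.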


The Bonsai Merkle tree (BMT) is a widely used tree structure for authentication of metadata such as encryption counters in a secure computing system. Common BMT algorithms were designed for traditional von Neumann architectures with a software-centric implementation in mind and, as such, they are predominantly recursive and sequential in nature. However, modern heterogeneous computing platforms employing Field-Programmable Gate Array (FPGA) devices require concurrency-focused algorithms to fully utilize the versatility and parallel nature of such systems. The recursive nature of traditional BMT algorithms makes them challenging to implement in such hardware-based setups. Our goal for this work is to introduce HMT, a hardware-friendly BMT algorithm that enables the verification and update processes to function independently and provides the benefits of relaxed update while being comparable to the eager update in terms of update complexity. The methodology of HMT contributes both novel algorithmic revisions and innovative hardware techniques to implementing BMT. We mathematically demonstrate the challenges of potentially unbounded recursions in relaxed BMT updates. To solve this problem, we use a partitioned BMT caching scheme that allocates a separate write-back cache for each BMT level, thus allowing for low and fixed upper bounds for dirty evictions compared to the traditional BMT caches. Then we introduce the aforementioned hybrid BMT algorithm that is hardware-targeted, parallel, and relaxes the update depending on BMT cache hits but makes the update conditions more flexible compared to lazy update to save additional write-backs. Deploying this new algorithm, we have designed a new BMT controller with a dataflow architecture including speculative buffers and parallel write-back engines to facilitate performance-enhancing mechanisms (such as multiple concurrent authentications and independent updates) that were not possible with the conventional lazy algorithm. Our empirical performance measurements on a Xilinx U200 accelerator FPGA have demonstrated that HMT can achieve up to 7× improvement in bandwidth and 4.5× reduction in latency over a lazy-update BMT baseline and up to 14% faster execution in standard benchmarks compared to a state-of-the-art, eager-update BMT solution.
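For readers unfamiliar with Merkle-tree authentication, the sketch below shows the basic verification step that a BMT builds on: a leaf (here standing in for a block of encryption counters) is authenticated by rehashing its path up to a trusted root. This is a plain binary hash tree in software, not the HMT algorithm and not a hardware design; the function names, the use of SHA-256, and the sample data are assumptions made for illustration.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash helper (SHA-256 is an assumption; the abstract names no hash)."""
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return all tree levels: levels[0] holds leaf hashes, levels[-1] the root.

    For simplicity this sketch requires a power-of-two number of leaves.
    """
    assert leaves and len(leaves) & (len(leaves) - 1) == 0
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def verify(leaf: bytes, index: int, levels, trusted_root: bytes) -> bool:
    """Recompute the path from one leaf to the root and compare with the root.

    Sibling hashes come from `levels` (untrusted storage); only
    `trusted_root` is assumed to be held securely.
    """
    node = h(leaf)
    for level in levels[:-1]:
        sibling = level[index ^ 1]  # the other child of the same parent
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2  # move up to the parent position
    return node == trusted_root


if __name__ == "__main__":
    counters = [b"ctr0", b"ctr1", b"ctr2", b"ctr3"]  # hypothetical counter blocks
    levels = build_tree(counters)
    root = levels[-1][0]
    print(verify(b"ctr2", 2, levels, root))      # untampered leaf
    print(verify(b"tampered", 2, levels, root))  # modified leaf
```

Each verification walks one node per tree level in sequence, which illustrates the leaf-to-root dependency that the abstract describes as difficult to map onto parallel hardware.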



Articles published on Dataflow Architecture

HLPerf: Demystifying the Performance of HLS-based Graph Neural Networks with Dataflow Architectures

Understanding the Potential of FPGA-based Spatial Acceleration for Large Language Model Inference

Reconfigurable Acceleration of Neural Networks: A Comprehensive Study of FPGA-based Systems

Domain Specific Abstractions for the Development of Fast-by-Construction Dataflow Codes on FPGAs

ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors

CUTE: A scalable CPU-centric and Ultra-utilized Tensor Engine for convolutions

Improving Utilization of Dataflow Unit for Multi-Batch Processing

Coupled Ferroelectric-Photonic Memory in a Retinomorphic Hardware for In-Sensor Computing

Unleashing the power of decentralized serverless IoT dataflow architecture for the Cloud-to-Edge Continuum: a performance comparison

Merging control-flow and dataflow architectures on a single chip

A TinyML Model for Gesture-Based Air Handwriting Arabic Numbers Recognition

Time, causality, and realizability: Engineering interactive, distributed software systems

On the RTL Implementation of FINN Matrix Vector Unit

An area-efficient and low-latency elliptic curve scalar multiplication accelerator over prime field

HMT: A Hardware-centric Hybrid Bonsai Merkle Tree Algorithm for High-performance Authentication

PicoTDC: a flexible 64 channel TDC with picosecond resolution

A High-Throughput Full-Dataflow MobileNetv2 Accelerator on Edge FPGA

Study of FIT Dedicated Computer with Dataflow Architecture for High Performance 2-D Magneto-Static Field Simulation

Coordinated Cloud-Edge Anomaly Identification for Active Distribution Networks

The Design of Efficient Data Flow and Low-Complexity Architecture for a Highly Configurable CNN Accelerator
