Vivado High-level Synthesis Research Articles

With the prevalence of hardware accelerators as an integral part of the modern systems on chip (SoCs), the ability to model accelerators quickly and accurately within the system in which it operates is critical. This paper presents gem5-SALAMv2 as a novel system architecture for LLVM-based modeling and simulation of custom hardware accelerators integrated into the gem5 framework. It overcomes the inherent limitations of state-of-the-art trace-based pre-register-transfer level (RTL) simulators by offering a truly “execute-in-execute” LLVM-based model. It enables scalable modeling of multiple dynamically interacting accelerators with full-system simulation support. To create long-term sustainable expansion compatible with the gem5 system framework, gem5-SALAM offers a general-purpose and modular communication interface and memory hierarchy integrated into the gem5 ecosystem, streamlining designing and modeling accelerators for new and emerging applications. gem5-SALAMv2 expands upon the framework established in gem5-SALAMv1 with improved LLVM-based elaboration and simulation, improved and more extensible system integration, and new automations to simplify rapid prototyping and design space exploration. 11Conference Paper Extension: This work extends the work presented in gem5-SALAM: A System Architecture for LLVM-based Accelerator Modeling from MICRO 2020 (Rogers et al., 2020). This work expands on the aforementioned work by revamping the gem5-SALAM internals to provide more robust and extensible simulations, introducing new automation tools for expanding and simplifying design space exploration, and demonstrates the new capabilities of gem5-SALAMv2 by exploring multiple configurations of simple neural network architectures.Validation on the MachSuite (Reagen et al., 2014) benchmarks presents a timing estimation error of less than 1% against the Vivado High-Level Synthesis (HLS) tool. Results also show less than a 4% area and power estimation error against Synopsys Design Compiler. Additionally, system validation against implementations on an Ultrascale+ ZCU102 shows an average end-to-end timing error of less than 2%. Lastly, we demonstrate the upgraded capabilities of gem5-SALAMv2 by exploring accelerator platforms for two deep neural networks, LeNet5 and MobileNetv2. In these explorations, we demonstrate how gem5-SALAMv2 can simulate such systems and guide architectural optimizations for these types of accelerator-rich architectures. 22The most up-to-date version of gem5-SALAMv2 is available at https://github.com/TeCSAR-UNCC/gem5-SALAM..

Read full abstract

Efficient utilization of restrained memory resources is of paramount importance in CPU-FPGA heterogeneous multiprocessor system-on-chip (HMPSoC)-based system design for memory-intensive applications. State-of-the-art high level synthesis (HLS) tools rely on the system programmers to manually determine the data placement within the complex memory hierarchy. Different data placement policies may lead to different system performance, and finding an optimal data placement policy is a nontrivial problem. For instance, we show counter-intuitive results that traditional frequency and locality-based data placement strategy designed for CPU architecture leads to nonoptimal system performance in CPU-FPGA HMPSoCs. In this work, we first propose an automatic data placement framework for field programmable gate array (FPGA) kernels to determine whether each array object should be accessed via the on-chip BRAM, shared CPU L2-cache, or DDR memory to achieve the optimal performance. Moreover, we find that when the CPU kernel and the FPGA kernel are executed in parallel, memory contentions may degrade the performance and the optimal data placement policy designed for the FPGA kernel alone will not achieve the optimal overall system performance. In this article, we proposed to use cache partitioning to alleviate the impact brought by memory contentions. We extend the framework designed for FPGA by adding the cross-layer memory contentions analysis to automatically generate an optimal data placement policy and cache partitioning mechanism for the parallel executing kernels. The proposed data placement framework can be seamlessly integrated with the commercial Vivado HLS. The experimental results on the Zedboard platform show an average <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.5\times $ </tex-math></inline-formula> performance speedup for FPGA kernels compared with a greedy-based allocation strategy. When FPGA kernels and CPU kernels are executed in parallel, the FPGA kernel and the CPU kernel have a performance speedup of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.62\times $ </tex-math></inline-formula> and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.10\times $ </tex-math></inline-formula> on average, respectively.

Read full abstract

Vivado High-level Synthesis Research Articles

Related Topics

Articles published on Vivado High-level Synthesis

Expanding hardware accelerator system design space exploration with gem5-SALAMv2

A New Optimized Hybridization Approach for in silico High Throughput Molecular Docking on FPGA Platform.

Notification Oriented Paradigm to Digital Hardware — A benchmark evaluation with Random Forest algorithm

FPGA implementation of QUasi-Affine TRansformation evolutionary algorithm

Accelerated FPGA-Based Vector Directional Filter for Real-Time Color Image Denoising with Enhanced Performance

A Comprehensive Memory Management Framework for CPU-FPGA Heterogenous SoCs

Neuromorphic circuit based on the un-supervised learning of biologically inspired spiking neural network for pattern recognition

Implementation and Hardware Acceleration of Wiener filter algorithm using Vivado High Level Synthesis HLS for X-ray baggage Images

Hardware Acceleration of Video Edge Detection with Hight Level Synthesis on the Xilinx Zynq Platform

An FPGA Design for Real-Time Image Denoising

Unified FPGA Design for the HEVC Dequantization and Inverse Transform Modules

FPGA-based DFT system design, optimisation and implementation using high-level synthesis

Dependency Graph-based High-level Synthesis for Maximum Instruction Parallelism

High level synthesis strategies for ultra fast and low latency matrix inversion implementation for massive MIMO processing

A Highly Configurable High-Level Synthesis Functional Pattern Library

Implementation of Efficient Stopping Criteria for Turbo Decoding

High-Level Annotation of Routing Congestion for Xilinx Vivado HLS Designs

Design, optimisation and implementation of a DCT/IDCT-based image processing system on FPGA

The 3D-DTW Custom IP based FPGA Hardware Acceleration for Action Recognition

RETRACTED: FPGA-Based Motion Estimation Algorithm Optimization

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Vivado High-level Synthesis Research Articles

Related Topics

Articles published on Vivado High-level Synthesis

Expanding hardware accelerator system design space exploration with gem5-SALAMv2

A New Optimized Hybridization Approach for in silico High Throughput Molecular Docking on FPGA Platform.

Notification Oriented Paradigm to Digital Hardware — A benchmark evaluation with Random Forest algorithm

FPGA implementation of QUasi-Affine TRansformation evolutionary algorithm

Accelerated FPGA-Based Vector Directional Filter for Real-Time Color Image Denoising with Enhanced Performance

A Comprehensive Memory Management Framework for CPU-FPGA Heterogenous SoCs

Neuromorphic circuit based on the un-supervised learning of biologically inspired spiking neural network for pattern recognition

Implementation and Hardware Acceleration of Wiener filter algorithm using Vivado High Level Synthesis HLS for X-ray baggage Images

Hardware Acceleration of Video Edge Detection with Hight Level Synthesis on the Xilinx Zynq Platform

An FPGA Design for Real-Time Image Denoising

Unified FPGA Design for the HEVC Dequantization and Inverse Transform Modules

FPGA-based DFT system design, optimisation and implementation using high-level synthesis

Dependency Graph-based High-level Synthesis for Maximum Instruction Parallelism

High level synthesis strategies for ultra fast and low latency matrix inversion implementation for massive MIMO processing

A Highly Configurable High-Level Synthesis Functional Pattern Library

Implementation of Efficient Stopping Criteria for Turbo Decoding

High-Level Annotation of Routing Congestion for Xilinx Vivado HLS Designs

Design, optimisation and implementation of a DCT/IDCT-based image processing system on FPGA

The 3D-DTW Custom IP based FPGA Hardware Acceleration for Action Recognition

RETRACTED: FPGA-Based Motion Estimation Algorithm Optimization