PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

Moazin Khatti, Jason Cong, Xingyu Tian, Ahmad Sedigh Baroughi, Licheng Guo, Zhenman Fang, Akhil Raj Baranwal, Yuze Chi

doi:10.1145/3676849

Abstract

In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, one notable challenge arises when scaling up designs for modern datacenter FPGAs that comprise multiple dies. The extra delays introduced by die crossings and routing congestion can significantly degrade the frequency of large designs on these FPGA boards. Due to the gap between HLS design and physical design, it is challenging for HLS programmers to analyze and identify the root causes and to fix their HLS designs to achieve better timing closure. Recent efforts have aimed to address these issues by employing coarse-grained floorplanning and pipelining strategies on task-parallel HLS designs where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming-friendly, and many existing accelerator designs rely heavily on buffer-channel-based communication between tasks. In this work, we take a step further to support a task-parallel programming model where tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. Our framework introduces a latency-insensitive buffer channel design, which supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an easy-to-use programming model for utilizing the proposed buffer channel, while on the backend, we implement efficient placement and pipelining strategies for the proposed buffer channel. To validate the effectiveness of our framework, we test it on four widely used Rodinia HLS benchmarks and two real-world accelerator designs and show an average frequency improvement of 25%, with peak improvements of up to 89% on AMD/Xilinx Alveo U280 boards compared to Vitis HLS baselines.
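
To make the idea of a ping-pong buffer channel between two tasks more concrete, the following is a minimal software sketch in C++. It is not the PASTA API and is not synthesizable HLS code: the class `PingPongChannel`, its `try_acquire_*`/`release_*` methods, and the helper `reverse_tiles` are hypothetical names introduced only for illustration. The sketch assumes a channel with two sections, a producer that fills one section at a time, and a consumer that reads each section in a non-streaming order (here, in reverse). Memory partitioning is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of a two-section (ping-pong) buffer channel.
// The interface is an illustrative assumption, not the PASTA programming model.
template <typename T, std::size_t N>
class PingPongChannel {
 public:
  // Producer side: returns a free section, or nullptr if both sections are full.
  std::array<T, N>* try_acquire_for_write() {
    if (full_count_ == 2) return nullptr;
    return &sections_[write_idx_];
  }
  // Producer side: marks the acquired section as full and flips to the other one.
  void release_write() {
    write_idx_ ^= 1;
    ++full_count_;
  }
  // Consumer side: returns the oldest full section, or nullptr if none is full.
  const std::array<T, N>* try_acquire_for_read() const {
    if (full_count_ == 0) return nullptr;
    return &sections_[read_idx_];
  }
  // Consumer side: marks the acquired section as free and flips to the other one.
  void release_read() {
    read_idx_ ^= 1;
    --full_count_;
  }

 private:
  std::array<std::array<T, N>, 2> sections_{};
  std::size_t write_idx_ = 0;
  std::size_t read_idx_ = 0;
  std::size_t full_count_ = 0;  // number of sections currently holding unread data
};

// Steps a producer and a consumer task over `input` in tiles of N elements.
// The producer only acts on steps that are multiples of `producer_period`, and
// the consumer on multiples of `consumer_period` (both must be >= 1), which
// models tasks that stall for different amounts of time.
// Trailing elements that do not fill a whole tile are ignored.
template <std::size_t N>
std::vector<int> reverse_tiles(const std::vector<int>& input,
                               unsigned producer_period,
                               unsigned consumer_period) {
  PingPongChannel<int, N> ch;
  const std::size_t num_tiles = input.size() / N;
  std::vector<int> output;
  std::size_t produced = 0;
  std::size_t consumed = 0;
  for (unsigned step = 0; consumed < num_tiles; ++step) {
    if (produced < num_tiles && step % producer_period == 0) {
      if (auto* buf = ch.try_acquire_for_write()) {
        for (std::size_t i = 0; i < N; ++i) (*buf)[i] = input[produced * N + i];
        ch.release_write();
        ++produced;
      }
    }
    if (step % consumer_period == 0) {
      if (const auto* buf = ch.try_acquire_for_read()) {
        // Random (here: reversed) access within the tile, which a FIFO cannot offer.
        for (std::size_t i = N; i-- > 0;) output.push_back((*buf)[i]);
        ch.release_read();
        ++consumed;
      }
    }
  }
  return output;
}
```

Calling `reverse_tiles<4>(input, 1, 1)` and `reverse_tiles<4>(input, 3, 1)` on the same input yields the same output in this model. That is the sense in which the handshake is latency-insensitive here: each task proceeds only when a section is available to it, so the result does not depend on how long either side stalls.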

More From: ACM Transactions on Reconfigurable Technology and Systems
