Computational Kernels Research Articles

This paper presents GraphAGILE, a domain-specific FPGA-based overlay accelerator for graph neural network (GNN) inference. GraphAGILE consists of (1) a novel unified architecture design with an instruction set, and (2) a compiler built upon the instruction set that can quickly generate optimized code. Because of the proposed instruction set architecture (ISA) and the compiler, GraphAGILE does not require any FPGA reconfiguration when performing inference on various GNN models and input graphs. For the architecture design, we propose a novel hardware module named Adaptive Computation Kernel (ACK) that can execute various computation kernels of GNNs, including general matrix multiplication (GEMM), sparse-dense matrix multiplication (SpDMM), and sampled dense-dense matrix multiplication (SDDMM). The compiler takes the specifications of a GNN model and the graph metadata (e.g., the number of vertices and edges) as input, and generates a sequence of instructions for inference execution. We develop the following compiler optimizations to reduce inference latency: (1) computation order optimization that automatically reorders the computation graph to reduce the total computation complexity, (2) layer fusion that merges adjacent layers to reduce data communication volume, (3) data partitioning with a partition-centric execution scheme that partitions the input graph to fit the available on-chip memory of the FPGA, and (4) kernel mapping that automatically selects the execution mode for ACK and performs task scheduling to overlap computation with data communication and achieve dynamic load balance. We implement GraphAGILE on a state-of-the-art FPGA platform, Xilinx Alveo U250. GraphAGILE can execute widely used GNN models, including GCN, GAT, GIN, GraphSAGE, SGC, and other GNN models supported by GraphGym. Experimental results show that GraphAGILE achieves up to 47.1× (3.9×) reduction in end-to-end latency, including the latency of compilation and hardware execution, compared with the state-of-the-art implementations on CPU (GPU), and achieves up to 2.9× reduction in hardware execution latency compared with the state-of-the-art FPGA accelerators.
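The three kernels named in the abstract can be sketched in a few lines of plain Python to make their differences concrete. This is only an illustration of the mathematical operations, not GraphAGILE's implementation: the function names, the nested-list dense format, and the `(row, col, value)` edge triples are assumptions made for this sketch.

```python
# Illustrative pure-Python versions of GEMM, SpDMM, and SDDMM.
# Data layouts (nested lists, COO-style edge triples) are assumptions for
# this sketch, not the accelerator's on-chip formats.

def gemm(a, b):
    """Dense x dense: C[i][j] = sum_k A[i][k] * B[k][j]."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def spdmm(edges, num_rows, h):
    """Sparse x dense: edges holds the nonzeros (row, col, value) of A.

    Each nonzero A[i][j] adds value * H[j] into output row i, which is the
    neighbor-aggregation step of a GNN layer.
    """
    out = [[0.0] * len(h[0]) for _ in range(num_rows)]
    for i, j, v in edges:
        for f, x in enumerate(h[j]):
            out[i][f] += v * x
    return out


def sddmm(edges, p, q):
    """Sampled dense-dense: evaluate (P @ Q^T)[i][j] only where A is nonzero.

    Returns triples with the same sparsity pattern as A, each scaled by the
    corresponding nonzero of A (as used for per-edge attention scores).
    """
    return [(i, j, v * sum(x * y for x, y in zip(p[i], q[j])))
            for i, j, v in edges]
```

For a three-vertex graph, `spdmm(edges, 3, h)` aggregates neighbor features, `gemm(h, w)` applies a weight matrix, and `sddmm(edges, h, h)` produces one score per edge; only the edges present in the graph are ever touched by the two sparse kernels.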

Support Vector Machines (SVMs) are widely used for classification problems because of their ability to deal effectively with datasets that have complex non-linear structures and high dimensionality. The compute-intensive training algorithm associated with SVMs makes it challenging to keep an up-to-date model that accurately reflects the characteristics of newly arriving data points in real-time systems. This paper proposes a novel training algorithm for incremental learning from large datasets, based on a variant of Sequential Minimal Optimization (SMO). High-Level Synthesis (HLS) was used to implement the Field Programmable Gate Array (FPGA) based Intellectual Property (IP) core, which includes the computationally intensive kernel computation portion of the training algorithm. In addition to the kernel computation, the inference phase of the SVM classifier is built into the IP core, and its use can be switched on the fly. The computational latency and memory bandwidth of the IP core are optimized using loop pipelining and DMA burst data transfer. With the help of hardware/software co-design, the IP core is integrated into the design of a flexible and reusable System on Chip (SoC) called a PYNQ Overlay. The experiments show that the overlay outperforms the embedded processor, multiple hardware SVM classifiers, and hardware-accelerated Convolutional Neural Networks (CNNs) in terms of real-time efficiency. The overlay uses far fewer on-chip resources than the majority of the CNN accelerators. According to experimental results on six datasets, the overlay achieves an average classification accuracy that is only 1% lower than that of an ARM Cortex-A9 processor. Furthermore, it can increase training speed by an average of 31.82x and inference speed by an average of 31.74x. In addition, the proposed overlay design achieves a 2.3x improvement in average training speed, as measured in megabits per second, compared to existing SVM training implementations, along with incremental learning and multi-class classification support.
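To show why one kernel-computation block can serve both training and inference, here is a minimal sketch of a kernel evaluation feeding an SVM decision function. The abstract does not name the kernel function, so the RBF kernel here is an assumption, as are all function and parameter names; this is not the paper's IP core or its SMO variant.

```python
import math

# Hypothetical sketch: an RBF kernel is assumed because the abstract does not
# specify which kernel function the IP core computes.

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def kernel_row(x, data, gamma):
    """Kernel values between one sample and every row of data.

    SMO-style training needs rows like this repeatedly, which is why kernel
    computation dominates the cost and is a natural target for offloading.
    """
    return [rbf_kernel(x, z, gamma) for z in data]


def decision(x, support_vectors, labels, alphas, bias, gamma):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b, reusing kernel_row."""
    k = kernel_row(x, support_vectors, gamma)
    return sum(a * y * kv for a, y, kv in zip(alphas, labels, k)) + bias


def classify(x, support_vectors, labels, alphas, bias, gamma):
    """Binary prediction in {+1, -1} from the sign of the decision value."""
    score = decision(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if score >= 0 else -1
```

Both `decision` and a training loop would call `kernel_row`, so a single accelerated kernel routine can be shared between the two phases, which matches the abstract's description of switching the IP core between training and inference.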

Articles published on Computational Kernels

Automated Buffer Sizing of Dataflow Applications in a High-level Synthesis Workflow

Evaluating Performance Portability with the CMS Heterogeneous Pixel Reconstruction code

Boosting RDataFrame performance with transparent bulk event processing

Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph Transformations

Geometry Optimization: A Comparison of Different Open-Source Geometry Optimizers.

Integrative generalized master equation: A method to study long-timescale biomolecular dynamics via the integrals of memory kernels.

A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU

GraphAGILE: An FPGA-Based Overlay Accelerator for Low-Latency GNN Inference

A High Performance and Robust FPGA Implementation of a Driver State Monitoring Application.

Kernel well-posedness and computation by power series in backstepping output feedback for radially-dependent reaction–diffusion PDEs on multidimensional balls

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

Sensitivity Analysis of the Data Assimilation-Driven Decomposition in Space and Time to Solve PDE-Constrained Optimization Problems

SoC-based real-time SVM classification with integrated training using HLS and PYNQ

On the computation of the robust viability kernel for switched systems

A novel graph transformation strategy for optimizing SpTRSV on CPUs

Towards computational awareness in autonomous robots: an empirical study of computational kernels

Energy-Efficient Parallel Computing: Challenges to Scaling

Grey-level intensity measurements processing by means of Volterra equations and Least Squares Method for Video restoration

Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation

Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors
