GPU Solution Research Articles

Processing large-scale graphs is challenging because the computation produces irregular memory access patterns, and managing such accesses can cause significant performance degradation on both CPUs and GPUs. Recent research therefore proposes accelerating graph processing with Field-Programmable Gate Arrays (FPGAs). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. As a result, data must be repeatedly transferred to and from the FPGA on-chip memory, so that data transfer time dominates computation time. A possible way to overcome the FPGA accelerators' resource limitation is to engage a multi-FPGA distributed architecture and use an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and can use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distributing them to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges. In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, our implementation is the fastest, achieving a speedup of 13x compared to 8x and 3x for state-of-the-art CPU and GPU solutions, respectively. Moreover, in the case of large-scale graphs, the GPU solution fails due to memory limitations, while the CPU solution achieves a speedup of 12x compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multiple FPGAs in a distributed system can further improve the performance by about 12x. This highlights the efficiency of our implementation for large datasets that do not fit in the on-chip memory of a hardware device.
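As a rough illustration of the idea of partitioning a graph into blocks and running PageRank block by block, the following is a minimal CPU-only sketch in Python. It is not the authors' FPGA engine or their partitioning method: the function names `partition_edges` and `pagerank_blocked`, the destination-interval partitioning rule, and the damping factor of 0.85 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def partition_edges(edges, num_vertices, num_blocks):
    """Group edges into blocks by destination-vertex interval.

    Assumption: a simple interval partition, used here only to show how
    edge blocks could be prepared offline before processing.
    """
    size = -(-num_vertices // num_blocks)  # ceiling division
    blocks = defaultdict(list)
    for src, dst in edges:
        blocks[dst // size].append((src, dst))
    return [blocks[b] for b in range(num_blocks)]


def pagerank_blocked(edges, num_vertices, num_blocks=2, damping=0.85, iters=50):
    """Run a fixed number of PageRank iterations over pre-partitioned edge blocks."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    blocks = partition_edges(edges, num_vertices, num_blocks)
    rank = [1.0 / num_vertices] * num_vertices

    for _ in range(iters):
        # Rank held by vertices without out-edges is spread uniformly.
        dangling = sum(rank[v] for v in range(num_vertices) if out_degree[v] == 0)
        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices
        new_rank = [base] * num_vertices
        # Each block only writes to its own destination interval, so in a
        # distributed setting blocks could be handled by different devices.
        for block in blocks:
            for src, dst in block:
                new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank
```

Because every block writes only to its own interval of destination vertices, blocks do not conflict on updates, which is one simple way to obtain the data locality that a partitioning scheme aims for.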


Binocular vision and convolutional neural networks (CNNs) are widely seen in modern intelligent vision processing systems, such as robotics, autonomous vehicles, and AR gadgets. However, both the classic semiglobal matching (SGM) and deep CNNs require substantial computing resources to reach the performance goal. Traditional embedded CPUs and graphics processing units (GPUs) cannot simultaneously meet the processing speed and energy requirements, while specialized circuits dedicated separately to SGM and CNN processing incur considerable hardware and development costs. With the popularity of deep learning, however, neural processing units (NPUs) have become prevalent in many embedded and edge devices, and they possess high-throughput computing power for the matrix operations involved in neural networks. In this work, we attempt to take advantage of the neural processing architectures integrated in SoC chips to accelerate the SGM process, so that the hardware resources are better utilized instead of investing more resources to create specialized SGM components. To this end, this letter first deploys SGM on an NPU by converting the incompatible operations into the neural-computing flow, and a configurable neural processing element is proposed to flexibly support various vector operation sequences. Then, a hybrid dataflow scheduler and the corresponding hardware modification are introduced to accelerate the cost processing, improving hardware utilization and on-chip memory footprint and access. Our solution runs at 45 fps for an image size of $640\times 480$, with 128 disparity levels. The speed-energy efficiency is $52\times$ better than the GPU (Jetson TX1) solution with negligible additional hardware overhead and accuracy loss.


Articles published on GPU Solution

3D Modelling for the Hajj and Umrah Pilgrims

MIMO-SA-Based 3-D Image Reconstruction of Targets Under Illumination of Terahertz Gaussian Beam—Theory and Experiment

Distributed large-scale graph processing on FPGAs

Dadu-SV: Accelerate Stereo Vision Processing on NPU

Remarks on the numerical approximation of Dirac delta functions

Study of the accuracy and applicability of the difference scheme for solving the diffusion-convection problem at large grid Péclet numbers

Using analysis information in the synchronization‐free GPU solution of sparse triangular systems

Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA

Embedded GPU 3D Panoramic Viewing System Based on Virtual Camera Roaming 3D Environment

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

Accelerating Persistent Scatterer Pixel Selection for InSAR Processing

Chunked Bounding Volume Hierarchies for Fast Digital Prototyping Using Volumetric Meshes.

ODoST

A novel heterogeneous algorithm to simulate multiphase flow in porous media on multicore CPU–GPU systems

Solving the Caputo Fractional Reaction-Diffusion Equation on GPU

Parallelization Research of SapTis-Software of Multi-Field Simulation and Nonlinear Analysis of Complex Structures

Perturbation Functions in Computer Graphics

Real-time, fast radio transient searches with GPU de-dispersion

