Abstract

In this manuscript, variants of a Jacobi solver implementation on general-purpose graphics processing units (GPGPUs) are proposed and compared. A parallel implementation of the finite element method (FEM) for Poisson's equation was profiled on shared-memory architectures as well as on GPGPUs to identify the computationally most expensive part of the FEM software, which is the linear-algebra Jacobi solver. Sparse matrices were used to store the systems of linear equations. Nine implementations of the Jacobi solver were developed and compared using various synchronization and computation methods: atomicAdd, atomicAdd_block, butterfly communication, grid synchronization, hybrid, and whole-GPU-based computation. Experiments showed that Jacobi implementations based on our butterfly communication method outperformed the CUDA 10.0 synchronization methods atomicAdd, atomicAdd_block, and grid synchronization. The GPU achieved a maximum speedup of 46 times on a GTX 1060 and 60 times on a Quadro P4000 with double-precision computations when compared with a sequential implementation on a Core i7-8750H. All development was performed with the GNU C/C++ compiler 7.3.0 on Ubuntu 18.04 with CUDA 10.0.

Highlights

  • High Performance Computing refers to the branch of computer science concerned with solving large and highly complex problems in science, engineering, and business. In High Performance Computing, many-core processors have gained more popularity than multi-core CPUs

  • Until 2006, it was very challenging for programmers to write programs for early graphics chips through a high-level programming interface, as the underlying code had to fit into APIs intended for painting graphics

  • The majority of computations in a finite element method (FEM) solver are of single-instruction, multiple-data flavor, which is why they are well suited to many-core architectures


Summary

INTRODUCTION

High Performance Computing refers to the branch of computer science concerned with solving large and highly complex problems in science, engineering, and business. GPUs have evolved into massively parallel, many-threaded multi-core units that support highly efficient computation on large blocks of data with high memory bandwidth. The limited size of on-chip memory, 48 KB of shared memory per block on devices of compute capability 6.x or above, is the main hurdle in utilizing registers or shared memory. This memory is organized into 32 banks, which serve the 32 threads of one warp concurrently. NVIDIA provides the barrier synchronization method __syncthreads() for block-level coordination (M. Aslam et al.: Performance Comparison of GPU-Based Jacobi Solvers Using CUDA Provided Synchronization Methods, FIGURE 2). When this method is called in a kernel, all threads of the block must wait at that point until every thread in the block has reached it. The solver is first implemented for a multi-core shared-memory processor and then on GPGPUs.
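The iteration that all nine GPU variants parallelize is the classical Jacobi sweep. A minimal sequential C++ sketch is shown below for reference; it uses a dense matrix for clarity (the paper stores the systems sparsely), and the function names and tolerances are illustrative, not taken from the paper's code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i].
// Returns the max-norm difference between successive iterates.
double jacobi_sweep(const std::vector<std::vector<double>>& A,
                    const std::vector<double>& b,
                    const std::vector<double>& x,
                    std::vector<double>& x_new) {
    double diff = 0.0;
    for (std::size_t i = 0; i < A.size(); ++i) {
        double sigma = 0.0;
        for (std::size_t j = 0; j < A.size(); ++j)
            if (j != i) sigma += A[i][j] * x[j];
        x_new[i] = (b[i] - sigma) / A[i][i];
        diff = std::max(diff, std::fabs(x_new[i] - x[i]));
    }
    return diff;
}

// Repeat sweeps until the max-norm update falls below tol.
// Converges for strictly diagonally dominant A, as arises from FEM
// discretizations of Poisson's equation.
std::vector<double> jacobi_solve(const std::vector<std::vector<double>>& A,
                                 const std::vector<double>& b,
                                 double tol = 1e-10, int max_iter = 10000) {
    std::vector<double> x(b.size(), 0.0), x_new(b.size(), 0.0);
    for (int k = 0; k < max_iter; ++k) {
        double diff = jacobi_sweep(A, b, x, x_new);
        x.swap(x_new);
        if (diff < tol) break;
    }
    return x;
}
```

Because each `x_new[i]` depends only on the previous iterate `x`, the outer loop over `i` is embarrassingly parallel; the synchronization the paper studies is needed only at the end of each sweep, before the next one begins.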
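The abstract reports that the butterfly communication variants outperformed the atomicAdd- and grid-synchronization-based ones. The paper's kernel code is not reproduced here, but the general butterfly (XOR-partner) exchange pattern it refers to can be sketched in plain C++ as follows; the round structure simulates what each GPU thread would do, and all names are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Butterfly (XOR-partner) all-reduce over n values, n a power of two.
// In round r, slot i pairs with slot i ^ (1 << r) and both keep the sum;
// after log2(n) rounds every slot holds the total, so no slot needs a
// global atomic counter or a full-device barrier to obtain the result.
std::vector<double> butterfly_allreduce(std::vector<double> v) {
    const std::size_t n = v.size();  // assumed to be a power of two
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        std::vector<double> next(n);
        for (std::size_t i = 0; i < n; ++i)
            next[i] = v[i] + v[i ^ stride];  // exchange with XOR partner
        v.swap(next);
    }
    return v;
}
```

In a CUDA kernel the per-round exchange would go through shared memory or warp shuffles with a barrier between rounds, which avoids serializing all threads on a single atomically updated location.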

BACKGROUND
GPGPU BASED PARALLEL JACOBI SOLVER FORMULATION
EXPERIMENTS AND RESULTS

