Abstract

Graphics Processing Units (GPUs) programmed with the Compute Unified Device Architecture (CUDA) are rapidly becoming a major choice in high-performance computing, and the number of applications ported to the CUDA platform is growing accordingly. The Message Passing Interface (MPI) has been the standard of high-performance computing for more than a decade and has proven its capability to deliver high performance in parallel applications. CUDA and MPI use different programming approaches, but both depend on the inherent parallelism of the application to be effective. However, comparatively little research has evaluated the performance obtained when CUDA is integrated with other parallel programming paradigms. This paper investigates how the capabilities of the two approaches can be combined to achieve superior performance in general-purpose applications. We have experimented with a CUDA+MPI programming approach on two well-known algorithms, Strassen's matrix multiplication algorithm and the Conjugate Gradient algorithm, and show how higher performance can be achieved by using MPI as the computation-distributing mechanism and CUDA as the main execution engine. In this approach, MPI distributes data among the GPU nodes while CUDA performs the computation, which allows GPU nodes to be connected via high-speed Ethernet without special technologies. The programmer can thus treat each GPU node separately and execute different components of a program on several GPU nodes.
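To illustrate why Strassen's algorithm suits the distribution scheme described above, the sketch below (a hypothetical pure-Python version, not the authors' CUDA+MPI code) shows how one level of the recursion splits a matrix product into seven independent sub-multiplications M1..M7; in the paper's approach, each such sub-problem could be sent over MPI to a different GPU node and executed there with CUDA. The sketch assumes square matrices whose dimension is a power of two.

```python
# Hypothetical sketch of Strassen's algorithm: each of the seven
# recursive products M1..M7 is independent, so a coordinator could
# ship them to separate GPU nodes via MPI and let CUDA kernels do
# the actual multiplication.  Here everything runs locally.

def add(A, B):
    """Element-wise matrix addition."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    """Element-wise matrix subtraction."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply square matrices A and B (dimension a power of 2)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split each operand into four h x h quadrants.
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # Seven independent products -- the distributable work units.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine the quadrants of the result.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    top = [c11 + c12 for c11, c12 in zip(C11, C12)]
    bot = [c21 + c22 for c21, c22 in zip(C21, C22)]
    return top + bot
```

Because the seven products touch disjoint combinations of the input quadrants, the only communication needed per node is one pair of h-by-h operands in and one h-by-h result out, which matches the paper's use of commodity Ethernet between GPU nodes.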
