Abstract

A mathematical model of one-dimensional heat conduction has been developed and implemented in software. The purpose of the simulation is to compare the performance of the algorithms on central and graphics processors. The parallelization problem is topical: in 2015 the most powerful video card had 2816 stream processors, while by 2021 video cards with 10 496 stream processors were available. Applications running on NVIDIA GPUs deliver greater performance per dollar invested and per watt consumed than implementations built on central processors alone. This is confirmed by the high demand for video cards among miners, which has currently driven their prices up by a factor of 1.5-2.5. The hardware and software requirements for starting the simulation are given. Three finite-difference approximation methods are implemented on the central and graphics processors: explicit, implicit, and Crank-Nicolson. The chosen programming languages are C (CPU) and CUDA C (GPU). For a well-parallelized task, in which each thread runs independently and needs no data from other threads, computations on the video card were accelerated by up to 60 times (an entry-level video card was used). The CUDA C language appeared relatively recently, in 2006, and has a number of peculiarities when implementing a parallel algorithm. The selected schemes (explicit, implicit, and Crank-Nicolson) require access to neighboring threads and thread synchronization at each iteration. Synchronization is organized so that at each iteration all threads wait for the slowest one, so problems solved by finite-difference approximation run more slowly. A fragment of GPU code implementing the Crank-Nicolson scheme is presented.
Implementing the Crank-Nicolson scheme requires fast shared memory for data exchange between threads. The amount of shared memory is limited, which constrains the number of grid cells. The use of graphics cards gave a significant increase in execution speed even on an entry-level card with 384 stream processors. The article presents a comparative analysis of computation speed for grid sizes from 1024 to 4000, as well as for different amounts of computation per thread.
