A hybrid MPI-CUDA approach for nonequispaced discrete Fourier transformation

Sheng-Chun Yang, Yong-Lei Wang

doi:10.1016/j.cpc.2020.107513

Abstract

Nonequispaced discrete Fourier transformation (NDFT) is widely applied across computational science and engineering. The computational efficiency and accuracy of NDFT have long been critical issues that hinder its wider application in scientific computing, both in intensive and in extensive aspects. In our previous work, Yang et al. (2018), a CUNFFT method was proposed, and it showed outstanding performance in handling NDFT at intermediate scale based on CUDA (Compute Unified Device Architecture) technology. In the current work, we further improve the computational efficiency of the CUNFFT method using an efficient MPI-CUDA hybrid parallelization (HP) scheme of NFFT to treat NDFT at super extended scale. Within this HP-NFFT method, the spatial domain of NDFT is decomposed into several parts according to the accumulative feature of NDFT and the number of CPU and GPU nodes. These decomposed NDFT subcells are calculated independently on different CPU nodes using an MPI process-level parallelization mode, and on different GPU nodes using a CUDA thread-level parallelization mode and the CUNFFT algorithm. Extensive benchmarking of the HP-NFFT method indicates that it yields a dramatic improvement in computational efficiency for handling NDFT at super extended scale without loss of computational precision. Furthermore, the HP-NFFT method is validated by calculating the Madelung constant of the fluorite crystal structure, which verifies that the method is robust for calculating electrostatic interactions between charged ions in molecular dynamics simulation systems.
Program summary

Program title: HP-NFFT
CPC Library link to program files: http://dx.doi.org/10.17632/ys2y92jkwy.1
Licensing provisions: GNU General Public License 3
Programming language: MPI, C, and CUDA C
Supplementary material: The program is designed for efficient computation of large-scale nonequispaced discrete Fourier transformation (NDFT) and runs on computers equipped with NVIDIA GPUs. It has been tested on (a) a single computer node with an Intel(R) Core(TM) i7-3770 @ 3.40 GHz (CPU) and a GTX 980 Ti (GPU), and (b) MPI parallel computer nodes with the same configuration.
Nature of problem: In many domains of computational physics, NDFT computation is extremely time-consuming because FFT cannot be applied directly, and this often seriously degrades the computational efficiency of the whole system. Although the GPU-based parallel method CUNFFT achieved a qualitative leap over previous methods in NDFT computation, its computational capability is limited by the throughput capacity of the GPU when the NDFT system is large enough.
Solution method: We constructed a hybrid parallel architecture in which CPU and GPU are combined to accelerate the NDFT computation. First, the NDFT system is divided into several subcells via a domain-decomposition method. Then MPI (Message Passing Interface) is used to implement the CPU-parallel computation, with each computer node corresponding to a particular subcell, and the subcell on each computer node is in turn computed in parallel on the GPU. In this hybrid parallel method, the most critical technical problem is how to parallelize CUNFFT within the parallel strategy, which is resolved through a careful study of the underlying principles and several algorithmic techniques.
Restrictions: HP-NFFT is mainly oriented to large-scale NDFT computations, for which it has shown significant computational efficiency. However, for a small NDFT system containing fewer than 10^6 particles, the mode with multiple computer nodes has no apparent efficiency advantage over the mode with a single computer node, and may even be less efficient, because of the serious network delay.

Full Text