FFT data distribution in plane-waves DFT codes. A case study from Quantum ESPRESSO

Fabio Affinito, Carlo Cavazzoni

doi:10.1145/2966884.2966892

Abstract

Density Functional Theory (DFT) calculations with plane waves and pseudopotentials are among the most important simulation techniques in high performance computing. Together with parallel linear algebra (ZGEMM and matrix diagonalization), the main bottleneck is the Fast Fourier Transform (FFT), required, for example, when the local potential is applied to the wavefunction. In these calculations, the cutoff on the plane waves results in a spherical domain for the FFT. After a 1D FFT is performed on pencils distributed among processors, the data is transposed with an MPI_Alltoall and a 2D FFT is executed [2]. Typically, the computational workload of the FFT is not particularly high, since grid sizes do not exceed $(10^2$–$10^3)^3$. However, the load distribution is crucial, and the resulting impact of collective communications becomes a critical factor in achieving high parallel efficiency. Quantum ESPRESSO [3] is one of the most widely used plane-wave DFT codes in the materials science community. It has been successfully ported to and optimized for a large number of HPC infrastructures all over the world. The parallel structure of Quantum ESPRESSO is mainly based on several layers of MPI communicators, plus a finer-grained OpenMP parallelization. Recently, the parallelization structure of the FFT was deeply refactored. The combination of two different data distributions, i.e. bands and taskgroups, allows the underlying hardware to be filled hierarchically and two different layers of communication to be tuned. In particular, with sufficient memory, by tuning the number of taskgroups one can fit all the data required to perform a single 3D FFT, reducing the impact of the MPI_Alltoall between the 1D and 2D FFTs. In order to better check the results of the parametrization of the parallel distributions, a miniapp [1] containing only the FFT kernel was extracted from the Quantum ESPRESSO distribution. 
This miniapp is also important for future co-design activities targeting novel architectures. We present and discuss the profiling data obtained from the QE-FFT miniapp and the impact that the choice of parallelization parameters has on the communication pattern.
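
As a rough illustration of the scheme described above (1D FFTs on distributed pencils, an all-to-all transpose, then 2D FFTs), the following NumPy sketch emulates the data movement serially. It is not taken from Quantum ESPRESSO or the QE-FFT miniapp: the function name `fft3d_pencils_then_planes`, the `nproc` parameter, and the use of a full rectangular grid (no spherical cutoff, no bands or taskgroups) are assumptions made for illustration, and array slicing stands in for MPI_Alltoall.

```python
import numpy as np


def fft3d_pencils_then_planes(grid, nproc):
    """Serial emulation of a distributed 3D FFT (illustrative sketch only).

    grid  : complex array of shape (nx, ny, nz)
    nproc : number of hypothetical MPI ranks (assumed <= nz here)
    """
    nx, ny, nz = grid.shape

    # Step 1: view the grid as nx*ny pencils along z and split them
    # among the hypothetical ranks; each rank runs 1D FFTs on its pencils.
    pencils = grid.reshape(nx * ny, nz)
    per_rank_pencils = np.array_split(pencils, nproc, axis=0)
    per_rank_pencils = [np.fft.fft(chunk, axis=1) for chunk in per_rank_pencils]

    # Step 2: emulate the all-to-all transpose. Every receiving rank owns
    # a slab of z indices and collects that slab from every sender's pencils.
    z_slabs = np.array_split(np.arange(nz), nproc)
    per_rank_planes = []
    for z_idx in z_slabs:
        received = np.concatenate(
            [chunk[:, z_idx] for chunk in per_rank_pencils], axis=0
        )
        per_rank_planes.append(received.reshape(nx, ny, len(z_idx)))

    # Step 3: each rank runs 2D FFTs over the xy-planes it now holds.
    per_rank_planes = [np.fft.fft2(p, axes=(0, 1)) for p in per_rank_planes]

    # Reassemble the slabs only so the result can be compared with a
    # reference transform; a distributed code would keep them separate.
    return np.concatenate(per_rank_planes, axis=2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (6, 5, 8)
    data = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    result = fft3d_pencils_then_planes(data, nproc=4)
    print("matches np.fft.fftn:", np.allclose(result, np.fft.fftn(data)))
```

In this toy version every pencil has the same length, so the split is trivially balanced. With a plane-wave cutoff the pencils inside the sphere carry different numbers of coefficients, which is why the load distribution and the cost of the transpose step matter in practice.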

Full Text