Abstract

We present a GPU-accelerated fast multipole method (FMM) called BLDTT, which uses barycentric Lagrange interpolation for the near-field and far-field approximations, and dual tree traversal to construct the interaction lists. The scheme replaces well-separated particle-particle interactions with adaptively chosen particle-cluster, cluster-particle, and cluster-cluster approximations given by barycentric Lagrange interpolation on a Chebyshev grid of proxy particles in each cluster. The BLDTT employs FMM-type upward and downward passes, although here they are adapted to interlevel polynomial interpolation. The BLDTT is kernel-independent, and the approximations have a direct sum form that maps efficiently onto GPUs, where targets provide an outer level of parallelism and sources provide an inner level of parallelism. The code uses OpenACC directives for GPU acceleration and MPI remote memory access for distributed memory parallelization. Computations are presented for different particle distributions, domains, and interaction kernels, and for unequal numbers of targets and sources. The BLDTT consistently outperforms our earlier particle-cluster barycentric Lagrange treecode (BLTC). On a single GPU for problem sizes ranging from N=1E5 to 1E8, the BLTC scales like O(N log N) and the BLDTT scales like O(N). We also present MPI strong scaling results for the BLDTT and BLTC with N=64E6 particles running on 1 to 32 GPUs.
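To illustrate the building block the abstract refers to, the following is a minimal, self-contained sketch of barycentric Lagrange interpolation at Chebyshev points in one dimension. It is not the BLDTT code itself (which applies this idea to cluster proxy particles in three dimensions); the function names and the Runge-function test are illustrative choices.

```python
import numpy as np

def chebyshev_points(n):
    # Chebyshev points of the second kind on [-1, 1]
    return np.cos(np.pi * np.arange(n + 1) / n)

def barycentric_weights(n):
    # Closed-form barycentric weights for Chebyshev points of the
    # second kind: w_j = (-1)^j, halved at the two endpoints
    w = (-1.0) ** np.arange(n + 1)
    w[0] *= 0.5
    w[-1] *= 0.5
    return w

def barycentric_interpolate(x_nodes, f_nodes, w, x):
    # Evaluate the barycentric form of the Lagrange interpolant at
    # the points x; coincidence with a node is handled exactly
    diff = x[:, None] - x_nodes[None, :]
    exact = diff == 0.0
    diff[exact] = 1.0          # avoid division by zero; fixed below
    terms = w / diff
    p = (terms @ f_nodes) / terms.sum(axis=1)
    rows, cols = np.nonzero(exact)
    p[rows] = f_nodes[cols]    # return nodal values at the nodes
    return p

# Example: interpolate the Runge function on a degree-32 Chebyshev grid
n = 32
xn = chebyshev_points(n)
w = barycentric_weights(n)
f = lambda t: 1.0 / (1.0 + 25.0 * t * t)
x = np.linspace(-1.0, 1.0, 201)
err = np.max(np.abs(barycentric_interpolate(xn, f(xn), w, x) - f(x)))
print(err)
```

The barycentric form costs O(n) per evaluation point once the weights are known, and for Chebyshev grids the weights have the closed form above, which is what makes the proxy-particle approximations cheap to apply on a GPU.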
