Abstract

We present a GPU-accelerated fast multipole method (FMM) called BLDTT, which uses barycentric Lagrange interpolation for the near-field and far-field approximations, and dual tree traversal to construct the interaction lists. The scheme replaces well-separated particle-particle interactions with adaptively chosen particle-cluster, cluster-particle, and cluster-cluster approximations given by barycentric Lagrange interpolation on a Chebyshev grid of proxy particles in each cluster. The BLDTT employs FMM-type upward and downward passes, although here they are adapted to interlevel polynomial interpolation. The BLDTT is kernel-independent, and the approximations have a direct sum form that maps efficiently onto GPUs, where targets provide an outer level of parallelism and sources provide an inner level of parallelism. The code uses OpenACC directives for GPU acceleration and MPI remote memory access for distributed memory parallelization. Computations are presented for different particle distributions, domains, and interaction kernels, and for unequal numbers of targets and sources. The BLDTT consistently outperforms our earlier particle-cluster barycentric Lagrange treecode (BLTC). On a single GPU for problem sizes ranging from N=1E5 to 1E8, the BLTC scales like O(N log N) and the BLDTT scales like O(N). We also present MPI strong scaling results for the BLDTT and BLTC with N=64E6 particles running on 1 to 32 GPUs.
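To illustrate the building block the abstract refers to, the following is a minimal, self-contained sketch of barycentric Lagrange interpolation at Chebyshev points in one dimension. It is not the BLDTT code itself (which applies this idea to cluster proxy particles in three dimensions); the function names and the Runge-function test are illustrative choices.

```python
import numpy as np

def chebyshev_points(n):
    # Chebyshev points of the second kind on [-1, 1]
    return np.cos(np.pi * np.arange(n + 1) / n)

def barycentric_weights(n):
    # Closed-form barycentric weights for Chebyshev points of the
    # second kind: w_j = (-1)^j, halved at the two endpoints
    w = (-1.0) ** np.arange(n + 1)
    w[0] *= 0.5
    w[-1] *= 0.5
    return w

def barycentric_interpolate(x_nodes, f_nodes, w, x):
    # Evaluate the barycentric form of the Lagrange interpolant at
    # the points x; coincidence with a node is handled exactly
    diff = x[:, None] - x_nodes[None, :]
    exact = diff == 0.0
    diff[exact] = 1.0          # avoid division by zero; fixed below
    terms = w / diff
    p = (terms @ f_nodes) / terms.sum(axis=1)
    rows, cols = np.nonzero(exact)
    p[rows] = f_nodes[cols]    # return nodal values at the nodes
    return p

# Example: interpolate the Runge function on a degree-32 Chebyshev grid
n = 32
xn = chebyshev_points(n)
w = barycentric_weights(n)
f = lambda t: 1.0 / (1.0 + 25.0 * t * t)
x = np.linspace(-1.0, 1.0, 201)
err = np.max(np.abs(barycentric_interpolate(xn, f(xn), w, x) - f(x)))
print(err)
```

The barycentric form costs O(n) per evaluation point once the weights are known, and for Chebyshev grids the weights have the closed form above, which is what makes the proxy-particle approximations cheap to apply on a GPU.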
