A CUDA Fast Multipole Method with Highly Efficient M2L Far Field Evaluation

Bartosz Kohnke, Carsten Kutzner, Helmut Grubmüller

doi:10.1016/j.bpj.2020.11.1234

Abstract

Solving an N-body problem is a computationally demanding task in many scientific fields, ranging from astrophysics to biomolecular simulations. The direct solution scales as O(N^2), so it becomes impractical even on modern hardware for moderate numbers of particles, and efficient yet accurate approximations are key. For biomolecular simulations, one widely used method is Particle Mesh Ewald (PME), which scales as O(N log N). Although extremely fast on a single node, PME runs into communication bottlenecks when parallelized for large simulation systems on many nodes. The fast multipole method (FMM) offers an attractive alternative: it requires less communication and reduces the complexity to the optimal O(N). The method approximates long-range interactions by grouping the particles into clusters represented as multipoles. The cluster size grows with the interaction distance according to the underlying octree structure. Hence, particles that are farther apart require fewer interaction computations and therefore less communication. Here, we present our full NVIDIA CUDA FMM implementation, which has been optimized for the electrostatic interactions described by Coulomb's law that are relevant to molecular dynamics simulations. We compare three parallelization approaches for the computationally limiting part of the algorithm, the Multipole-to-Local (M2L) operator, and discuss their performance bottlenecks. The first approach can be implemented with only minimal modifications to the sequential CPU implementation. It uses the Unified Memory concept, which allows simple reuse of the existing CPU data structures. The second approach improves on this performance by exploiting CUDA Dynamic Parallelism, which yields a significant speedup, especially for high accuracy requirements. The third approach abstracts the underlying octree with precomputed interaction lists and exploits operator symmetries to achieve nearly optimal performance across the whole tested accuracy range.
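To make the idea of a precomputed interaction list concrete, the following is a minimal Python sketch and not the authors' CUDA implementation. It assumes a uniform octree, open (non-periodic) boundaries, and the common separation criterion in which a box interacts via M2L with the children of its parent's neighbors that are not its own neighbors. The function names (neighbors, interaction_list, build_interaction_lists) are hypothetical and chosen only for illustration.

```python
from itertools import product


def neighbors(box, depth):
    """Boxes at the given depth that touch `box` (including `box` itself).

    Assumption: open boundaries, so boxes outside the domain are dropped.
    """
    n = 2 ** depth  # number of boxes per dimension at this depth
    x, y, z = box
    return {
        (x + dx, y + dy, z + dz)
        for dx, dy, dz in product((-1, 0, 1), repeat=3)
        if 0 <= x + dx < n and 0 <= y + dy < n and 0 <= z + dz < n
    }


def interaction_list(box, depth):
    """M2L sources for `box`: children of the parent's neighbors
    that are not direct neighbors of `box`."""
    parent = tuple(c // 2 for c in box)
    near = neighbors(box, depth)
    candidates = {
        (2 * px + i, 2 * py + j, 2 * pz + k)
        for px, py, pz in neighbors(parent, depth - 1)
        for i, j, k in product((0, 1), repeat=3)
    }
    return sorted(candidates - near)


def build_interaction_lists(depth):
    """Precompute the interaction list of every box at one tree depth."""
    n = 2 ** depth
    return {
        box: interaction_list(box, depth)
        for box in product(range(n), repeat=3)
    }


if __name__ == "__main__":
    lists = build_interaction_lists(3)
    print(len(lists[(3, 3, 3)]))  # interior box
    print(len(lists[(0, 0, 0)]))  # corner box
```

Under these assumptions an interior box has 6^3 - 3^3 = 189 M2L partners, and the relation is symmetric: if box b is in the list of box a, then a is in the list of b. Once such lists exist as flat arrays, the far-field pass can iterate over them without traversing the tree, which is the kind of abstraction the third approach describes.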

Full Text