The N-body simulations consist of computing mutual gravitational forces exerted on each body in O ( N ) . The Barnes–Hut approximation allows processing a group of bodies in O ( 1 ) if they are far enough from a given body, which drops the complexity of the whole simulation to O ( N L o g N ) . The octree is used to ease the pruning process but at the cost of some irregularity in the access pattern. In a parallel N-body implementation the bodies are partitioned among threads that are executed on multiple cores. The depth-first traversal of the octree is used for processing each body, which causes repeated cache misses during traversal. This paper proposes different types of tiling methods to improve the performance of N-body simulations. It presents an experimental analysis of octree traversal by using these tiling methods to identify the potential of cache data reuse. It then evaluates these tiling methods for varying tile sizes with different galaxy sizes and a varying number of threads on several machine architectures. The efficiency of tiling approaches depends on the chosen tile size. It is shown that a speedup of 8 times can be achieved by choosing the appropriate tile size on a 60-core Intel accelerator. In order to determine appropriate tile size, the paper proposes an adaptive tiling approach to implicitly adapt the tile size to the distribution of threads, the cache capacity, cache latency, problem size and dynamic changes in the access pattern over the iterations. The proposed adaptive tiling approach can be used as an optimization option in parallel compilers. • Proposes tiling methods to improve cache data reuse for Barnes-Hut simulation. • Provides experimental analysis of octree traversal with proposed tiling approaches. • Evaluates different galaxy sizes and number of threads on several architectures. • A hybrid tiling approach with a suitable tile size provides better performance. • Proposes an adaptive tiling approach to identify an optimal tile size.
Read full abstract