Abstract

The introduction of accelerator devices such as graphics processing units (GPUs) has had a profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these benefits, it has been necessary to reformulate some of the most fundamental algorithms, including the Verlet list, pair searching, and cutoffs. Here, we present the heterogeneous parallelization and acceleration design of molecular dynamics implemented in the GROMACS codebase over the last decade. The setup involves a general cluster-based approach to pair lists and non-bonded pair interactions that utilizes both GPU and central processing unit (CPU) single instruction, multiple data (SIMD) acceleration efficiently, including the ability to load-balance tasks between CPUs and GPUs. The algorithm work efficiency is tuned for each type of hardware, and to use accelerators more efficiently, we introduce dual pair lists with rolling pruning updates. Combined with new direct GPU-GPU communication and GPU integration, this enables excellent performance from single-GPU simulations through strong scaling across multiple GPUs and efficient multi-node parallelization.
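The dual pair list with rolling pruning mentioned in the abstract can be illustrated with a short sketch: an outer list is built infrequently with a longer cutoff, and a cheaper pruning pass regenerates the inner interaction list from it every few steps. The code below is a minimal C++ sketch of that idea under simplifying assumptions (atom pairs rather than the atom clusters GROMACS actually uses, and a brute-force O(N^2) build instead of spatial gridding); all names are hypothetical, not the GROMACS API.

// Minimal sketch of a dual pair list with periodic pruning. Hypothetical
// names; GROMACS operates on atom clusters, not individual atom pairs.
#include <vector>

struct Vec3 { float x, y, z; };
struct Pair { int i, j; };

static float dist2(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Outer list: built rarely, with the longer cutoff rOuter, so it remains
// valid while atoms diffuse (brute force here for brevity).
std::vector<Pair> buildOuterList(const std::vector<Vec3>& x, float rOuter) {
    std::vector<Pair> list;
    for (int i = 0; i < (int)x.size(); ++i)
        for (int j = i + 1; j < (int)x.size(); ++j)
            if (dist2(x[i], x[j]) < rOuter * rOuter) list.push_back({i, j});
    return list;
}

// Inner list: re-pruned from the outer list every few steps with the
// shorter cutoff rInner; far cheaper than a full pair search.
std::vector<Pair> pruneList(const std::vector<Pair>& outer,
                            const std::vector<Vec3>& x, float rInner) {
    std::vector<Pair> inner;
    for (const Pair& p : outer)
        if (dist2(x[p.i], x[p.j]) < rInner * rInner) inner.push_back(p);
    return inner;
}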

Highlights

  • Molecular dynamics (MD) simulation has had tremendous success in a number of application areas in the past two decades, in part due to hardware improvements that have enabled studies of systems and timescales that were previously not feasible

  • The original Message Passing Interface (MPI)- and Parallel Virtual Machine (PVM)-based scaling was less impressive, but in version 4.0 [8] this was replaced with a state-of-the-art neutral-territory domain decomposition [27] combined with fully flexible 3D dynamic load balancing (DLB) of triclinic domains. This is combined with a high-level task decomposition that dedicates a subset of MPI ranks to long-range particle mesh Ewald (PME) electrostatics to reduce the cost of collective communication required by the 3D FFTs, which amounts to multiple-program, multiple-data (MPMD) parallelization (see the MPI sketch after this list)

  • On the central processing unit (CPU) front, SIMD parallelism is used for most major time-consuming parts of the code. This was necessitated by Amdahl’s law: as the performance of non-bonded kernels and PME improved, previously insignificant components such as integration turned into new bottlenecks. This was made fully portable by the introduction of the GROMACS SIMD abstraction layer, which started as the replacement of raw assembly with intrinsics and supports a range of CPU architectures using 14 different SIMD instruction sets [28], with additional ones in development (see the SIMD sketch after this list)
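To make the MPMD task decomposition in the second highlight concrete, the sketch below splits MPI_COMM_WORLD into a dedicated PME group and a particle-particle (PP) group with MPI_Comm_split, so that the 3D-FFT collectives only span the small PME group. The 1:4 ratio and the loop names are illustrative assumptions; GROMACS selects the PME rank count automatically, and this is not its actual code.

// Hedged sketch: dedicating a subset of MPI ranks to long-range PME work
// (multiple-program, multiple-data). Ratio and names are illustrative.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Reserve roughly a quarter of the ranks for PME; the rest do
    // short-range particle-particle (PP) work.
    int numPmeRanks = size / 4 > 0 ? size / 4 : 1;
    int isPme = rank < numPmeRanks ? 1 : 0;

    // Split into two communicators so the collective communication of the
    // 3D FFTs involves only the (small) PME group.
    MPI_Comm groupComm;
    MPI_Comm_split(MPI_COMM_WORLD, isPme, rank, &groupComm);

    std::printf("rank %d runs the %s loop\n", rank, isPme ? "PME" : "PP");
    // if (isPme) pmeLoop(groupComm);  // hypothetical: spread, 3D FFT, solve
    // else       ppLoop(groupComm);   // hypothetical: pair + bonded forces

    MPI_Comm_free(&groupComm);
    MPI_Finalize();
    return 0;
}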
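The SIMD abstraction layer in the third highlight follows a common pattern: kernels are written once against a portable vector type whose implementation is chosen per instruction set at compile time. The sketch below shows that pattern for just AVX plus a scalar fallback; the type and function names are hypothetical and far simpler than the real GROMACS layer.

// Hedged sketch of a SIMD abstraction layer: one portable vector type,
// implemented per instruction set behind preprocessor checks.
#if defined(__AVX__)
#include <immintrin.h>
struct SimdFloat { __m256 v; static constexpr int width = 8; };
static inline SimdFloat simdLoad(const float* p) { return {_mm256_loadu_ps(p)}; }
static inline SimdFloat simdAdd(SimdFloat a, SimdFloat b) { return {_mm256_add_ps(a.v, b.v)}; }
static inline void simdStore(float* p, SimdFloat a) { _mm256_storeu_ps(p, a.v); }
#else
// Scalar fallback keeps the same interface, so kernels compile everywhere.
struct SimdFloat { float v; static constexpr int width = 1; };
static inline SimdFloat simdLoad(const float* p) { return {*p}; }
static inline SimdFloat simdAdd(SimdFloat a, SimdFloat b) { return {a.v + b.v}; }
static inline void simdStore(float* p, SimdFloat a) { *p = a.v; }
#endif

// A kernel written once against the abstract interface runs at full vector
// width on any supported instruction set (remainder loop omitted).
void addArrays(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i + SimdFloat::width <= n; i += SimdFloat::width)
        simdStore(out + i, simdAdd(simdLoad(a + i), simdLoad(b + i)));
}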


INTRODUCTION

Molecular dynamics (MD) simulation has had tremendous success in a number of application areas in the past two decades, in part due to hardware improvements that have enabled studies of systems and timescales that were previously not feasible. By employing state-of-the-art algorithms and efficient parallel implementations, the GROMACS code is able to target hardware and efficiently parallelize from the lowest level of SIMD (single instruction, multiple data) vector units to multiple cores and caches, accelerators, and distributed-memory HPC resources. We believe that this approach makes great use of limited compute resources to improve research productivity, and it is increasingly enabling higher absolute performance on any given resource. While there has been some convergence of architectures, the difference between latency- and throughput-optimized functional units is fundamental, and utilizing each of them for the tasks at which they are best suited requires heterogeneous parallelization. This typically employs the CPU for scheduling work, transferring data, and launching computation on the accelerator, as well as inter- and intra-node communication (a sketch of this pattern follows below).
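As an illustration of that division of labor, the sketch below has the CPU enqueue a host-to-device copy and a placeholder non-bonded kernel on a CUDA stream, overlap independent CPU work with the GPU computation, and synchronize only when the forces are needed. This is a minimal sketch of the generic offload pattern, not GROMACS code; nonbondedKernel, doStep, and the buffer names are assumptions.

// Hedged sketch of the CPU-as-scheduler offload pattern with the CUDA
// runtime API. Kernel and data layout are illustrative placeholders.
#include <cuda_runtime.h>

// Hypothetical stand-in for a non-bonded force kernel.
__global__ void nonbondedKernel(const float4* coords, float3* forces, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) forces[i] = make_float3(0.f, 0.f, 0.f);  // real kernel: pair forces
}

void doStep(const float4* hCoords, float3* hForces,
            float4* dCoords, float3* dForces, int n, cudaStream_t stream) {
    // The CPU enqueues the transfer and the kernel asynchronously and
    // returns immediately; the GPU drains the stream in order.
    cudaMemcpyAsync(dCoords, hCoords, n * sizeof(float4),
                    cudaMemcpyHostToDevice, stream);
    nonbondedKernel<<<(n + 127) / 128, 128, 0, stream>>>(dCoords, dForces, n);
    cudaMemcpyAsync(hForces, dForces, n * sizeof(float3),
                    cudaMemcpyDeviceToHost, stream);

    // ... CPU-side tasks (bonded forces, inter-node communication) overlap
    // with the GPU work here ...

    // Block only when the GPU results are actually required.
    cudaStreamSynchronize(stream);
}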

COMPUTATIONAL CHALLENGES IN MD SIMULATIONS
The structure of the MD algorithm
Multi-level parallelism
HETEROGENEOUS PARALLELIZATION
Offloading force computation
Offloading complete MD iterations
The cluster pair algorithm
Non-bonded pair interaction kernel throughput
The pair list generation algorithm
Dual pair list with dynamic pruning
Multi-level load balancing
Benchmark systems
Findings
DISCUSSION