Highly scalable implementation of an N-body code on a GPU cluster

Yohei Miki, Daisuke Takahashi, Masao Mori

doi:10.1016/j.cpc.2013.04.011

Abstract

We have developed a highly optimized code for collisionless N-body calculations based on direct summation. Our new optimization hides the global memory access latency, and the resulting CUDA code has a peak performance of 1006.7 GFlop/s in single precision (assuming 26 floating-point operations per interaction) with a single NVIDIA Tesla M2090 board. To improve the scalability of the OpenMP/MPI hybrid parallelized code, we have reduced the number of communications among multiple GPUs and have overlapped communications with computations to hide communication time. The code’s performance was measured on HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences), a recently installed GPGPU cluster at the University of Tsukuba. The results show excellent scalability, with superlinear scaling when the number of N-body particles per GPU is less than 10^4 and parallel efficiency approaching unity when the number of N-body particles per GPU is greater than 10^4. The CUDA/OpenMP/MPI code has a peak performance of 255.5 TFlop/s when 256 NVIDIA Tesla M2090 boards are used, which is 75.0% of the theoretical peak performance.
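
The performance figures in the abstract follow from a counting convention: every pairwise interaction in the direct sum is charged 26 floating-point operations. The sketch below is a minimal serial illustration in Python, not the authors' CUDA implementation. It assumes Plummer softening with a hypothetical parameter `eps`, and the function names are invented for this example. The single-precision peak of 1331.2 GFlop/s per Tesla M2090 is taken from the vendor specification and is not stated in the abstract.

```python
import math

FLOPS_PER_INTERACTION = 26  # counting convention stated in the abstract


def direct_sum_accelerations(pos, mass, eps=1.0e-2):
    """O(N^2) direct summation of gravitational accelerations (G = 1).

    Plummer softening with eps > 0 is an assumption of this sketch; it keeps
    the self-interaction term finite, and that term adds zero acceleration.
    """
    n = len(pos)
    eps2 = eps * eps
    acc = []
    for i in range(n):
        xi, yi, zi = pos[i]
        ax = ay = az = 0.0
        for j in range(n):
            dx = pos[j][0] - xi
            dy = pos[j][1] - yi
            dz = pos[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            m_over_r3 = mass[j] / (r2 * math.sqrt(r2))
            ax += dx * m_over_r3
            ay += dy * m_over_r3
            az += dz * m_over_r3
        acc.append([ax, ay, az])
    return acc


def sustained_flops(n_particles, n_steps, elapsed_seconds):
    """Flop/s implied by N^2 interactions per step at 26 flops each."""
    interactions = n_particles * n_particles * n_steps
    return FLOPS_PER_INTERACTION * interactions / elapsed_seconds


def fraction_of_peak(measured_flops, n_boards, peak_per_board):
    """Measured performance divided by the aggregate theoretical peak."""
    return measured_flops / (n_boards * peak_per_board)


# Assumed vendor figure for one Tesla M2090 in single precision.
M2090_PEAK = 1331.2e9
cluster_fraction = fraction_of_peak(255.5e12, 256, M2090_PEAK)
single_board_fraction = fraction_of_peak(1006.7e9, 1, M2090_PEAK)
```

Under that assumed peak, `cluster_fraction` comes out close to the 75.0% quoted in the abstract, and `single_board_fraction` gives the corresponding ratio for the single-board figure of 1006.7 GFlop/s.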

Journal: Computer Physics Communications
Publication Date: May 2, 2013
Citations: 5

Similar Papers

Performance and accuracy of a GRAPE-3 system for collisionless N-body simulations
E Athanassoula ... A Bosma
Monthly Notices of the Royal Astronomical Society | VOL. 293
01 Feb 1998

SU(2) lattice gauge theory simulations on Fermi GPUs
Nuno Cardoso ... Pedro Bicudo
Journal of Computational Physics | VOL. 230
20 Feb 2011

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance
Hiroyuki Ootomo ... Rio Yokota
The International Journal of High Performance Computing Applications | VOL. 36
03 Jun 2022

CP-PACS: A massively parallel processor at the University of Tsukuba
Kisaburo Nakazawa ... Yoshiyuki Yamashita
Parallel Computing | VOL. 25
01 Dec 1999
