Accelerated FDPS: Algorithms to use accelerators with FDPS

Masaki Iwasawa, Junichiro Makino, Miyuki Tsubouchi, Keigo Nitadori, Daisuke Namekata, Kentaro Nomura, Long Wang

doi:10.1093/pasj/psz133

Abstract

We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the degree of parallelism available in the interaction function and that of the hardware. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We have also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations using FDPS on a system with an NVIDIA Volta GPGPU, and the achieved performance is around 27% of the theoretical peak. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.

Highlights

In this paper we describe new algorithms implemented in FDPS (Framework for Developing Particle Simulators; Iwasawa et al. 2016; Namekata et al. 2018) to make efficient use of accelerators such as GPGPUs.
The main cause of this problem is that modern high-performance computing (HPC) platforms have become very complex, so that developing programs which use them efficiently requires a great deal of effort.
The GPGPU performs the calculations for multiple interaction lists in parallel.
We have designed FDPS so that it provides all necessary functions for efficient parallel programming of particle-based simulations.
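
The abstract attributes the earlier inefficiency to CPU-accelerator latency and bandwidth, and the highlights mention evaluating multiple interaction lists in parallel. The following is a minimal sketch of that idea in plain Python, not FDPS code and not the authors' implementation. It assumes a softened gravitational interaction, and all names (`accel_call`, `words_by_copy`, `words_by_index`) are hypothetical. One call handles several interaction lists at once, and each list is passed as indices into particle data that is assumed to have been uploaded once, rather than as copies of the particles. The word counts assume four words per particle (x, y, z, mass) and one word per index.

```python
import math


def accel_call(pos, mass, target_lists, source_lists, eps2=0.0):
    """Hypothetical stand-in for a single accelerator kernel launch.

    pos and mass play the role of particle data already resident on the
    accelerator. Each interaction list is given as indices into them
    (indirect addressing), and several lists are processed in one call.
    Returns a dict mapping target index -> acceleration (ax, ay, az).
    """
    acc = {}
    for targets, sources in zip(target_lists, source_lists):
        for i in targets:
            xi, yi, zi = pos[i]
            ax = ay = az = 0.0
            for j in sources:
                if j == i:
                    continue  # skip self-interaction
                dx = pos[j][0] - xi
                dy = pos[j][1] - yi
                dz = pos[j][2] - zi
                r2 = dx * dx + dy * dy + dz * dz + eps2
                w = mass[j] / (r2 * math.sqrt(r2))
                ax += w * dx
                ay += w * dy
                az += w * dz
            acc[i] = (ax, ay, az)
    return acc


def words_by_copy(source_lists, words_per_particle=4):
    """Words sent if every list carries full copies of its source particles."""
    return sum(len(s) for s in source_lists) * words_per_particle


def words_by_index(n_particles, source_lists, words_per_particle=4):
    """Words sent if particles are uploaded once and lists carry only indices."""
    return n_particles * words_per_particle + sum(len(s) for s in source_lists)


# Illustrative data: four particles, two target groups sharing the same sources.
pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 4.0)]
mass = [1.0, 1.0, 1.0, 1.0]
target_lists = [[0, 1], [2, 3]]
source_lists = [[0, 1, 2, 3], [0, 1, 2, 3]]

# One "launch" covers both interaction lists.
acc = accel_call(pos, mass, target_lists, source_lists)
```

In this toy setup the saving from sending indices grows with the overlap between interaction lists, since a particle that appears in many lists is transferred only once. Batching several lists per call is meant to amortize the per-call latency and to expose more parallel work to the hardware.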

Summary

Introduction

In this paper we describe new algorithms implemented in FDPS (Framework for Developing Particle Simulators; Iwasawa et al. 2016; Namekata et al. 2018) to make efficient use of accelerators such as GPGPUs (general-purpose computing on graphics processing units). Developing efficient parallel programs for particle-based simulations requires a very large amount of work, comparable to that of a large team of people working for many years. Just writing and debugging such a program is difficult, and it has become nearly impossible for any single person, or even a small group of people, to develop large-scale simulation programs which run efficiently on modern HPC systems. The extremely large number of nodes in such systems is just one of the many difficulties of using them, since even within one node there are many levels of parallelism that the programmer must take care of.

Overview of FDPS

Traditional approach to using accelerators and its limitation

New algorithms

Indirect addressing of particles

Reuse of interaction lists

Procedures with or without the new algorithms

APIs for using accelerators

Method

Performance model on a single node

Model of T_{const lt}

Model of T_{root}

Model of T_{const gt}

Model of T_{reorder gt}

10 Tflops

Performance model on multiple nodes

Discussion and summary

Tree of domains

Findings

Procedure

Further improvement in single-node performance

Journal: Publications of the Astronomical Society of Japan	Publication Date: Feb 1, 2020
Citations: 14	License type: CC BY 4.0
