Ultrascalable Fourier transforms in three dimensions

Dmitry Pekurovsky

doi:10.1145/2016741.2016751

Abstract

Fourier and related transforms are widely used in the scientific community. Three-dimensional Fast Fourier Transforms (3D FFT), for example, are used in many areas, such as direct numerical simulation (DNS) of turbulence, astrophysics, materials science, chemistry, oceanography and X-ray crystallography. In many cases this is a very compute-intensive operation. Lately there has been a need for implementations of scalable 3D FFT and related algorithms on Petascale parallel machines [1-8]. Most existing implementations of 3D FFT use one-dimensional task decomposition and are therefore subject to a scaling limitation when the number of cores reaches the domain size. The P3DFFT library overcomes this limitation. It is an open-source, easy-to-use software package [9] providing a general solution for 3D FFT based on two-dimensional decomposition. In this way it differs from the majority of other libraries, such as FFTW, PESSL, MKL and ACML. P3DFFT is written in Fortran90 and MPI, with a C interface available. It uses FFTW as the underlying library for FFT computation in one dimension. P3DFFT has been demonstrated to scale quite well up to tens of thousands of cores on several platforms, including Kraken at NICS/ORNL. Theoretically it is scalable up to N-squared cores, given suitable hardware support, where N is the domain size. In practice the all-to-all communication inherent in the algorithm is often the performance bottleneck at large core counts. This type of communication stresses the bisection bandwidth of the interconnect and is a challenging operation for most High Performance Computing (HPC) systems. (In fact, one of the three NSF Track 1 system application procurement requirements involves 3D FFT as a crucial software component.) As a consequence, communication time is typically a high fraction of the overall time for the algorithm (80% is not uncommon). In spite of this, P3DFFT scales quite well, since the volume of data exchanged per core decreases in proportion as the core count increases.
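The scaling limits of the two decompositions can be illustrated with a minimal Python sketch. It is not P3DFFT code: the function names `max_ranks` and `pencil_shape` are hypothetical, and the sketch assumes a cubic N x N x N grid, one MPI rank per core, a grid size that divides evenly by the processor grid dimensions, and a pencil layout in which the first axis stays whole on every rank.

```python
def max_ranks(n, decomposition):
    """Upper bound on the number of ranks for an n x n x n grid.

    'slab'   : 1D decomposition, one axis is split, so at most n ranks.
    'pencil' : 2D decomposition, two axes are split, so at most n * n ranks.
    """
    if decomposition == "slab":
        return n
    if decomposition == "pencil":
        return n * n
    raise ValueError("decomposition must be 'slab' or 'pencil'")


def pencil_shape(n, p_rows, p_cols):
    """Local block held by one rank in a 2D (pencil) decomposition.

    Assumption for illustration: axis 0 is kept whole, axis 1 is split
    over p_rows ranks and axis 2 over p_cols ranks.
    """
    if n % p_rows != 0 or n % p_cols != 0:
        raise ValueError("n must be divisible by both grid dimensions")
    return (n, n // p_rows, n // p_cols)
```

For N = 4096, a slab decomposition stops at 4096 ranks, while a 64 x 1024 pencil grid uses 65536 ranks, each holding a local block of 4096 x 64 x 4 points.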
A P3DFFT test benchmark program has shown about 50% efficiency in strong scaling from 4k to 64k cores on Cray XT5 (see Figure 1). This is consistent with the expected power-law scaling of an all-to-all exchange on a 3D torus, where bisection bandwidth scales as P^(2/3) and P is the number of cores. Some performance tuning is recommended to get the maximum benefit; it is carried out by simply varying the aspect ratio of the two-dimensional processor grid. More details will be included in the presentation.
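The power-law expectation and the tuning step can be sketched in a few lines of Python. This is a rough model, not a measurement or a P3DFFT routine: it assumes run time is dominated by the all-to-all exchange and is inversely proportional to bisection bandwidth, so that time scales as P^(-2/3). The names `comm_bound_efficiency` and `processor_grids` are hypothetical.

```python
import math


def comm_bound_efficiency(p_small, p_large, exponent=2.0 / 3.0):
    """Strong-scaling efficiency when going from p_small to p_large cores.

    Assumed model: time ~ P ** (-exponent), so the speedup is
    (p_large / p_small) ** exponent, and efficiency is that speedup
    divided by the ideal speedup p_large / p_small.
    """
    ratio = p_large / p_small
    return ratio ** exponent / ratio


def processor_grids(p):
    """All (rows, cols) factorizations of p with rows <= cols.

    Each pair is one candidate aspect ratio of the 2D processor grid.
    """
    return [(r, p // r) for r in range(1, math.isqrt(p) + 1) if p % r == 0]
```

Under this model, going from 4096 to 65536 cores gives an efficiency of about 0.40, somewhat below the roughly 50% quoted above; the model ignores computation time and system-specific effects. Tuning would then amount to timing a run for each pair returned by `processor_grids(p)` and keeping the fastest.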
