Abstract

Fourier and related transforms are widely used in the scientific community. Three-dimensional Fast Fourier Transforms (3D FFT), for example, are used in many areas, such as direct numerical simulation (DNS) of turbulence, astrophysics, material science, chemistry, oceanography and X-ray crystallography. In many cases this is a very compute-intensive operation. Lately there has been a need for implementations of scalable 3D FFT and related algorithms on Petascale parallel machines [1-8]. Most existing implementations of 3D FFT use a one-dimensional task decomposition and are therefore subject to a scaling limitation once the number of cores reaches the linear domain size N. The P3DFFT library overcomes this limitation. It is an open-source, easy-to-use software package [9] providing a general solution for 3D FFT based on a two-dimensional decomposition. In this way it differs from the majority of other libraries, such as FFTW, PESSL, MKL and ACML. P3DFFT is written in Fortran 90 and MPI, with a C interface available. It uses FFTW as the underlying library for FFT computation in one dimension. P3DFFT has been demonstrated to scale quite well up to tens of thousands of cores on several platforms, including Kraken at NICS/ORNL. Theoretically it is scalable up to N-squared cores, given suitable hardware support. In practice the all-to-all communication inherent in the algorithm is often the performance bottleneck at large core counts. This type of communication stresses the bisection bandwidth of the interconnect and is a challenging operation for most High Performance Computing (HPC) systems. (In fact, one of the three NSF Track 1 system application procurement requirements involves 3D FFT as a crucial software component.) As a consequence, communication time is typically a high fraction of the overall time for the algorithm (80% is not uncommon). In spite of this, P3DFFT scales quite well, since the volume of data each core has to exchange decreases proportionately as the core count increases.
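The difference between the two decompositions can be illustrated with a minimal sketch. It assumes a cubic N x N x N domain and a plain block distribution in which each task holds the full first dimension and a slab of the other two; the function names (`split`, `pencil_shape`, `max_tasks`) and the 64 x 1024 grid are hypothetical choices for illustration and are not part of the P3DFFT interface.

```python
def split(n, parts, index):
    """Return (start, size) of block `index` when n points are split over `parts` tasks."""
    base, extra = divmod(n, parts)
    # The first `extra` blocks each take one additional point.
    size = base + (1 if index < extra else 0)
    start = index * base + min(index, extra)
    return start, size


def pencil_shape(n, m1, m2, row, col):
    """Local extent (nx, ny, nz) held by task (row, col) on an m1 x m2 processor grid.

    Assumption: the first dimension stays whole on every task, the second is
    split over m1 tasks and the third over m2 tasks.
    """
    if m1 > n or m2 > n:
        raise ValueError("each processor grid dimension must not exceed n")
    _, ny = split(n, m1, row)
    _, nz = split(n, m2, col)
    return (n, ny, nz)


def max_tasks(n, decomposed_dims):
    """Upper bound on task count: n for a 1D decomposition, n**2 for a 2D one."""
    return n ** decomposed_dims


if __name__ == "__main__":
    n = 4096
    print("1D limit:", max_tasks(n, 1))
    print("2D limit:", max_tasks(n, 2))
    # An arbitrary 64 x 1024 grid (65536 tasks), far beyond the 1D limit of 4096.
    print("local pencil:", pencil_shape(n, 64, 1024, 0, 0))
```

With a one-dimensional decomposition the task count cannot exceed N, because each task needs at least one plane; splitting two dimensions raises that bound to N squared.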
A P3DFFT test benchmark program has shown about 50% efficiency in strong scaling from 4k to 64k cores on a Cray XT5 (see Figure 1). This is consistent with the expected power-law scaling of an all-to-all exchange on a 3D torus, where bisection bandwidth scales as P^(2/3) with P the number of cores. Some performance tuning is recommended to get the maximum benefit; it is carried out simply by varying the aspect ratio of the two-dimensional processor grid. More details will be included in the presentation.
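As a rough check of that expectation, the following back-of-the-envelope estimate assumes the run is entirely communication-bound, that the total volume moved in the all-to-all is proportional to N^3 regardless of P, and that 4k and 64k mean 4096 and 65536 cores. The symbols T_comm and E are introduced here for illustration only.

```latex
\begin{align}
T_{\mathrm{comm}}(P) &\propto \frac{N^{3}}{P^{2/3}}, \\
E(P_1 \to P_2) &= \frac{P_1\, T_{\mathrm{comm}}(P_1)}{P_2\, T_{\mathrm{comm}}(P_2)}
               = \left(\frac{P_1}{P_2}\right)^{1/3}, \\
E(4096 \to 65536) &= 16^{-1/3} \approx 0.40 .
\end{align}
```

This communication-only figure is a lower estimate: the computational part of the algorithm, which is not limited by bisection bandwidth, would be expected to scale closer to linearly and so raise the overall efficiency toward the measured value of about 50%.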
