Abstract

In this paper, we propose an implementation of a parallel one-dimensional fast Fourier transform (FFT) on GPU clusters. The implementation is based on the six-step FFT algorithm. Because the parallel one-dimensional FFT requires three all-to-all communications, a key goal for parallel FFTs on GPU clusters is to minimize both the PCI Express transfer time and the MPI communication time. We demonstrate that the advanced features of MVAPICH2-GPU make it easy to overlap PCI Express transfers with MPI communication. Performance results for one-dimensional FFTs on a GPU cluster are reported. We achieved a performance of over 763 GFlops on 128 nodes of HA-PACS (268 nodes, 2.99 TFlops/node, 802 TFlops peak performance) for a 2^34-point FFT.
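
To make the structure of the six-step FFT algorithm concrete, the following is a minimal single-process sketch in Python with NumPy. It is not the paper's GPU or MPI implementation; the function name `six_step_fft` and the factor names `n1` and `n2` are assumptions chosen for illustration. The sketch splits a length-N transform (N = n1 * n2) into three transposes, two batches of shorter FFTs, and one twiddle-factor multiplication. In a distributed setting, each of the three transposes would typically correspond to one all-to-all communication.

```python
import numpy as np


def six_step_fft(x, n1, n2):
    """Illustrative six-step FFT of a length n1*n2 vector (hypothetical helper)."""
    x = np.asarray(x, dtype=np.complex128)
    n = n1 * n2
    if x.shape != (n,):
        raise ValueError("x must be a 1-D array of length n1 * n2")

    # Step 1: transpose. View x as an (n2, n1) matrix and transpose to (n1, n2),
    # so that a[j1, j2] = x[j1 + j2 * n1].
    a = np.ascontiguousarray(x.reshape(n2, n1).T)

    # Step 2: n1 independent FFTs of length n2 (one per row).
    b = np.fft.fft(a, axis=1)

    # Step 3: multiply by the twiddle factors exp(-2*pi*i * j1 * k2 / n).
    j1 = np.arange(n1).reshape(n1, 1)
    k2 = np.arange(n2).reshape(1, n2)
    b = b * np.exp(-2j * np.pi * j1 * k2 / n)

    # Step 4: transpose to (n2, n1).
    c = np.ascontiguousarray(b.T)

    # Step 5: n2 independent FFTs of length n1 (one per row).
    d = np.fft.fft(c, axis=1)

    # Step 6: transpose back to (n1, n2) and flatten, giving y[k2 + k1 * n2].
    return np.ascontiguousarray(d.T).reshape(n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n1, n2 = 8, 16
    x = rng.standard_normal(n1 * n2) + 1j * rng.standard_normal(n1 * n2)
    # Compare against NumPy's reference FFT.
    print(np.allclose(six_step_fft(x, n1, n2), np.fft.fft(x)))
```

The explicit `np.ascontiguousarray` calls make each transpose a real data movement rather than a stride change, which mirrors where data would be exchanged between processes in a parallel version.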
