Abstract

The Fast Fourier transform (FFT) is a bandwidth-limited algorithm. To alleviate the bandwidth requirement, shared memory is commonly used to hold intermediate data. However, given the limited capacity of shared memory on state-of-the-art GPUs, the intensive shared-memory usage of FFT reduces the number of thread blocks/workgroups that can run concurrently on a streaming multiprocessor/compute unit. In this work, we present our solution, called shared-memory multiplexing, to make more effective use of shared memory. Shared-memory multiplexing is built on the key observation that allocated shared memory is not utilized throughout the lifetime of a thread block/workgroup. We propose pure software approaches that enable multiple thread blocks/workgroups to time-multiplex shared memory, thereby increasing the number of concurrent thread blocks/workgroups on each streaming multiprocessor/compute unit. The improved thread-level parallelism yields significant performance gains for FFT. On an NVIDIA GTX 480 GPU, our FFT kernel outperforms the NVIDIA CUFFT V4.0 library by 21% for a 1K-point FFT with a batch size of 2,048. On an NVIDIA Tesla K20c GPU, our FFT kernel outperforms the NVIDIA CUFFT V5.0 library by 58% for the same inputs.
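To make the time-multiplexing idea concrete, the sketch below is a minimal CUDA illustration, not the paper's FFT kernel or its actual multiplexing scheme: one physical thread block processes two logical tiles of work back-to-back, so a single shared-memory buffer is reused instead of being held for the entire block lifetime. The kernel name multiplexed_sum, the tile size TILE, and the reduction workload are hypothetical placeholders chosen for brevity.

```cuda
// Illustrative sketch only: a single shared-memory buffer is time-shared
// by two logical tiles inside one physical thread block. Names and the
// reduction workload are assumptions, not taken from the paper.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256                           // threads per logical tile (assumed)

__global__ void multiplexed_sum(const float *in, float *out, int n_tiles)
{
    __shared__ float buf[TILE];            // one buffer, time-shared by phases

    for (int phase = 0; phase < 2; ++phase) {
        int tile = blockIdx.x * 2 + phase; // two logical tiles per block
        if (tile >= n_tiles) return;       // uniform across the block

        buf[threadIdx.x] = in[tile * TILE + threadIdx.x]; // stage into shared
        __syncthreads();

        for (int s = TILE / 2; s > 0; s >>= 1) {          // tree reduction
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[tile] = buf[0];
        __syncthreads();                   // buffer is now free for next phase
    }
}

int main()
{
    const int n_tiles = 8;
    const int n = n_tiles * TILE;
    float h_in[n], h_out[n_tiles];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n_tiles * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Half as many physical blocks as logical tiles.
    multiplexed_sum<<<n_tiles / 2, TILE>>>(d_in, d_out, n_tiles);
    cudaMemcpy(h_out, d_out, n_tiles * sizeof(float), cudaMemcpyDeviceToHost);

    printf("tile 0 sum = %.0f (expected %d)\n", h_out[0], TILE);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The sketch only shows buffer reuse within one block; the abstract's software approaches target multiplexing across thread blocks/workgroups to raise per-SM occupancy, for which the paper itself should be consulted.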
