Abstract

The Fast Fourier transform (FFT) is a bandwidth-limited algorithm. To alleviate the bandwidth requirement, shared memory is commonly used to hold intermediate data. However, given the limited capacity of shared memory on state-of-the-art GPUs, the intensive shared-memory usage of FFT reduces the number of thread blocks/workgroups that can run concurrently on a streaming multiprocessor/compute unit. In this work, we present our solution, called shared-memory multiplexing, to make more effective use of shared memory. Shared-memory multiplexing is built on the key observation that allocated shared memory is not utilized throughout the lifetime of a thread block/workgroup. We propose pure software approaches that enable multiple thread blocks/workgroups to time-multiplex shared memory, thereby increasing the number of concurrent thread blocks/workgroups on each streaming multiprocessor/compute unit. The improved thread-level parallelism yields significant performance gains for FFT. On an NVIDIA GTX 480 GPU, our FFT kernel outperforms the NVIDIA CUFFT V4.0 library by 21% for a 1K-point FFT with a batch size of 2,048. On an NVIDIA Tesla K20c GPU, our FFT kernel outperforms the NVIDIA CUFFT V5.0 library by 58% for the same inputs.
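To make the time-multiplexing idea concrete, the sketch below is a minimal CUDA illustration, not the paper's FFT kernel or its actual multiplexing scheme: one physical thread block processes two logical tiles of work back-to-back, so a single shared-memory buffer is reused instead of being held for the entire block lifetime. The kernel name multiplexed_sum, the tile size TILE, and the reduction workload are hypothetical placeholders chosen for brevity.

```cuda
// Illustrative sketch only: a single shared-memory buffer is time-shared
// by two logical tiles inside one physical thread block. Names and the
// reduction workload are assumptions, not taken from the paper.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256                           // threads per logical tile (assumed)

__global__ void multiplexed_sum(const float *in, float *out, int n_tiles)
{
    __shared__ float buf[TILE];            // one buffer, time-shared by phases

    for (int phase = 0; phase < 2; ++phase) {
        int tile = blockIdx.x * 2 + phase; // two logical tiles per block
        if (tile >= n_tiles) return;       // uniform across the block

        buf[threadIdx.x] = in[tile * TILE + threadIdx.x]; // stage into shared
        __syncthreads();

        for (int s = TILE / 2; s > 0; s >>= 1) {          // tree reduction
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[tile] = buf[0];
        __syncthreads();                   // buffer is now free for next phase
    }
}

int main()
{
    const int n_tiles = 8;
    const int n = n_tiles * TILE;
    float h_in[n], h_out[n_tiles];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n_tiles * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Half as many physical blocks as logical tiles.
    multiplexed_sum<<<n_tiles / 2, TILE>>>(d_in, d_out, n_tiles);
    cudaMemcpy(h_out, d_out, n_tiles * sizeof(float), cudaMemcpyDeviceToHost);

    printf("tile 0 sum = %.0f (expected %d)\n", h_out[0], TILE);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The sketch only shows buffer reuse within one block; the abstract's software approaches target multiplexing across thread blocks/workgroups to raise per-SM occupancy, for which the paper itself should be consulted.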
