An Efficient Shuffle-Light FFT Library

Salvatore Servodio, Xiaoming Li

doi:10.1109/ipccc51483.2021.9679431

Abstract

The Fast Fourier Transform (FFT) is one of the most widely used computation kernels. Its performance is highly sensitive to the efficiency of its memory operations. In particular, when the FFT is applied to large problem sizes, overall performance becomes almost entirely memory bound, and the impact of the algorithm’s computational operations becomes minimal. This is probably why existing efforts to optimize large FFTs have focused on the efficiency of memory operations. The widely adopted approach in that regard is to adapt parameters external to the FFT algorithm, such as data layout and system software memory management, to the needs of the FFT’s memory access pattern. In this paper, we look into the FFT internals for a way to mitigate the FFT’s memory "boundness". The low efficiency of the FFT’s memory operations is, to a large degree, caused by the many strided data shuffles intrinsic to the algorithm, and is made worse by the fact that many of those strides are powers of two. Strided data shuffling is already a hard workload pattern for the memory system, and power-of-two strides make the access pattern even more challenging. This power-of-two strided memory access pattern is the key problem that our paper addresses, and our solution to it is our main contribution. Our key idea is a complete separation of the logical indexes and the physical indexes of the data accessed internally by the FFT. By maintaining an efficient mapping between the logical and physical indexes through all passes, almost any stride size can be used and further tuned. As a result, FFTs can be carried out correctly without most of the worst power-of-two data shuffles. This greatly improves memory performance and in turn raises overall performance.
We compare our Shuffle-Light FFT (SLFFT) library with FFTW, a very widely used FFT library, and achieve an average speedup of 24.7% on problem sizes from 2^6 to 2^26, and a maximal speedup of 39.4%.
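
To illustrate why separating logical from physical indexes can help, the following Python sketch uses simple row padding as the mapping. This is a well-known generic technique chosen for illustration only; it is not the mapping used in SLFFT, which the abstract does not specify. The function names, the toy cache model, and all parameters (`row_len`, `pad`, `elems_per_line`, `num_sets`) are assumptions.

```python
def physical_index(logical: int, row_len: int = 1024, pad: int = 8) -> int:
    """Map a logical index to a physical index by adding `pad` unused
    elements after every `row_len` logical elements (hypothetical layout)."""
    return logical + (logical // row_len) * pad


def distinct_cache_sets(indexes, elems_per_line: int = 8, num_sets: int = 64) -> int:
    """Count the cache sets touched in a toy set-indexed cache model:
    set = (element index // elements per line) mod number of sets."""
    return len({(i // elems_per_line) % num_sets for i in indexes})


# A power-of-two logical stride, as produced by a strided FFT pass.
stride = 1024
logical = [k * stride for k in range(64)]

# The same accesses after translating logical indexes to physical ones.
physical = [physical_index(i) for i in logical]

sets_unpadded = distinct_cache_sets(logical)   # every access hits the same set
sets_padded = distinct_cache_sets(physical)    # accesses spread over all sets

print(sets_unpadded, sets_padded)  # prints "1 64"
```

In this toy model, the 64 accesses with a logical stride of 1024 all fall into a single cache set, while the padded physical layout spreads them over 64 sets. The algorithm still reasons in logical indexes; only the address translation changes, which is the general shape of the idea described above.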
