Abstract

The Fast Fourier Transform (FFT) has been considered one of the most important computing algorithms for decades. Its vast application domain makes it an important performance benchmark for new computer architectures. The most common FFT algorithm, Cooley-Tukey, factorizes a large FFT into a combination of smaller ones. The choice of factors and the order in which they are applied are critical to the ultimate performance of the large FFT. Traditional hand-coded FFT libraries can execute an FFT of a given size immediately by applying fixed heuristics to different kernel sizes, but the resulting plans are not always optimal. FFTW is a popular auto-tuning FFT library that searches over the possible factorizations and empirically determines the one with the best performance. This search produces FFT kernels for a given size that are competitive with hand-tuned libraries. Unfortunately, the search for a large size takes hours on real hardware and is completely infeasible when evaluating the FFT performance of new hardware that is still in the simulation phase. It is also less than ideal in environments where a rapid response to a new FFT size is desirable. This paper introduces a novel performance model that allows the FFT performance for a given data size to be estimated to within 2% error without ever running the actual FFT. In addition, by recognizing more sophisticated patterns within the computation, this model reduces the search tree size from the number of permutations of the factors to the number of combinations. Because typical FFT sizes contain a large number of similar factors, this effectively reduces the search by an order of magnitude. Given a set of computational kernels, the model can completely characterize the performance of a chosen target architecture by running short performance tests on each kernel size, a process that takes a few minutes or less.
Once the architecture is characterized, an optimal FFT plan for a given input size can be determined in milliseconds instead of hours. In this paper, we first derive our mathematical model. We then validate its accuracy by using it to improve the performance of a state-of-the-art, hand-tuned FFT library by 30%. Finally, we demonstrate its effectiveness by replacing FFTW's own planning stage with our model, resulting in the same FFT performance using FFTW's own kernels in as little as one millionth of the computation time.
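
To give a rough sense of the gap between searching permutations and searching combinations of factors, the following minimal sketch counts both for a power-of-two size. It is an illustration only and does not reproduce the paper's performance model: the kernel set `KERNELS` and the function names `ordered_plans` and `unordered_plans` are hypothetical, and the sketch simply assumes that the order of factors can be ignored once the model accounts for it.

```python
from functools import lru_cache

# Hypothetical set of available kernel sizes (an assumption for illustration).
KERNELS = (2, 4, 8, 16)


def ordered_plans(n, kernels=KERNELS):
    """Count ordered sequences of kernel sizes whose product is n."""
    @lru_cache(maxsize=None)
    def count(m):
        if m == 1:
            return 1  # nothing left to factor: one complete plan
        # Try every kernel that divides the remaining size as the next stage.
        return sum(count(m // k) for k in kernels if m % k == 0)

    return count(n)


def unordered_plans(n, kernels=KERNELS):
    """Count multisets of kernel sizes whose product is n (order ignored)."""
    ks = sorted(kernels)

    @lru_cache(maxsize=None)
    def count(m, i):
        if m == 1:
            return 1  # nothing left to factor: one complete multiset
        if i == len(ks):
            return 0  # no kernels left to try
        total = count(m, i + 1)  # choose no further copies of ks[i]
        if m % ks[i] == 0:
            total += count(m // ks[i], i)  # choose one more copy of ks[i]
        return total

    return count(n, 0)


if __name__ == "__main__":
    n = 1024
    print(n, ordered_plans(n), unordered_plans(n))
```

With this assumed kernel set, a size of 1024 has 401 ordered factorizations but only 23 unordered ones, which is in line with the order-of-magnitude reduction described above. The exact ratio depends on the kernel set and the FFT size.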

