Abstract
FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm2, respectively.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have