Instruction scheduling heuristic for an efficient FFT in VLIW processors with balanced resource usage

Mounir Bahtat, Philippe Le Gall, Said Belkouch, Philippe Elleaume

doi:10.1186/s13634-016-0336-0


Open Access

https://doi.org/10.1186/s13634-016-0336-0


Abstract

The fast Fourier transform (FFT) is perhaps today's most ubiquitous algorithm used with digital data; hence, it is still being studied extensively. Besides reducing the arithmetic count of the FFT algorithm, memory references and the mapping of the scheme onto the processor's architecture are critical for a fast and efficient implementation. The main bottlenecks include the long-latency memory accesses to the butterflies' legs and the redundant references to twiddle factors. In this paper, we describe a new FFT implementation on high-end very long instruction word (VLIW) digital signal processors (DSP), which improves performance in terms of clock cycles thanks to the resulting low-level resource balance and to the reduced memory accesses to twiddle factors. The method introduces a tradeoff parameter between accuracy and speed. Additionally, we suggest a cache-efficient implementation methodology for the FFT that depends on the provided VLIW hardware resources and cache structure. Experimental results on a TI VLIW DSP show that our method reduces the number of clock cycles by an average of 51% (roughly a 2x speedup) when compared to the most assembly-optimized and vendor-tuned FFT libraries. The FFT was generated using an instruction-level scheduling heuristic. It is a modulo-based, register-sensitive scheduling algorithm that computes an aggressively efficient sequence of VLIW instructions for the FFT, maximizing the parallelism rate and minimizing clock cycles and register usage.

Highlights

The discrete Fourier transform (DFT) is a widely used transform for spectral analysis of finite-domain discrete-time signals
Many factors other than the pure number of arithmetic operations must be considered for an efficient fast Fourier transform (FFT) implementation on a very long instruction word (VLIW) digital signal processor (DSP); these arise from memory-induced stalls, regularity, and the algorithm's mapping onto hardware VLIW architectures
To reduce the code expansion that modulo scheduling naturally incurs, VLIW processors implement hardware facilities for software pipelining
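The code expansion mentioned in the last highlight can be illustrated with a rough sketch. It assumes the textbook software-pipelining layout (a prologue, a steady-state kernel of one initiation interval, and an epilogue) rather than the exact accounting used in the paper; the function names `software_pipeline_layout` and `code_cycles` are hypothetical.

```python
import math


def software_pipeline_layout(schedule_length, initiation_interval):
    """Return (stages, prologue, kernel, epilogue), sizes given in cycles.

    Assumption: one loop iteration takes `schedule_length` cycles and a new
    iteration is started every `initiation_interval` cycles.
    """
    if schedule_length <= 0 or initiation_interval <= 0:
        raise ValueError("lengths must be positive")
    # Number of overlapped iterations in steady state.
    stages = math.ceil(schedule_length / initiation_interval)
    # Filling and draining the pipeline each take (stages - 1) intervals.
    ramp = (stages - 1) * initiation_interval
    return stages, ramp, initiation_interval, ramp


def code_cycles(schedule_length, initiation_interval, hardware_loop_support):
    """Cycles of instructions that must be stored in program memory.

    Assumption: with hardware support for software pipelining only the
    kernel is stored; otherwise prologue and epilogue are emitted too.
    """
    _, prologue, kernel, epilogue = software_pipeline_layout(
        schedule_length, initiation_interval
    )
    if hardware_loop_support:
        return kernel
    return prologue + kernel + epilogue
```

Under these assumptions, a loop body scheduled over 10 cycles with an initiation interval of 2 needs 18 cycles of stored instructions without hardware support, against 2 with it.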

Summary

Introduction

The discrete Fourier transform (DFT) is a widely used transform for spectral analysis of finite-domain discrete-time signals. The most recent high-end DSP architectures are VLIW. They mainly support instruction-level parallelism (ILP), which makes it possible to execute multiple instructions simultaneously, and data-level parallelism, which allows access to multiple data during each cycle. Such processors are known to outperform RISC or CISC designs despite having a simpler and more explicit internal design.

Many factors other than the pure number of arithmetic operations must be considered for an efficient FFT implementation on a VLIW DSP; these arise from memory-induced stalls, regularity, and the algorithm's mapping onto hardware VLIW architectures. We describe the targeted VLIW family and the related state-of-the-art modulo scheduling.
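For readers unfamiliar with the terms, the following minimal sketch contrasts a direct DFT with a textbook recursive radix-2 decimation-in-time FFT, and marks where butterflies and twiddle factors appear. It is not the FFT scheme proposed in the paper, whose details are not given in this summary; the function names `dft` and `fft_radix2` are illustrative.

```python
import cmath


def dft(x):
    """Direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Textbook recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor: in a tuned implementation this is typically a
        # table lookup, i.e. a memory reference.
        w = cmath.exp(-2j * cmath.pi * k / n)
        t = w * odd[k]
        # Butterfly: two inputs ("legs") produce two outputs.
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Both functions return the same spectrum up to rounding error; the FFT does so with O(n log n) arithmetic operations, which is why memory traffic for the butterfly legs and twiddle factors becomes a significant part of the cost.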

VLIW DSP processors

Our implementation methodology for the FFT on VLIW DSPs

Findings

Conclusions