Abstract

The fast Fourier transform (FFT) is perhaps today’s most ubiquitous algorithm used with digital data; hence, it is still being studied extensively. Beyond reducing the arithmetic count of the FFT algorithm, memory references and the mapping of the scheme onto the processor’s architecture are critical for a fast and efficient implementation. One of the main bottlenecks lies in the long-latency memory accesses to the butterflies’ legs and in the redundant references to twiddle factors. In this paper, we describe a new FFT implementation on high-end very long instruction word (VLIW) digital signal processors (DSPs), which achieves improved performance in terms of clock cycles thanks to the resulting low-level resource balance and to the reduced memory accesses to twiddle factors. The method introduces a tradeoff parameter between accuracy and speed. Additionally, we suggest a cache-efficient implementation methodology for the FFT that depends on the available VLIW hardware resources and cache structure. Experimental results on a TI VLIW DSP show that our method reduces the number of clock cycles by an average of 51% (roughly a 2× speedup) when compared to the most assembly-optimized and vendor-tuned FFT libraries. The FFT was generated using an instruction-level scheduling heuristic: a modulo-based, register-sensitive scheduling algorithm that computes an aggressively efficient sequence of VLIW instructions for the FFT, maximizing the parallelism rate and minimizing clock cycles and register usage.

Highlights

  • The discrete Fourier transform (DFT) is a widely used transform for spectral analysis of finite-domain discrete-time signals

  • Many factors other than the raw number of arithmetic operations must be considered for an efficient fast Fourier transform (FFT) implementation on a very long instruction word (VLIW) digital signal processor (DSP), including memory-induced stalls, regularity, and the mapping of the algorithm onto the VLIW hardware architecture

  • To reduce the code expansion that modulo scheduling naturally incurs, VLIW processors implement hardware facilities for software pipelining


Summary

Introduction

The discrete Fourier transform (DFT) is a widely used transform for spectral analysis of finite-domain discrete-time signals. The most recent high-end DSP architectures are VLIW, which mainly support instruction-level parallelism (ILP), offering the possibility of executing multiple instructions simultaneously, and data-level parallelism, allowing access to multiple data items during each cycle. These kinds of processors are known to have greater performance than RISC or CISC processors, even while having a simpler and more explicit internal design.

Background on the FFT algorithm of interest

Many factors other than the raw number of arithmetic operations must be considered for an efficient FFT implementation on a VLIW DSP, including memory-induced stalls, regularity, and the mapping of the algorithm onto the VLIW hardware architecture. We describe the targeted VLIW family and the related state-of-the-art modulo scheduling.
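To make the terms "butterfly legs" and "twiddle factors" concrete, here is a minimal Python sketch of a textbook iterative radix-2 decimation-in-time FFT. It is not the implementation described in the paper, and the names `twiddle_table`, `fft_radix2`, and `dft_naive` are hypothetical, chosen only for illustration. The loop order is arranged so that each twiddle factor is read once per stage and shared by every butterfly that needs it, which is one simple way to cut down on redundant twiddle references.

```python
import cmath


def twiddle_table(n):
    """Precompute W_n^k = exp(-2*pi*i*k/n) for k in [0, n/2)."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def fft_radix2(x):
    """Textbook iterative radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = [complex(v) for v in x]

    # Bit-reversal permutation so that the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    w = twiddle_table(n)  # twiddles are computed once, not per butterfly
    size = 2
    while size <= n:
        half = size // 2
        step = n // size  # stride into the shared twiddle table
        for k in range(half):
            wk = w[k * step]  # one twiddle read, reused by all groups below
            for start in range(0, n, size):
                top = a[start + k]              # first butterfly leg
                bot = a[start + k + half] * wk  # second leg, scaled by twiddle
                a[start + k] = top + bot
                a[start + k + half] = top - bot
        size *= 2
    return a


def dft_naive(x):
    """Direct O(n^2) DFT, used here only as a correctness reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
```

In this sketch the two reads `a[start + k]` and `a[start + k + half]` are the butterfly legs, and their separation (`half`) doubles at each stage, which is why memory access patterns and cache behavior matter as much as the arithmetic count.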

VLIW DSP processors
Our implementation methodology for the FFT on VLIW DSPs
Findings
Conclusions