Abstract

The Complex Matrix Multiplication (CMM) algorithm is known to demand high computing performance and to pose considerable challenges in real-life applications. Recent Very Long Instruction Word (VLIW) based Digital Signal Processors (DSPs) have demonstrated high computing capabilities at very low power consumption. In this work, we propose three ultra-fast, parallel and efficient VLIW implementations of the CMM algorithm that can be used to meet the tight real-time constraints of signal and image processing applications such as radar. We suggest a novel parallel kernel, a task mapping strategy and low-level optimization techniques that fit a set of modern VLIW architectures. Additionally, we adopt an original memory access management technique that accelerates the algorithm by avoiding cache misses and bank conflicts. The experimental results show the effectiveness of the proposed approaches: a peak performance of 15.89 GFLOPS is achieved on one C66x DSP core with a core utilization of 99%, corresponding to speedups of about 1.61, 3 and 10 over the state-of-the-art, the most optimized vendor and the conventional approaches, respectively.
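For orientation, the conventional approach used as the slowest baseline above can be pictured as a plain triple loop over complex multiply-accumulate operations. The following is a minimal sketch in portable C, not the authors' VLIW kernel. The type `cfloat`, the function `cmm_naive`, the row-major layout and the single-precision format are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical single-precision complex type (illustrative only). */
typedef struct {
    float re;
    float im;
} cfloat;

/* Conventional, unoptimized complex matrix multiplication: C = A * B.
 * A is m x k, B is k x n, C is m x n, all stored row-major (assumed layout).
 * Each inner-loop step performs 4 real multiplies and 4 real additions. */
void cmm_naive(const cfloat *a, const cfloat *b, cfloat *c,
               size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc_re = 0.0f;
            float acc_im = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                const cfloat x = a[i * k + p];
                const cfloat y = b[p * n + j];
                /* (x.re + i x.im) * (y.re + i y.im) */
                acc_re += x.re * y.re - x.im * y.im;
                acc_im += x.re * y.im + x.im * y.re;
            }
            c[i * n + j].re = acc_re;
            c[i * n + j].im = acc_im;
        }
    }
}
```

In this form the column-wise walk through B touches memory with a stride of n elements, which is the kind of access pattern that tends to cause the cache misses and bank conflicts that the proposed memory access management technique aims to avoid.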
