Abstract

With growing demands for real-time control, classification, and prediction, algorithms become more complex while low-power, small-size devices are required. Matrix multiplication (direct or transposed) is common to such algorithms, and many of them multiply matrices repeatedly, feeding the result of one multiplication into the next. This work describes a versatile computation procedure and architecture: one of the matrices is stored in internal memory in circulant form, after which a sequence of direct or transposed multiplications can be performed without timing penalty. The architecture assigns a RAM-ALU block to each matrix column, with the blocks connected in a systolic ring; computation propagates through the RAM-ALU blocks over local connections only, minimizing delays. The system is delivered as an IP core, fully parameterisable in matrix size and data format at implementation time. An $N\times N$ matrix multiplication is performed in $\mathcal{O}(N^2)$ clock cycles, using $N$ RAM-ALU blocks of memory size $2N$ each. On a Virtex-7 FPGA, the clock runs at 340 MHz for $100\times 100$ and 290 MHz for $1000\times 1000$ matrices, with one clock cycle per element multiplication.
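The abstract does not spell out the addressing scheme behind the circulant storage, but the claimed property (direct and transposed multiplication from the same stored operand, with each of the $N$ memory banks read once per cycle) can be modeled in software. The sketch below assumes an illustrative diagonal skew: bank $j$ (one per RAM-ALU block) stores column $j$ of $B$ rotated by $j$ rows. The function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def to_circulant(B):
    """Illustrative circulant storage: bank j holds column j of B skewed
    by j rows, i.e. S[i, j] = B[(i + j) % N, j].  With this skew, at any
    'cycle' every bank is read at a distinct address (no conflicts)."""
    N = B.shape[0]
    S = np.empty_like(B)
    for j in range(N):
        for i in range(N):
            S[i, j] = B[(i + j) % N, j]
    return S

def multiply(A, S):
    """C = A @ B from the skewed storage: N cycles per output row, N^2
    total.  The inner j-loop models the N RAM-ALU blocks working in
    parallel while row r of A rotates around the systolic ring."""
    N = A.shape[0]
    C = np.zeros_like(A)
    for r in range(N):
        for t in range(N):                       # one 'clock cycle'
            for j in range(N):                   # parallel in hardware
                # bank j, address t holds B[(t + j) % N, j]
                C[r, j] += A[r, (t + j) % N] * S[t, j]
    return C

def multiply_transpose(A, S):
    """C = A @ B.T from the SAME storage, without repacking: reading bank
    j at address (t - j) % N recovers B[t, j]."""
    N = A.shape[0]
    C = np.zeros_like(A)
    for r in range(N):
        for t in range(N):
            for j in range(N):
                C[r, t] += A[r, j] * S[(t - j) % N, j]
    return C
```

Both loops perform exactly $N^2$ accumulation cycles per operand row sweep, consistent with the abstract's $\mathcal{O}(N^2)$ cycle count for $N$ blocks operating in parallel; the actual hardware schedule may differ from this assumed skew.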
