Abstract

With growing demands for real-time control, classification, and prediction, algorithms become more complex while low-power, small-size devices are required. Matrix multiplication (direct or transposed) is common to such algorithms, and many of them multiply matrices repeatedly, feeding the result of one multiplication into the next. This work describes a versatile computation procedure and architecture: one of the matrices is stored in internal memory in circulant form, after which a sequence of direct or transposed multiplications can be performed without timing penalty. The architecture assigns a RAM-ALU block to each matrix column, with the blocks connected in a systolic ring; computation propagates through the RAM-ALU blocks over local connections only, minimizing delays. The system is delivered as an IP core, fully parameterisable in matrix size and data format at implementation time. An $N\times N$ matrix multiplication is performed in $\mathcal{O}(N^2)$ clock cycles, using $N$ RAM-ALU blocks of memory size $2N$ each. On a Virtex-7 FPGA, the clock runs at 340 MHz for $100\times 100$ and 290 MHz for $1000\times 1000$ matrices, with one clock cycle per element multiplication.
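The abstract does not spell out the addressing scheme behind the circulant storage, but the claimed property (direct and transposed multiplication from the same stored operand, with each of the $N$ memory banks read once per cycle) can be modeled in software. The sketch below assumes an illustrative diagonal skew: bank $j$ (one per RAM-ALU block) stores column $j$ of $B$ rotated by $j$ rows. The function and variable names are hypothetical, not taken from the paper.

```python
import numpy as np

def to_circulant(B):
    """Illustrative circulant storage: bank j holds column j of B skewed
    by j rows, i.e. S[i, j] = B[(i + j) % N, j].  With this skew, at any
    'cycle' every bank is read at a distinct address (no conflicts)."""
    N = B.shape[0]
    S = np.empty_like(B)
    for j in range(N):
        for i in range(N):
            S[i, j] = B[(i + j) % N, j]
    return S

def multiply(A, S):
    """C = A @ B from the skewed storage: N cycles per output row, N^2
    total.  The inner j-loop models the N RAM-ALU blocks working in
    parallel while row r of A rotates around the systolic ring."""
    N = A.shape[0]
    C = np.zeros_like(A)
    for r in range(N):
        for t in range(N):                       # one 'clock cycle'
            for j in range(N):                   # parallel in hardware
                # bank j, address t holds B[(t + j) % N, j]
                C[r, j] += A[r, (t + j) % N] * S[t, j]
    return C

def multiply_transpose(A, S):
    """C = A @ B.T from the SAME storage, without repacking: reading bank
    j at address (t - j) % N recovers B[t, j]."""
    N = A.shape[0]
    C = np.zeros_like(A)
    for r in range(N):
        for t in range(N):
            for j in range(N):
                C[r, t] += A[r, j] * S[(t - j) % N, j]
    return C
```

Both loops perform exactly $N^2$ accumulation cycles per operand row sweep, consistent with the abstract's $\mathcal{O}(N^2)$ cycle count for $N$ blocks operating in parallel; the actual hardware schedule may differ from this assumed skew.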
