Abstract

This paper presents the FPGA implementation of a simple processor architecture for accelerating data-parallel applications. The proposed processor, called SuperSMP, can execute multi-scalar, vector, and matrix instructions on parallel execution datapaths. Four 32-bit instructions (4×32 bits) are fetched from the instruction cache, decoded, and checked for dependencies. Up to four independent scalar instructions can be issued in order to the parallel execution units. Vector/matrix instructions, in contrast, repeatedly issue four vector/matrix operations at a time without dependency checking. On its four parallel execution units, SuperSMP can perform addition, subtraction, multiplication, division, and shifting on scalar, vector, and matrix data. Four contiguous 32-bit vector/matrix elements can be transferred per clock cycle between the L2 cache and the matrix register file, in either direction (load or store). Finally, up to four 32-bit results or loaded values can be written into the scalar/matrix register files. The FPGA implementation of SuperSMP requires 14,032 slices on a Xilinx Virtex-5 device (XC5VLX110-3FF1153). It uses 49,398 LUT flip-flop pairs, of which 17,166 have an unused flip-flop, 10,267 have an unused LUT, and 21,965 are fully used. The complexity of SuperSMP is about 3.5 times that of the baseline scalar processor, while its performance is 4.3 to 18.2 times higher than that of the baseline.
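
As a rough illustration of the issue behavior described above, the following Python sketch models in-order issue of a four-instruction fetch group and the repeated issue of a vector/matrix instruction in groups of four operations. It is not the authors' hardware design: the `Instr` type, the function names, and the exact dependency rule (stop at the first read-after-write or write-after-write hazard within the group) are assumptions made for the example, since the abstract says only that dependencies are checked.

```python
from dataclasses import dataclass
from typing import List, Tuple

LANES = 4  # four parallel execution units, per the abstract


@dataclass(frozen=True)
class Instr:
    """Hypothetical scalar instruction: one destination, some source registers."""
    op: str
    dest: int
    srcs: Tuple[int, ...]


def issue_in_order(group: List[Instr]) -> List[Instr]:
    """Return the leading instructions of a fetch group that can issue together.

    Assumed rule (not taken from the paper): issue stops at the first
    instruction that reads (RAW) or writes (WAW) a register written by an
    earlier instruction in the same group.
    """
    issued: List[Instr] = []
    written = set()
    for instr in group[:LANES]:
        if written.intersection(instr.srcs) or instr.dest in written:
            break  # in-order issue: younger instructions may not pass this one
        issued.append(instr)
        written.add(instr.dest)
    return issued


def vector_issue_slots(length: int) -> List[range]:
    """Split a vector/matrix instruction over `length` elements into groups of
    up to four element operations, issued one after another with no
    dependency check."""
    return [range(start, min(start + LANES, length))
            for start in range(0, length, LANES)]


group = [
    Instr("add", dest=1, srcs=(2, 3)),
    Instr("sub", dest=4, srcs=(5, 6)),
    Instr("mul", dest=7, srcs=(1, 4)),  # reads r1 and r4 written above
    Instr("sll", dest=8, srcs=(9, 9)),
]
print([i.op for i in issue_in_order(group)])      # ['add', 'sub']
print([list(r) for r in vector_issue_slots(10)])  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In this model, the dependent `mul` and everything after it wait for a later issue, whereas a ten-element vector operation simply occupies three consecutive issue groups.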
