Abstract

In recent years, High Performance Computing (HPC) codes have been widely used on various computing platforms such as server clusters, parallel GPUs and FPGAs [1]. For many HPC applications, the Basic Linear Algebra Subprograms (BLAS) library is one of the most basic and important function libraries. In this paper, we designed and developed a parametric and modular implementation of matrix-matrix multiplication on the FPGA. We then analyzed and compared BLAS performance on an embedded computing platform that integrates a CPU, a GPU and an FPGA. For the CPU, we adopted the latest standard library available on the experimental devices. For the GPU, we compiled an OpenGL program to perform the matrix-matrix multiplication. Finally, we used computing performance as the metric for evaluating the CPU, GPU and FPGA implementations. The experimental results show that the BLAS kernel on the FPGA provides the best computing performance, outperforming the CPU implementation by a factor of 18.7-22.0 and the GPU implementation by a factor of 3.8-6.0.
