Abstract
In recent years, High Performance Computing (HPC) codes have been widely used on various computing platforms such as server clusters, parallel GPUs, and FPGAs [1]. For many HPC applications, the Basic Linear Algebra Subprograms (BLAS) library is one of the most basic and important function libraries. In this paper, we designed and developed a parametric and modular implementation of matrix-matrix multiplication on the FPGA. We then analyzed and compared BLAS implementations on an embedded computing platform that integrates a CPU, a GPU, and an FPGA. For the CPU, we adopted the latest standard library available on the experimental devices. For the GPU, we wrote an OpenGL program to perform the matrix-matrix multiplication. Finally, we used computing performance as the metric to evaluate the CPU, GPU, and FPGA. The experimental results show that the BLAS kernel on the FPGA provides the best computing performance: it is 18.7 to 22.0 times faster than the CPU implementation and 3.8 to 6.0 times faster than the GPU implementation.
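For readers unfamiliar with the kernel being compared, the following is a minimal reference sketch of a general matrix-matrix multiplication of the form C = alpha * A * B + beta * C, written with tiled loops. It is not the FPGA design described in this paper; the function name `gemm` and the `tile` parameter are hypothetical and serve only to illustrate what a parametric block size might look like in software.

```python
def gemm(alpha, a, b, beta, c, tile=2):
    """Return alpha * (a x b) + beta * c for matrices stored as lists of rows.

    `tile` is an illustrative block-size parameter (an assumption, not taken
    from the paper); any positive value gives the same numerical result.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Start from the scaled copy of C so the input matrix is not modified.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    # Walk over the output and the shared dimension one tile at a time.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] += alpha * acc
    return out
```

Changing `tile` only changes the order in which partial products are accumulated, which is the kind of parameter a hardware implementation might expose to trade resources against throughput.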