Abstract

The new-generation Sunway supercomputer offers ultra-high computing capacity. However, because of its unique heterogeneous architecture, open-source implementations of the Basic Linear Algebra Subprograms (BLAS) fall short in performance or compatibility. In addition, because the architecture has been updated, BLAS libraries built for the previous Sunway system cannot fully exploit the performance of its successor. To address these challenges, we propose an optimized BLAS for the new-generation Sunway supercomputer. Specifically, to achieve efficient computation, we first propose a parallel optimization method for the Level-1 BLAS (vector-vector operations) and the Level-2 BLAS (matrix-vector operations). We then propose an adaptive scheduling algorithm that balances tasks across core groups for various data sizes. Finally, to obtain highly efficient general matrix multiplication (GEMM) kernels, we propose a parallel optimization method for the Level-3 BLAS (matrix-matrix operations), which includes both source-level and assembly-level optimization. Experimental results show that the memory bandwidth utilization of the optimized Level-1/2 BLAS exceeds 95%, and the computational efficiency of the optimized GEMM kernel exceeds 94%.

