Tuning a general purpose software cache library for TaihuLight’s SW26010 processor

Xiaohui Duan, Wei Xue, Lin Gan, Guangwen Yang, Weiguo Liu, Haohuan Fu, Meng Zhang

doi:10.1007/s42514-020-00031-y

Abstract

The Sunway TaihuLight supercomputer has been installed for several years, and many applications have been ported to or built for it. Initially, most applications running on TaihuLight had regular memory access patterns, such as dense linear algebra, structured grids and dynamic programming. In 2018, developers published a general purpose graph processing framework, a port of LAMMPS and a sparse triangular solver. These applications have irregular memory access patterns, which require a great deal of special processing to make use of the computing processing elements (CPEs) of TaihuLight. While those strategies are efficient, applying such processing may be difficult for a wider range of applications, especially constantly changing molecular dynamics applications or dynamic unstructured grids. In this paper, we present the design of a general purpose software cache library, SWCache, which simplifies applying a software cache in kernels, together with a series of tools for tuning and modelling the performance of the software cache. After a series of optimizations, including reordering branches for better branch prediction and hand-tuning register allocation, we evaluate our implementation in two mini-apps: miniFE and miniMD. Experiments show that our tuned software cache library can be applied in these applications and provides a 20% speedup in miniMD compared to the strategies in a previous port of LAMMPS. Using our library also reduces the amount of code that has to be written. In addition, our experience with efficient macro-based programming should be valuable for further application development on CPEs, which lack C++ support.

Full Text