Abstract
The Sunway TaihuLight, equipped with 10 million cores, is currently the world's third fastest supercomputer. Sparse matrix-vector multiplication (SpMV) is one of the core algorithms in many high-performance computing applications. This paper implements a fine-grained design for generic parallel SpMV based on the special Sunway architecture and identifies three main performance limitations: limited storage, load imbalance, and the large overhead of irregular memory accesses. To address these problems, this paper introduces a customized and accelerative framework for SpMV (CASpMV) on the Sunway. CASpMV customizes an auto-tuning four-way partition scheme for SpMV, based on a proposed statistical model that describes the structural characteristics of the sparse matrix, so that the computation better fits the computing architecture and memory hierarchy of the Sunway. Moreover, CASpMV provides an accelerative method and customized optimizations to avoid irregular memory accesses and further improve performance on the Sunway. On a single CG (which corresponds to an MPI process) of the Sunway, CASpMV achieves an average performance improvement ranging from 588.05 to 2118.62 percent over the generic parallel SpMV, and it scales well across multiple CGs. Performance comparisons of CASpMV with state-of-the-art methods on the Sunway indicate that the sparsity and irregularity of the data structures have less impact on CASpMV.
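To make the load-imbalance limitation concrete, the following is a minimal sketch of a baseline SpMV together with a row partition balanced by nonzero count. It is not the paper's CASpMV or its four-way partition scheme. The CSR storage format and the names `spmv_csr` and `partition_rows_by_nnz` are assumptions chosen for illustration.

```python
from bisect import bisect_left


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed format)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx, which is the irregular memory access.
            y[i] += vals[k] * x[col_idx[k]]
    return y


def partition_rows_by_nnz(row_ptr, parts):
    """Split rows into contiguous blocks holding roughly equal nonzero counts.

    Returns block boundaries: block p covers rows bounds[p] .. bounds[p + 1] - 1.
    """
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, parts):
        target = nnz * p / parts
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        bounds.append(min(max(cut, bounds[-1]), n_rows))
    bounds.append(n_rows)
    return bounds


# Hypothetical 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

y = spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
bounds = partition_rows_by_nnz(row_ptr, 2)
```

Splitting this example by row count alone would give blocks with unequal work, whereas splitting by nonzero count assigns three nonzeros to each of the two blocks. The paper's scheme additionally accounts for the Sunway memory hierarchy, which this sketch does not model.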