Abstract
This paper presents an analysis and optimization of the High Performance Conjugate Gradient (HPCG) benchmark on the Sunway many-core processor. On modern multi-core and many-core processors, HPCG typically performs poorly and under-utilizes computational resources because of its low arithmetic intensity and fine-grained parallelism. We apply two conventional methods, Level-Scheduling (LS) and Multi-Coloring (MC), to parallelize the Gauss-Seidel smoother, the most time-consuming kernel in HPCG. These strategies are effective and achieve 1.54x and 5.52x performance improvements. To overcome the poor locality of MC and the limited parallelism of LS, we propose a novel Hierarchical Grid (HG) algorithm. Together, our algorithmic and architecture-aware optimizations achieve an aggregate performance of 3.54 Gflops on a single core group (CG) of the SW26010 processor, which is around 0.475% of peak performance and 15.4x higher than the reference implementation. With MPI parallelization, we balance parallelism, pre-processing, convergence rate, and communication overhead, and achieve 192 Tflops (70% parallel efficiency) when scaling to 81,920 CGs (5,324,800 cores) on the Sunway TaihuLight system. Moreover, we analyze the adaptability of our parallel method and optimization strategies and summarize several key points for refactoring and optimizing HPC applications on the Sunway heterogeneous many-core processor.