NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

Wei Zhang, Zihao Jiang, Yang Ou, Nong Xiao, Zhiguang Chen

doi:10.3390/electronics10161984

Open Access

Abstract

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems because of their highly competitive performance and energy efficiency, so it is worthwhile to design a high-performance DGEMM for them. However, as ARMv8-based SoCs integrate an increasing number of cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop a high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization on NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, which significantly enhances the scalability of DGEMM and increases its performance by 17.1% on average, with a maximum improvement of 21.9%.

Highlights

The results show that non-uniform memory access (NUMA)-aware double-precision general matrix multiplication (DGEMM) effectively reduces the number of cross-die and cross-chip memory accesses, which significantly enhances the scalability of DGEMM and increases its performance by 17.1% on average, with a maximum improvement of 21.9%.
Unlike Su’s and Wail’s methods, we reduce the impact of the NUMA effect on performance by reducing the number of cross-die and cross-chip memory access events when DGEMM runs on a multi-NUMA architecture compatible with ARMv8.
The experiments use a server (Kunpeng920 5250 processor) described in Table 2, which is a high-performance server with multicore processors based on the ARMv8 architecture.

Summary

Introduction

Owing to their high performance and energy efficiency, ARM-based SoCs have stimulated the development of ARM-based servers [1]. Such servers are widely used in data centers but are also applicable to supercomputers. We provide a brief insight into the details of the Kunpeng920 dual-chip processor architecture, non-uniform memory access (NUMA), and the implementation of DGEMM in BLAS3. The Kunpeng920 core uses TaiShanV110, which is compatible with the ARMv8A architecture. The dual-chip system is based on two 48-core processors.
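
The two levels of parallelism named in the abstract (between NUMA nodes and within a node) can be illustrated with a small sketch. This is not the authors' OpenBLAS implementation: it is a minimal Python illustration that assumes the rows of C are split first across nodes and then across the threads of each node, and it runs the blocks sequentially instead of pinning real threads. The function names and the node and thread counts are hypothetical.

```python
def split_range(total, parts):
    """Split range(total) into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def two_level_partition(m, num_nodes, threads_per_node):
    """Map each (node, thread) pair to a contiguous row block of C.

    Level 1 splits the m rows across NUMA nodes; level 2 splits each
    node's share across that node's threads (assumed scheme).
    """
    plan = {}
    for node, (n_lo, n_hi) in enumerate(split_range(m, num_nodes)):
        local = split_range(n_hi - n_lo, threads_per_node)
        for thread, (t_lo, t_hi) in enumerate(local):
            plan[(node, thread)] = (n_lo + t_lo, n_lo + t_hi)
    return plan


def dgemm_block(alpha, a, b, beta, c, row_lo, row_hi):
    """Update rows [row_lo, row_hi) of C in place: C = alpha*A*B + beta*C."""
    k_dim, n_dim = len(b), len(b[0])
    for i in range(row_lo, row_hi):
        for j in range(n_dim):
            acc = 0.0
            for k in range(k_dim):
                acc += a[i][k] * b[k][j]
            c[i][j] = alpha * acc + beta * c[i][j]


def numa_style_dgemm(alpha, a, b, beta, c, num_nodes, threads_per_node):
    """Run every (node, thread) block in turn and return the partition plan.

    A real implementation would bind each thread to its node and keep that
    node's rows of A and C in node-local memory; here the blocks simply run
    one after another to show that they are independent.
    """
    plan = two_level_partition(len(a), num_nodes, threads_per_node)
    for (_node, _thread), (lo, hi) in plan.items():
        dgemm_block(alpha, a, b, beta, c, lo, hi)
    return plan


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    b = [[1.0, 0.0], [0.0, 1.0]]
    c = [[1.0, 1.0] for _ in range(5)]
    plan = numa_style_dgemm(2.0, a, b, 0.5, c, num_nodes=2, threads_per_node=2)
    print(plan)
    print(c)
```

Because each (node, thread) pair owns a disjoint set of rows of C, no block writes to another block's output, which is the kind of task independence the abstract refers to. How the shared matrix B is placed or replicated across nodes is left out of this sketch.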

Methods

Results

Discussion

Conclusion