Abstract

The study of matrix multiplication on the emerging SW26010 processor is highly significant for many scientific and engineering applications. The state-of-the-art work from the swBLAS library, called SWMM, focuses mainly on the infrequent case involving special matrix dimensions and determines the execution action of matrix multiplication by one specified algorithm. To further adapt to various matrix shapes, in this article, we present a runtime adaptive matrix multiplication methodology, called RTAMM, which targets the features of the SW26010 architecture. The execution action of RTAMM is determined dynamically at runtime via several fundamental cost formulas and multiple sets of blocking factors, rather than determining the action at library generation time. With comprehensive trade-offs between the computation and data access, overall architecture-oriented optimization methods are introduced at three levels (macro, assistant, and micro) to fully exploit the computing capability of SW26010. The experiments show that RTAMM can achieve competitive peak performance compared with SWMM. Moreover, in tests on 6000 different matrix multiplication cases, RTAMM outperforms SWMM in 85.55% of the cases, and the improvements range from 5% to 308%, whereas RTAMM is slightly inferior to SWMM in only 1.28% of the cases. These results demonstrate that RTAMM has both great adaptability and considerable performance improvement.

Highlights

  • As an application program interface standard, BLAS (Basic Linear Algebra Subprograms) [1] contains many primary vector and matrix operations, which can be applied to different types of linear algebraic calculations [2].

  • The Sunway TaihuLight [8], developed by China's National Research Center of Parallel Computer Engineering and Technology, is the first supercomputer in the world with a peak performance exceeding 100 PFlops; it is composed mainly of 40,960 SW26010 heterogeneous many-core processors.

  • To solve the above problems, in this article, we present a runtime adaptive matrix multiplication methodology, called RTAMM, that targets the architectural features of SW26010.


Summary

INTRODUCTION

As an application program interface standard, BLAS (Basic Linear Algebra Subprograms) [1] contains many primary vector and matrix operations, which can be applied to different types of linear algebraic calculations [2]. The state-of-the-art SWMM implementation in swBLAS is tuned for a special case of matrix dimensions; general matrix multiplication can rely on this special case only at the expense of superfluous computation and data access overheads, which diminishes the adaptability that matters most in real-world applications. Another nonnegligible consideration is that different matrix shapes have complicated characteristics, that is, various scales, ratios, and data alignments. If only one fixed execution action is relied upon, a highly efficient implementation is not feasible across different matrix multiplication cases. The key novelty of this work is the coordination of several fundamental cost formulas and multiple sets of blocking factors, where each cost formula corresponds to one matrix multiplication algorithm.
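To make this selection mechanism concrete, the C sketch below shows one way such a runtime decision could look: each candidate algorithm carries a cost formula and its own sets of blocking factors, and the pair with the lowest estimated cost is chosen for the given (M, N, K) at run time. All identifiers (block_cfg, algorithm, select_execution_action) and the shape of the cost model are illustrative assumptions, not the actual RTAMM interface or its formulas.

#include <stddef.h>
#include <float.h>

/* Illustrative blocking factors for the M, N, and K dimensions. */
typedef struct {
    int mb, nb, kb;
} block_cfg;

/* A candidate matrix multiplication algorithm: its blocking-factor sets
 * and a cost formula estimating the expense of an (M, N, K) problem. */
typedef struct {
    const char *name;
    const block_cfg *cfgs;
    size_t ncfgs;
    double (*cost)(int M, int N, int K, const block_cfg *cfg);
} algorithm;

/* The runtime decision: which algorithm and which blocking factors to use. */
typedef struct {
    const algorithm *alg;
    const block_cfg *cfg;
    double cost;
} decision;

/* Evaluate every (algorithm, blocking-factor) pair with its cost formula
 * and keep the cheapest one for the problem at hand. */
static decision select_execution_action(const algorithm *algs, size_t nalgs,
                                        int M, int N, int K)
{
    decision best = { NULL, NULL, DBL_MAX };
    for (size_t a = 0; a < nalgs; ++a) {
        for (size_t c = 0; c < algs[a].ncfgs; ++c) {
            double cost = algs[a].cost(M, N, K, &algs[a].cfgs[c]);
            if (cost < best.cost) {
                best.alg = &algs[a];
                best.cfg = &algs[a].cfgs[c];
                best.cost = cost;
            }
        }
    }
    return best;
}

In this sketch the decision is made per call, which is the key contrast with fixing the execution action once at library generation time.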

BACKGROUND
MICRO OPTIMIZATION
ADAPTIVE ENGINE CONSTRUCTION
Findings
CONCLUSION