HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

Yulu Jia, Azzam Haidar, Khairul Kabir, Jack Dongarra, Stanimire Tomov, Mark Gates, Piotr Luszczek

doi:10.1155/2015/502593

Abstract

This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high-performance library that incorporates the developments presented here and, more broadly, provides DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that mix multicore CPUs and coprocessors. LAPACK compliance simplifies the use of the MAGMA MIC library in applications while providing them with portable, high-performance DLA. High performance is obtained through optimized BLAS, hardware-specific tuning, and a hybridization methodology in which we split the algorithm into computational tasks of various granularities. Execution of those tasks is scheduled over the heterogeneous hardware so as to minimize data movement and to map algorithmic requirements to the architectural strengths of the various hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which shields the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.

Highlights

Solving linear systems of equations and eigenvalue problems is fundamental to scientific computing
This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors
This is not the first time that DLA libraries have needed a redesign to be efficient on new architectures, notable examples being the transition from LINPACK [4] to LAPACK [1] in the 1980s to make algorithms cache-friendly

Summary

Introduction

Solving linear systems of equations and eigenvalue problems is fundamental to scientific computing. The popular LAPACK library [1], and in particular its vendor-optimized implementations such as Intel’s MKL [2] or AMD’s ACML [3], has been the software of choice for dense-matrix solver routines on shared-memory systems. This paper considers a redesign of the LAPACK algorithms and their implementation to add efficient support for heterogeneous systems of multicore processors with Intel Xeon Phi coprocessors. This is not the first time that DLA libraries have needed a redesign to be efficient on new architectures, notable examples being the transition from LINPACK [4] to LAPACK [1] in the 1980s to make algorithms cache-friendly. The PLASMA and MAGMA libraries [6] target efficiency on multicore and heterogeneous architectures, respectively.

Methods

Results

Conclusion