Abstract
The paper describes an efficient direct method for solving the equation Ax = b, where A is a sparse matrix, on the Intel® Xeon Phi™ coprocessor. The main challenge on such a system is engaging all available threads (about 240) and reducing OpenMP* synchronization overhead, which is very expensive for hundreds of threads. The method decomposes A into a product of lower triangular, diagonal, and upper triangular matrices, followed by solves of the resulting three subsystems. The main idea is based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]. Our implementation exploits a static scheduling algorithm during the factorization step to reduce OpenMP synchronization overhead, and a three-level parallelization approach to engage all available threads effectively. Furthermore, we demonstrate that our implementation can perform up to 100 times better on the factorization step and up to 65 times better in overall performance on the 240 threads of the Intel® Xeon Phi™ coprocessor.
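The three-stage solve the abstract refers to (factor A = LDU, then solve Ly = b, Dz = y, Ux = z) can be illustrated on a small dense matrix. The following is a minimal sequential sketch of that idea only, not the paper's parallel sparse implementation; the function name and the no-pivoting assumption are ours.

```python
import numpy as np

def ldu_solve(A, b):
    """Illustrative dense LDU solve: A = L @ diag(D) @ U, with unit-diagonal
    L and U. Assumes all pivots are nonzero (no pivoting is performed)."""
    n = A.shape[0]
    L = np.eye(n)
    U = np.eye(n)
    D = np.zeros(n)
    M = A.astype(float).copy()
    for k in range(n):
        D[k] = M[k, k]                       # pivot for column/row k
        L[k+1:, k] = M[k+1:, k] / D[k]       # column of unit lower factor
        U[k, k+1:] = M[k, k+1:] / D[k]       # row of unit upper factor
        # Schur complement update of the trailing submatrix
        M[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:]) * D[k]
    # Solve the three subsystems: L y = b, D z = y, U x = z
    y = np.zeros(n)
    for i in range(n):                       # forward substitution
        y[i] = b[i] - L[i, :i] @ y[:i]
    z = y / D                                # diagonal solve
    x = np.zeros(n)
    for i in reversed(range(n)):             # backward substitution
        x[i] = z[i] - U[i, i+1:] @ x[i+1:]
    return x
```

In the paper's setting the factors are sparse and the loops above are where the static scheduling and multi-level parallelism described in the abstract apply; this sketch only shows the numerical structure of the three solves.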
Highlights
This paper describes a direct method for solving the equation Ax = b with a sparse matrix A on Intel® Xeon Phi™ coprocessors
It is very important to reduce OpenMP* [2] synchronization overhead because it has a significant impact on overall performance on systems with a large number of threads
We present an OpenMP implementation of the LDU decomposition and of the solve of the resulting triangular systems, based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]
Summary
This paper describes a direct method for solving the equation Ax = b with a sparse matrix A on Intel® Xeon Phi™ coprocessors. The actual factorization of the permuted matrix in LDU form is performed as described in Amestoy [7]. We present an OpenMP implementation of the LDU decomposition and of the solve of the resulting triangular systems, based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]. This approach demonstrates good performance and scalability on a large number of MPI processes [8].