Abstract

The paper describes an efficient direct method for solving the equation Ax = b, where A is a sparse matrix, on the Intel® Xeon Phi™ coprocessor. The main challenges on such a system are engaging all available threads (about 240) and reducing OpenMP* synchronization overhead, which is very expensive with hundreds of threads. The method decomposes A into a product of lower-triangular, diagonal, and upper-triangular matrices and then solves the three resulting subsystems. The main idea is based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]. Our implementation uses a static scheduling algorithm during the factorization step to reduce OpenMP synchronization overhead. To engage all available threads effectively, a three-level parallelization approach is used. Furthermore, we demonstrate that our implementation can perform up to 100 times better on the factorization step and up to 65 times better in overall performance on the 240 threads of the Intel® Xeon Phi™ coprocessor.

Highlights

  • This paper describes a direct method for solving the equation Ax = b with a sparse matrix A on Intel® Xeon Phi™ coprocessors

  • It is very important to reduce OpenMP* [2] synchronization overhead because it has a significant impact on the overall performance on systems with a large number of threads

  • We present an OpenMP implementation of the LDU decomposition and of the solution of the resulting triangular systems, based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]


Summary

Introduction

This paper describes a direct method for solving the equation Ax = b with a sparse matrix A on Intel® Xeon Phi™ coprocessors. The actual factorization of the permuted matrix into LDU form is performed as described in Amestoy [7]. We present an OpenMP implementation of the LDU decomposition and of the solution of the resulting triangular systems, based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]. This approach demonstrates good performance and scalability on a large number of MPI processes [8].
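To make the "factorize, then solve three subsystems" structure concrete, the following is a minimal serial sketch in Python. It is not the paper's implementation: it assumes a small dense matrix, no permutation or pivoting, and no parallelism, and the function names `ldu_decompose` and `ldu_solve` are hypothetical. It only illustrates that A = LDU, with unit-diagonal L and U, turns Ax = b into a forward substitution, a diagonal scaling, and a backward substitution.

```python
def ldu_decompose(a):
    """Factor a square matrix as A = L * D * U (illustrative, no pivoting).

    L is unit lower-triangular, U is unit upper-triangular, and D is
    returned as a list of its diagonal entries. Assumes every pivot is
    nonzero; a real sparse solver would permute the matrix first.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    diag = [0.0] * n
    for k in range(n):
        # Pivot: what remains of a[k][k] after earlier columns are eliminated.
        diag[k] = a[k][k] - sum(lower[k][p] * diag[p] * upper[p][k] for p in range(k))
        if diag[k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            # Column k of L and row k of U, both scaled by the pivot.
            lower[i][k] = (a[i][k] - sum(lower[i][p] * diag[p] * upper[p][k]
                                         for p in range(k))) / diag[k]
            upper[k][i] = (a[k][i] - sum(lower[k][p] * diag[p] * upper[p][i]
                                         for p in range(k))) / diag[k]
    return lower, diag, upper


def ldu_solve(lower, diag, upper, b):
    """Solve L D U x = b as three subsystems: L z = b, D y = z, U x = y."""
    n = len(b)
    # Forward substitution with the unit lower-triangular factor.
    z = [0.0] * n
    for i in range(n):
        z[i] = b[i] - sum(lower[i][j] * z[j] for j in range(i))
    # Diagonal solve.
    y = [z[i] / diag[i] for i in range(n)]
    # Backward substitution with the unit upper-triangular factor.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(upper[i][j] * x[j] for j in range(i + 1, n))
    return x
```

For example, `ldu_decompose([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])` yields the diagonal `[2.0, 1.0, 2.0]`, and passing the factors to `ldu_solve` with `b = [7.0, 19.0, 49.0]` recovers `x = [1.0, 2.0, 3.0]`. Once the factors exist, each additional right-hand side costs only the three cheap substitution passes.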

Algorithm Description
Factorization Step
Solution Step
Experimental Results
Conclusions