Abstract

Many problems in diverse areas of science and engineering involve the solution of large-scale sparse systems of linear equations. In most of these scenarios they are also a computational bottleneck, and therefore their efficient solution on parallel architectures has motivated a tremendous volume of research. This dissertation targets the use of GPUs to enhance the performance of the solution of sparse linear systems using iterative methods complemented with state-of-the-art preconditioning techniques. In particular, we study ILUPACK, a package for the solution of sparse linear systems via Krylov subspace methods that relies on a modern inverse-based multilevel ILU (incomplete LU) preconditioning technique. We present new data-parallel versions of the preconditioner and of the most important solvers contained in the package that significantly improve its performance without affecting its accuracy. Additionally, we enhance existing task-parallel versions of ILUPACK for shared- and distributed-memory systems with the inclusion of GPU acceleration. The results obtained show a considerable reduction in the runtime of the methods, as well as the possibility of addressing large-scale problems efficiently.

Highlights

  • Sparse systems of linear equations appear in many areas of knowledge, such as circuit simulation, optimal control, quantum mechanics or economics [1, 2]

  • A number of current real-world applications involve linear systems with millions of equations and unknowns. Direct solvers such as those based on Gaussian Elimination (GE) [3], which apply a sequence of matrix transformations to reach an equivalent but easier-to-solve system, today fall short when solving large-scale problems because of their excessive memory requirements, impractical time-to-solution, and implementation complexity (see the direct-solve sketch after this list)

  • The favorable numerical properties of ILUPACK’s preconditioner in the context of iterative solvers come at the cost of expensive construction and application procedures, especially for large-scale sparse linear systems. This high computational cost motivated the development of task-parallel implementations of ILUPACK for shared-memory and message-passing platforms [5, 6, 7] but, despite showing good performance and scalability results, these variants of ILUPACK are limited to the solution of symmetric positive-definite (SPD) linear systems, and they slightly modify the preconditioner to exploit task-parallelism
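To make the direct-solver discussion concrete, the following minimal sketch (plain SciPy, not code from the dissertation or from ILUPACK; the tridiagonal test matrix is a hypothetical convection-diffusion-like example) solves a sparse system with a GE-based sparse LU factorization. The fill-in generated during factorization is precisely what makes this approach impractical at very large scales.

```python
# Minimal sketch, assuming SciPy is available: a direct sparse solve via LU
# factorization (the GE-based approach referred to above). NOT ILUPACK code.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import splu

n = 1_000
# Hypothetical nonsymmetric tridiagonal test matrix (convection-diffusion-like).
A = diags([-1.0, 2.0, -1.2], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

lu = splu(A)     # sparse LU factorization; fill-in grows quickly for 3D PDE problems
x = lu.solve(b)  # forward/backward triangular solves
print("residual:", np.linalg.norm(A @ x - b))
```

For the problem sizes the dissertation targets, the memory consumed by the computed L and U factors makes this route infeasible, which motivates the preconditioned iterative alternative sketched after the Introduction below.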


Summary

Introduction

Sparse systems of linear equations appear in many areas of knowledge, such as circuit simulation, optimal control, quantum mechanics or economics [1, 2]. A number of current real-world applications (for example, three-dimensional PDEs) involve linear systems with millions of equations and unknowns. Direct solvers such as those based on Gaussian Elimination (GE) [3], which apply a sequence of matrix transformations to reach an equivalent but easier-to-solve system, today fall short when solving large-scale problems because of their excessive memory requirements, impractical time-to-solution, and implementation complexity. The favorable numerical properties of ILUPACK’s preconditioner in the context of iterative solvers come at the cost of expensive construction and application procedures, especially for large-scale sparse linear systems. This high computational cost motivated the development of task-parallel implementations of ILUPACK for shared-memory and message-passing platforms [5, 6, 7] but, despite showing good performance and scalability results, these variants of ILUPACK are limited to the solution of symmetric positive-definite (SPD) linear systems, and they slightly modify the preconditioner to exploit task-parallelism. The experimental evaluation is performed on the Jetson AGX Xavier board, one of the latest low-power computing platforms from NVIDIA (Section 8).
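The pairing that the dissertation accelerates, an ILU-type preconditioner driving a Krylov solver, can be illustrated with the following minimal SciPy sketch. It uses a simplified single-level ILU (spilu) rather than ILUPACK’s inverse-based multilevel preconditioner, and the same hypothetical test matrix as the earlier sketch; it is an analogue of the general technique, not the package’s implementation.

```python
# Minimal sketch, assuming SciPy: a single-level ILU preconditioner applied
# inside BiCGStab. A simplified analogue of the ILU-preconditioned Krylov
# approach, NOT ILUPACK's inverse-based multilevel ILU.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spilu, bicgstab, LinearOperator

n = 100_000
A = diags([-1.0, 2.0, -1.2], offsets=[-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spilu(A, drop_tol=1e-4, fill_factor=10)  # approximate (incomplete) LU factors
M = LinearOperator((n, n), matvec=ilu.solve)   # preconditioner: apply M^{-1}r via triangular solves
x, info = bicgstab(A, b, M=M)                  # info == 0 signals convergence
print(info, np.linalg.norm(A @ x - b))
```

The kernels that dominate each iteration here (the sparse matrix-vector product and the sparse triangular solves that apply the preconditioner) are the operations that the data-parallel variants developed in the dissertation offload to the GPU.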

Objectives
Sparse systems of linear equations
ILUPACK
Task-parallel ILUPACK
Related work
Data-parallel variants
Leveraging task and data parallelism in BiCG
Variant for two GPUs
Leveraging task-parallelism in the GPU using streams
Concurrent CPU-GPU variant
Self-scheduled sparse triangular solvers
Variant of BiCGStab for low-power devices
Concluding Remarks