A Framework to Exploit Data Sparsity in Tile Low-Rank Cholesky Factorization

George Bosilca,Yu Pei,David Keyes,Hatem Ltaief,Rabab Alomairy,Qinglei Cao,Jack Dongarra

doi:10.1109/ipdps53621.2022.00047

Abstract

We present a general framework that couples the PaRSEC runtime system and the HiCMA numerical library to solve challenging 3D data-sparse problems. Though formally dense, many matrix operators possess a rank structured property that can be exploited during the most time-consuming computational phase, i.e., the matrix factorization. In particular, this work highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by compressing the dense operator. Using Tile Low-Rank (TLR) approximation, our approach consists in capturing the most significant information in each tile of the matrix using a threshold which satisfies the application's accuracy requirements. Matrix operations are performed on the compressed data layout, reducing memory footprint and algorithmic complexity. Our proposed software solution accommodates a range of traditional data structures of linear algebra, i.e., from dense and data-sparse to sparse, within a single matrix operation. Separation of concerns is at the heart: hardware-agnostic implementation, asynchronous execution with a dynamic runtime system, and high performance numerical kernels, to prepare scientific applications to embrace exascale opportunities. This ambition necessitates extensions to PaRSEC that incorporate information related to data structure and rank distribution into the runtime decision-making. We introduce two runtime optimizations to address the challenges encountered when confronted with a large rank disparity: (1) a trimming procedure performed at runtime to cut away data dependencies from the directed acyclic graph discovered to be no longer required after compression and (2) a rank-aware diamond-shaped data distribution to mitigate the load imbalance overheads, reduce data movement, and conserve memory foot-print. We assess our implementation using 3D unstructured mesh deformation based on Radial Basis Function (RBF) interpolation. We report performance results on two different high-performance supercomputers and compare against existing state-of-the-art implementation. Our implementation shows up to 7-fold on Shaheen II and 9-fold on Fugaku performance superiority in situations where the 3D unstructured mesh deformation application renders a matrix operator with low density. Our software framework solves a formally dense 3D problem with 52M mesh points on 65K cores in about half an hour. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system, by synergistically bridging matrix algebra libraries with scientific applications.

Full Text