Qklu: a two-dimensional block-cyclic sparse direct solver

  • Abstract
  • Literature Map
  • Similar Papers
Abstract
Translate article icon Translate Article Star icon
Take notes icon Take Notes

Qklu: a two-dimensional block-cyclic sparse direct solver

Similar Papers
  • Conference Article
  • Cite Count Icon 1
  • 10.1109/ipdps53621.2022.00012
Direct solution of larger coupled sparse/dense linear systems using low-rank compression on single-node multi-core machines in an industrial context
  • May 1, 2022
  • Emmanuel Agullo + 2 more

While hierarchically low-rank compression methods are now commonly available in both dense and sparse direct solvers, their usage for the direct solution of coupled sparse/dense linear systems has been little investigated. The solution of such systems is though central for the simulation of many important physics problems such as the simulation of the propagation of acoustic waves around aircrafts. Indeed, the heterogeneity of the jet flow created by reactors often requires a Finite Element Method (FEM) discretization, leading to a sparse linear system, while it may be reasonable to assume as homogeneous the rest of the space and hence model it with a Boundary Element Method (BEM) discretization, leading to a dense system. In an industrial context, these simulations are often operated on modern multicore workstations with fully-featured linear solvers. Exploiting their low-rank compression techniques is thus very appealing for solving larger coupled sparse/dense systems (hence ensuring a finer solution) on a given multicore workstation, and - of course - possibly do it fast. The standard method performing an efficient coupling of sparse and dense direct solvers is to rely on the Schur complement functionality of the sparse direct solver. However, to the best of our knowledge, modern fully-featured sparse direct solvers offering this functionality return the Schur complement as a non compressed matrix. In this paper, we study the opportunity to process larger systems in spite of this constraint. For that we propose two classes of algorithms, namely multi-solve and multi-factorization, consisting in composing existing parallel sparse and dense methods on well chosen submatrices. An experimental study conducted on a 24 cores machine equipped with 128 GiB of RAM shows that these algorithms, implemented on top of state-of-the-art sparse and dense direct solvers, together with proper low-rank assembly schemes, can respectively process systems of 9 million and 2.5 million total unknowns instead of 1.3 million unknowns with a standard coupling of compressed sparse and dense solvers.

  • Research Article
  • Cite Count Icon 4
  • 10.1088/1742-6596/1011/1/012041
The Use of Sparse Direct Solver in Vector Finite Element Modeling for Calculating Two Dimensional (2-D) Magnetotelluric Responses in Transverse Electric (TE) Mode
  • Apr 1, 2018
  • Journal of Physics: Conference Series
  • Lisa’ Yihaa Roodhiyah + 3 more

The late research, linear matrices of vector finite element in two dimensional(2-D) magnetotelluric (MT) responses modeling was solved by non-sparse direct solver in TE mode. Nevertheless, there is some weakness which have to be improved especially accuracy in the low frequency (10-3 Hz-10-5 Hz) which is not achieved yet and high cost computation in dense mesh. In this work, the solver which is used is sparse direct solver instead of non-sparse direct solverto overcome the weaknesses of solving linear matrices of vector finite element metod using non-sparse direct solver. Sparse direct solver will be advantageous in solving linear matrices of vector finite element method because of the matrix properties which is symmetrical and sparse. The validation of sparse direct solver in solving linear matrices of vector finite element has been done for a homogen half-space model and vertical contact model by analytical solution. Thevalidation result of sparse direct solver in solving linear matrices of vector finite element shows that sparse direct solver is more stable than non-sparse direct solver in computing linear problem of vector finite element method especially in low frequency. In the end, the accuracy of 2D MT responses modelling in low frequency (10-3 Hz-10-5 Hz) has been reached out under the efficient allocation memory of array and less computational time consuming.

  • Research Article
  • Cite Count Icon 9
  • 10.1145/1268769.1268772
Experiences of sparse direct symmetric solvers
  • Aug 1, 2007
  • ACM Transactions on Mathematical Software
  • Jennifer A Scott + 1 more

We recently carried out an extensive comparison of the performance of state-of-the-art sparse direct solvers for the numerical solution of symmetric linear systems of equations. Some of these solvers were written primarily as research codes while others have been developed for commercial use. Our experiences of using the different packages to solve a wide range of problems arising from real applications were mixed. In this paper, we highlight some of these experiences with the aim of providing advice to both software developers and users of sparse direct solvers. We discuss key features that a direct solver should offer and conclude that while performance is an essential factor to consider when choosing a code, there are other features that a user should also consider looking for that vary significantly between packages.

  • Conference Article
  • Cite Count Icon 4
  • 10.1109/sc41404.2022.00031
Addressing Irregular Patterns of Matrix Computations on GPUs and Their Impact on Applications Powered by Sparse Direct Solvers
  • Nov 1, 2022
  • Ahmad Abdelfattah + 5 more

Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs. This is due to workloads of small dense matrices that are different in size. Matrix decompositions on such irregular workloads are rarely addressed on GPUs. This paper addresses irregular workloads of matrix computations on GPUs, and their application to accelerate sparse direct solvers. We design an interface for the basic matrix operations supporting problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition on matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition applied to linear systems arising from the simulation, using finite element discretization on unstructured meshes, of a high-frequency indefinite Maxwell problem.

  • Research Article
  • 10.1016/j.epsr.2023.109585
Sparse solver application for parallel real-time electromagnetic transient simulations
  • Jul 13, 2023
  • Electric Power Systems Research
  • B Bruned + 4 more

Sparse solver application for parallel real-time electromagnetic transient simulations

  • Research Article
  • Cite Count Icon 7
  • 10.1111/gwat.13016
On Solving Groundwater Flow and Transport Models with Algebraic Multigrid Preconditioning.
  • May 27, 2020
  • Groundwater
  • M Adil Sbai + 1 more

Iterative solvers preconditioned with algebraic multigrid have been devised as an optimal technology to speed up the response of large sparse linear systems. In this work, this technique was implemented in the framework of the dual delineation approach. This involves a single groundwater flow linear solution and a pure advective transport solution with different right-hand sides. The new solver was compared with other preconditioned iterative methods, the MODFLOW's GMG solver, and direct sparse solvers. Test problems include two- and three-dimensional benchmarks spanning homogeneous and highly heterogeneous and anisotropic formations. For the groundwater flow problems, using the algebraic multigrid preconditioning speeds up the numerical solution by one to two orders of magnitude. The algebraic multigrid preconditioner efficiency was preserved for the three dimensional heterogeneous and anisotropic problem unlike for the MODFLOW's GMG solver. Contrarily, a sparse direct solver was the most efficient for the pure advective transport processes such as the forward travel time simulations. Hence, the best sparse solver for the more general advection-dispersion transport equation is likely to be Péclet number dependent. When equipped with the best solvers, processing multimillion grid blocks by the dual delineation approach is a matter of seconds. This paves the way for its routine application to large geological models. The paper gives practical hints on the strategies and conditions under which algebraic multigrid preconditioning would remain competitive for the class of nonlinear and/or transient problems.

  • Conference Article
  • Cite Count Icon 2
  • 10.23919/date54114.2022.9774499
Exploiting Architecture Advances for Sparse Solvers in Circuit Simulation
  • Mar 14, 2022
  • Zhiyuan Yan + 3 more

Sparse direct solvers provide vital functionality for a wide variety of scientific applications. The dominated part of the sparse direct solver, LU factorization, suffers a lot from the irregularity of sparse matrices. Meanwhile, the specific characteristics of sparse solvers in circuit simulation and unique sparse pattern of circuit matrices provide more design spaces and also great challenges. In this paper, we propose a sparse solver named FLU and re-examine the performance of LU factorization from the perspectives of vectorization, parallelization, and data locality. To improve vectorization efficiency and data locality, FLU introduces a register-level supernode computation method by delicately manipulating data movement. With alternating multiple columns computation, FLU further reduces the off-chip memory accesses greatly. Furthermore, we implement a fine-grained elimination tree based parallelization scheme to fully exploit task-level parallelism. Compared with PARDISO and NICSLU, experimental results show that FLU achieves a speedup up to 19.51 × (3.86 × on average) and 2.56 × (1.66 × on average) on Intel Xeon respectively.

  • Research Article
  • Cite Count Icon 2
  • 10.1007/s11227-016-1892-7
On the improvement of a scalable sparse direct solver for unsymmetrical linear equations
  • Oct 21, 2016
  • The Journal of Supercomputing
  • M Serdar Celebi + 4 more

This paper focuses on the application level improvements in a sparse direct solver specifically used for large-scale unsymmetrical linear equations resulting from unstructured mesh discretization of coupled elliptic/hyperbolic PDEs. Existing sparse direct solvers are designed for distributed server systems taking advantage of both distributed memory and processing units. We conducted extensive numerical experiments with three state-of-the-art direct linear solvers that can work on distributed-memory parallel architectures; namely, MUMPS (MUMPS solver website, http://graal.ens-lyon.fr/MUMPS), WSMP (Technical Report TR RC-21886, IBM, Watson Research Center, Yorktown Heights, 2000), and SUPERLU_DIST (ACM Trans Math Softw 29(2):110---140, 2003). The performance of these solvers was analyzed in detail, using advanced analysis tools such as Tuning and Analysis Utilities (TAU) and Performance Application Programming Interface (PAPI). The performance is evaluated with respect to robustness, speed, scalability, and efficiency in CPU and memory usage. We have determined application level issues that we believe they can improve the performance of a distributed-shared memory hybrid variant of this solver, which is proposed as an alternative solver [SuperLU_MCDT (Many-Core Distributed)] in this paper. The new solver utilizing the MPI/OpenMP hybrid programming is specifically tuned to handle large unsymmetrical systems arising in reservoir simulations so that higher performance and better scalability can be achieved for a large distributed computing system with many nodes of multicore processors. Two main tasks are accomplished during this study: (i) comparisons of public domain solver algorithms; existing state-of-the-art direct sparse linear system solvers are investigated and their performance and weaknesses based on test cases are analyzed, (ii) improvement of direct sparse solver algorithm (SuperLU_MCDT) for many-core distributed systems is achieved. We provided results of numerical tests that were run on up to 16,384 cores, and used many sets of test matrices for reservoir simulations with unstructured meshes. The numerical results showed that SuperLU_MCDT can outperform SuperLU_DIST 3.3 in terms of both speed and robustness.

  • Book Chapter
  • Cite Count Icon 40
  • 10.1007/978-3-642-03869-3_74
PSPIKE: A Parallel Hybrid Sparse Linear System Solver
  • Jan 1, 2009
  • Murat Manguoglu + 2 more

The availability of large-scale computing platforms comprised of tens of thousands of multicore processors motivates the need for the next generation of highly scalable sparse linear system solvers. These solvers must optimize parallel performance, processor (serial) performance, as well as memory requirements, while being robust across broad classes of applications and systems. In this paper, we present a new parallel solver that combines the desirable characteristics of direct methods (robustness) and effective iterative solvers (low computational cost), while alleviating their drawbacks (memory requirements, lack of robustness). Our proposed hybrid solver is based on the general sparse solver PARDISO, and the “Spike” family of hybrid solvers. The resulting algorithm, called PSPIKE, is as robust as direct solvers, more reliable than classical preconditioned Krylov subspace methods, and much more scalable than direct sparse solvers. We support our performance and parallel scalability claims using detailed experimental studies and comparison with direct solvers, as well as classical preconditioned Krylov methods.

  • Conference Article
  • Cite Count Icon 1
  • 10.23919/ropaces.2018.8364197
Sparse direct matrix solvers of finite element discretizations in electromagnetics
  • Mar 1, 2018
  • Marinos N Vouvakis + 1 more

The paper outlines some direct solution strategies for sparse matrices arising from the finite element (FE) discretization of electromagnetic (EM) models, and explores whether the conventional thinking followed by highly optimized direct method tools for sparse matrices is indeed the best available option for directly solving FEM problems in EM. Factorization memory and time, solution time, parallelization potential, and error will be considered in the comparisons between approaches. Drawing from advances in the area of integral equation (IE) methods that rely on relatively small but dense matrices, we will conjecture that direct solvers that minimize factorization fill-in while maintaining sparsity may not lead to the best overall strategy after all. We present an direct solution algorithm for FEM, based on domain decomposition that generates and operates on a lesser sparse matrix with block-wise sparse pattern, thus bridging the gap between the two extreme matrix structures, (a large fully sparse encountered in FEM and a smaller fully dense encountered in IE methods). Early results on 3D arbitrary tetrahedron meshes and arbitrary volumetric geometries suggest that the proposed approach significantly outperform state-of-the-art sparse matrix direct solvers in terms of memory while maintaining competitive run times for serial implementations.

  • Research Article
  • Cite Count Icon 60
  • 10.1002/(sici)1097-0207(19991010)46:4<501::aid-nme685>3.0.co;2-7
Theoretical comparison of the FETI and algebraically partitioned FETI methods, and performance comparisons with a direct sparse solver
  • Aug 23, 1999
  • International Journal for Numerical Methods in Engineering
  • Daniel J Rixen + 3 more

In this paper, we prove that the Algebraic A-FETI method corresponds to one particular instance of the original one-level FETI method. We also report on performance comparisons on an Origin 2000 between the one- and two-level FETI methods and an optimized sparse solver, for two industrial applications: the stress analysis of a thin shell structure, and that of a three-dimensional structure modelled by solid elements. These comparisons suggest that for topologically two-dimensional problems, sparse solvers are effective when the number of processors is relatively small. They also suggest that for three-dimensional applications, scalable domain decomposition methods such as FETI deliver a superior performance on both sequential and parallel hardware configurations. Copyright © 1999 John Wiley & Sons, Ltd.

  • Conference Article
  • 10.3997/2214-4609.201600567
Exploring the Use of SPIKE-based Solvers on Large Electromagnetic Modeling
  • Jan 1, 2016
  • S Rodriguez Bernabeu + 3 more

Frequency-domain seismic and electromagnetic modeling requires solving the linear systems resulting from the discretization of the corresponding time-harmonic equations. Geophysical inversion is typically performed using several discrete frequencies and multiple (up to tens of thousands) source/receiver combinations. Limitations of classical direct and iterative sparse linear solvers have caused the development of the so-called hybrid methods that can be viewed as an intermediate approach between the direct and iterative methods. We present an efficient parallel solver based on the SPIKE algorithm. Several examples in frequency domain electromagnetic modeling illustrate the computational efficiency of the developed method in terms of memory demand and floating-point operations. Multiple sources can be efficiently handled by employing sparse direct solvers in the factorization of diagonal blocks of the system matrix. Based on the divide and conquer idea, this kind of algorithms exposes different parallelism levels, being suitable to take advantage of multiple accelerator devices. The SPIKE solver partially overcomes the fill-in problem of direct solvers, allowing to solve much larger domains on the same system.

  • Conference Article
  • Cite Count Icon 18
  • 10.1109/splc.1993.365566
Parallel block-diagonal-bordered sparse linear solvers for electrical power system applications
  • Oct 6, 1993
  • D.P Koester + 2 more

Research is on-going that examines parallel direct block-diagonal-bordered sparse linear solvers for irregular sparse matrix problems derived from electrical power system applications. Parallel block-diagonal-bordered sparse linear solvers exhibit distinct advantages when compared to current general parallel direct sparse matrix solvers. Our research shows that actual power system matrices can be readily ordered into block-diagonal-bordered form, although load imbalance becomes excessive beyond 16 processors, limiting scalability for a single parallel linear solver within an application. Nevertheless, other dimensions exist in electrical power system applications that can be exploited to efficiently make use of large-scale multi-processors. >

  • Research Article
  • Cite Count Icon 43
  • 10.1190/geo2013-0478.1
Computationally efficient three-dimensional acoustic finite-difference frequency-domain seismic modeling in vertical transversely isotropic media with sparse direct solver
  • Sep 1, 2014
  • GEOPHYSICS
  • Stephane Operto + 5 more

The computational burden of frequency-domain full-waveform inversion (FWI) of wide-aperture fixed-spread data is conventionally reduced by limiting the inversion to a few discrete frequencies. In this framework, frequency-domain seismic modeling is performed efficiently for multiple sources by solving the linear system resulting from the discretization of the time-harmonic wave equation with the massively parallel sparse direct solver. Frequency-domain seismic modeling based on the sparse direct solver (DSFDM) requires specific design finite-difference stencils of compact support to minimize the computational cost of the lower-upper decomposition of the impedance matrix in terms of memory demand and floating-point operations. A straightforward adaptation of such finite-difference stencil, originally developed for the (isotropic) acoustic-wave equation, is proposed to introduce vertical transverse isotropy (VTI) in the modeling without any extra computational cost. The method relies on a fourth-order wave equation, which is decomposed as the sum of a second-order elliptic-wave equation plus an anellipticity correction term. The stiffness matrix of the elliptic-wave equation is easily built from the isotropic stiffness matrix by multiplying its coefficients with factors that depend on Thomsen’s parameters, whereas the anelliptic term is discretized with a parsimonious second-order staggered-grid stencil. Validation of DSFDM against finite-difference time-domain modeling performed in various synthetic models shows that a discretization rule of four grid points per minimum wavelength provides accurate DSFDM solutions. Moreover, comparison between real data from the Valhall field and DSFDM solutions computed in a smooth VTI subsurface model supports that the method can be used as a fast and accurate modeling engine to perform multiparameter VTI FWI of fixed-spread data in the viscoacoustic approximation.

  • Research Article
  • 10.1177/10943420241288567
A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression
  • Sep 30, 2024
  • The International Journal of High Performance Computing Applications
  • Lisa Claus + 3 more

We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is made publicly available on github with a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable sized batched dense linear algebra operations such as GEMM, TRSM and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner runs on NVIDIA, AMD and Intel GPUs. Interfaces are available from PETSc, Trilinos and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster compared to the CPU-only solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA’s cuDSS solver.

Save Icon
Up Arrow
Open/Close
  • Ask R Discovery Star icon
  • Chat PDF Star icon

AI summaries and top papers from 250M+ research sources.