Qklu: a two-dimensional block-cyclic sparse direct solver
- Conference Article
1
- 10.1109/ipdps53621.2022.00012
- May 1, 2022
While hierarchically low-rank compression methods are now commonly available in both dense and sparse direct solvers, their use for the direct solution of coupled sparse/dense linear systems has been little investigated. The solution of such systems is nonetheless central to the simulation of many important physics problems, such as the propagation of acoustic waves around aircraft. Indeed, the heterogeneity of the jet flow created by the engines often requires a Finite Element Method (FEM) discretization, leading to a sparse linear system, while the rest of the space may reasonably be assumed homogeneous and hence modeled with a Boundary Element Method (BEM) discretization, leading to a dense system. In an industrial context, these simulations are often run on modern multicore workstations with fully featured linear solvers. Exploiting their low-rank compression techniques is thus very appealing for solving larger coupled sparse/dense systems (and hence obtaining a finer solution) on a given multicore workstation, and, of course, for doing so quickly. The standard way to couple sparse and dense direct solvers efficiently is to rely on the Schur complement functionality of the sparse direct solver. However, to the best of our knowledge, modern fully featured sparse direct solvers offering this functionality return the Schur complement as a non-compressed matrix. In this paper, we study the opportunity to process larger systems in spite of this constraint. To that end, we propose two classes of algorithms, namely multi-solve and multi-factorization, which compose existing parallel sparse and dense methods on well-chosen submatrices.
An experimental study conducted on a 24-core machine equipped with 128 GiB of RAM shows that these algorithms, implemented on top of state-of-the-art sparse and dense direct solvers, together with proper low-rank assembly schemes, can respectively process systems of 9 million and 2.5 million total unknowns, instead of 1.3 million unknowns with a standard coupling of compressed sparse and dense solvers.
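The baseline these algorithms build on, coupling a sparse FEM block with a dense BEM block through the Schur complement of the sparse part, can be sketched on a toy system. This is a minimal illustration only: the partition, sizes, and use of scipy stand in for the fully featured industrial solvers the paper targets.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)

# Toy coupled system: sparse FEM block A_ss, dense BEM block A_dd,
# and coupling blocks A_sd, A_ds.  Sizes are illustrative only.
n_s, n_d = 200, 30
A_ss = sp.eye(n_s, format="csc") * 4.0 + sp.random(n_s, n_s, 0.02, random_state=0)
A_sd = rng.standard_normal((n_s, n_d)) * 0.01
A_ds = rng.standard_normal((n_d, n_s)) * 0.01
A_dd = np.eye(n_d) * 2.0 + rng.standard_normal((n_d, n_d)) * 0.01

b_s = rng.standard_normal(n_s)
b_d = rng.standard_normal(n_d)

# 1) Factor the sparse block once.
lu = spla.splu(sp.csc_matrix(A_ss))

# 2) Schur complement of the dense part: S = A_dd - A_ds A_ss^{-1} A_sd.
#    (A real sparse solver would return this via its Schur functionality;
#    the paper's point is that it comes back non-compressed.)
S = A_dd - A_ds @ lu.solve(A_sd)

# 3) Solve the dense Schur system, then back-substitute for the sparse part.
x_d = np.linalg.solve(S, b_d - A_ds @ lu.solve(b_s))
x_s = lu.solve(b_s - A_sd @ x_d)

# Check the coupled residual of both block equations.
print(np.linalg.norm(A_ss @ x_s + A_sd @ x_d - b_s),
      np.linalg.norm(A_ds @ x_s + A_dd @ x_d - b_d))
```

The multi-solve and multi-factorization variants proposed in the paper recompose these same steps over submatrices so that the dense Schur block never has to be held uncompressed in full.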
- Research Article
4
- 10.1088/1742-6596/1011/1/012041
- Apr 1, 2018
- Journal of Physics: Conference Series
In previous research, the linear systems arising from a vector finite element formulation of two-dimensional (2-D) magnetotelluric (MT) response modeling were solved with a non-sparse direct solver in TE mode. That approach has weaknesses that need to be addressed, in particular poor accuracy at low frequencies (10^-3 Hz to 10^-5 Hz) and a high computational cost on dense meshes. In this work, a sparse direct solver is used instead of a non-sparse direct solver to overcome these weaknesses. A sparse direct solver is advantageous for the linear systems of the vector finite element method because the matrices are symmetric and sparse. The sparse direct solver was validated against analytical solutions for a homogeneous half-space model and a vertical contact model. The validation shows that the sparse direct solver is more stable than the non-sparse direct solver for the linear systems of the vector finite element method, especially at low frequencies. As a result, accurate 2-D MT responses at low frequencies (10^-3 Hz to 10^-5 Hz) are obtained with efficient array memory allocation and reduced computation time.
- Research Article
9
- 10.1145/1268769.1268772
- Aug 1, 2007
- ACM Transactions on Mathematical Software
We recently carried out an extensive comparison of the performance of state-of-the-art sparse direct solvers for the numerical solution of symmetric linear systems of equations. Some of these solvers were written primarily as research codes while others have been developed for commercial use. Our experiences of using the different packages to solve a wide range of problems arising from real applications were mixed. In this paper, we highlight some of these experiences with the aim of providing advice to both software developers and users of sparse direct solvers. We discuss key features that a direct solver should offer and conclude that while performance is an essential factor to consider when choosing a code, there are other features that a user should also consider looking for that vary significantly between packages.
- Conference Article
4
- 10.1109/sc41404.2022.00031
- Nov 1, 2022
Many scientific applications rely on sparse direct solvers for their numerical robustness. However, performance optimization for these solvers remains a challenging task, especially on GPUs, because their workloads consist of many small dense matrices of different sizes. Matrix decompositions on such irregular workloads are rarely addressed on GPUs. This paper addresses irregular workloads of matrix computations on GPUs and their application to accelerating sparse direct solvers. We design an interface for the basic matrix operations that supports problems of different sizes. The interface enables us to develop irrLU-GPU, an LU decomposition for matrices of different sizes. We demonstrate the impact of irrLU-GPU on sparse direct LU solvers using NVIDIA and AMD GPUs. Experimental results are shown for a sparse direct solver based on a multifrontal sparse LU decomposition, applied to linear systems arising from the finite element discretization on unstructured meshes of a high-frequency indefinite Maxwell problem.
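The operation being batched here, LU with partial pivoting over a set of small dense frontal matrices of varying sizes, has simple CPU reference semantics that can be sketched with scipy. This is only a statement of what a variable-size batched LU computes; irrLU-GPU fuses the whole batch into GPU kernels, which is not shown.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(2)

# A batch of small dense frontal matrices of different sizes, as produced
# by a multifrontal sparse factorization (sizes here are arbitrary).
sizes = [5, 17, 9, 32, 12]
batch = [rng.standard_normal((n, n)) + n * np.eye(n) for n in sizes]
rhs = [rng.standard_normal(n) for n in sizes]

# Reference semantics of a variable-size batched LU with partial pivoting:
# factor each matrix independently, then solve each system.
factors = [lu_factor(A) for A in batch]
sols = [lu_solve(f, b) for f, b in zip(factors, rhs)]

residuals = [np.linalg.norm(A @ x - b) for A, x, b in zip(batch, sols, rhs)]
print(max(residuals))
```

The performance question the paper tackles is precisely that this loop-over-problems form maps poorly onto GPUs, which prefer uniform batches.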
- Research Article
- 10.1016/j.epsr.2023.109585
- Jul 13, 2023
- Electric Power Systems Research
Sparse solver application for parallel real-time electromagnetic transient simulations
- Research Article
7
- 10.1111/gwat.13016
- May 27, 2020
- Groundwater
Iterative solvers preconditioned with algebraic multigrid have been devised as an optimal technology to speed up the solution of large sparse linear systems. In this work, this technique was implemented in the framework of the dual delineation approach. This involves a single groundwater-flow linear solution and a pure advective transport solution with different right-hand sides. The new solver was compared with other preconditioned iterative methods, MODFLOW's GMG solver, and direct sparse solvers. Test problems include two- and three-dimensional benchmarks spanning homogeneous as well as highly heterogeneous and anisotropic formations. For the groundwater flow problems, algebraic multigrid preconditioning speeds up the numerical solution by one to two orders of magnitude. The efficiency of the algebraic multigrid preconditioner was preserved for the three-dimensional heterogeneous and anisotropic problem, unlike for MODFLOW's GMG solver. In contrast, a sparse direct solver was the most efficient for pure advective transport processes such as forward travel-time simulations. Hence, the best sparse solver for the more general advection-dispersion transport equation is likely to be Péclet-number dependent. When equipped with the best solvers, processing multimillion grid blocks by the dual delineation approach is a matter of seconds. This paves the way for its routine application to large geological models. The paper gives practical hints on the strategies and conditions under which algebraic multigrid preconditioning would remain competitive for the class of nonlinear and/or transient problems.
- Conference Article
2
- 10.23919/date54114.2022.9774499
- Mar 14, 2022
Sparse direct solvers provide vital functionality for a wide variety of scientific applications. The dominant part of a sparse direct solver, LU factorization, suffers heavily from the irregularity of sparse matrices. Meanwhile, the specific characteristics of sparse solvers in circuit simulation, and the unique sparsity patterns of circuit matrices, provide a larger design space but also great challenges. In this paper, we propose a sparse solver named FLU and re-examine the performance of LU factorization from the perspectives of vectorization, parallelization, and data locality. To improve vectorization efficiency and data locality, FLU introduces a register-level supernode computation method that delicately manipulates data movement. By computing multiple alternating columns at a time, FLU further reduces off-chip memory accesses greatly. Furthermore, we implement a fine-grained, elimination-tree-based parallelization scheme to fully exploit task-level parallelism. Experimental results show that, compared with PARDISO and NICSLU, FLU achieves speedups of up to 19.51× (3.86× on average) and 2.56× (1.66× on average), respectively, on an Intel Xeon.
- Research Article
2
- 10.1007/s11227-016-1892-7
- Oct 21, 2016
- The Journal of Supercomputing
This paper focuses on application-level improvements to a sparse direct solver used for large-scale unsymmetric linear systems resulting from unstructured-mesh discretization of coupled elliptic/hyperbolic PDEs. Existing sparse direct solvers are designed for distributed server systems, taking advantage of both distributed memory and distributed processing units. We conducted extensive numerical experiments with three state-of-the-art direct linear solvers that run on distributed-memory parallel architectures, namely MUMPS (MUMPS solver website, http://graal.ens-lyon.fr/MUMPS), WSMP (Technical Report TR RC-21886, IBM, Watson Research Center, Yorktown Heights, 2000), and SuperLU_DIST (ACM Trans Math Softw 29(2):110-140, 2003). The performance of these solvers was analyzed in detail using advanced analysis tools such as the Tuning and Analysis Utilities (TAU) and the Performance Application Programming Interface (PAPI), and was evaluated with respect to robustness, speed, scalability, and efficiency in CPU and memory usage. We identified application-level issues that we believe can improve the performance of a distributed/shared-memory hybrid variant of this solver, which is proposed in this paper as an alternative solver, SuperLU_MCDT (Many-Core Distributed). The new solver, which uses MPI/OpenMP hybrid programming, is specifically tuned to handle large unsymmetric systems arising in reservoir simulations, so that higher performance and better scalability can be achieved on a large distributed computing system with many nodes of multicore processors. Two main tasks were accomplished during this study: (i) comparison of public-domain solver algorithms: existing state-of-the-art direct sparse linear solvers were investigated, and their performance and weaknesses were analyzed on test cases; (ii) improvement of a direct sparse solver algorithm (SuperLU_MCDT) for many-core distributed systems.
We provide results of numerical tests run on up to 16,384 cores, using many sets of test matrices from reservoir simulations with unstructured meshes. The numerical results show that SuperLU_MCDT can outperform SuperLU_DIST 3.3 in terms of both speed and robustness.
- Book Chapter
40
- 10.1007/978-3-642-03869-3_74
- Jan 1, 2009
The availability of large-scale computing platforms comprised of tens of thousands of multicore processors motivates the need for the next generation of highly scalable sparse linear system solvers. These solvers must optimize parallel performance, processor (serial) performance, as well as memory requirements, while being robust across broad classes of applications and systems. In this paper, we present a new parallel solver that combines the desirable characteristics of direct methods (robustness) and effective iterative solvers (low computational cost), while alleviating their drawbacks (memory requirements, lack of robustness). Our proposed hybrid solver is based on the general sparse solver PARDISO, and the “Spike” family of hybrid solvers. The resulting algorithm, called PSPIKE, is as robust as direct solvers, more reliable than classical preconditioned Krylov subspace methods, and much more scalable than direct sparse solvers. We support our performance and parallel scalability claims using detailed experimental studies and comparison with direct solvers, as well as classical preconditioned Krylov methods.
- Conference Article
1
- 10.23919/ropaces.2018.8364197
- Mar 1, 2018
The paper outlines some direct solution strategies for sparse matrices arising from the finite element (FE) discretization of electromagnetic (EM) models, and explores whether the conventional thinking followed by highly optimized direct-method tools for sparse matrices is indeed the best available option for directly solving FEM problems in EM. Factorization memory and time, solution time, parallelization potential, and error are considered in the comparisons between approaches. Drawing from advances in the area of integral equation (IE) methods, which rely on relatively small but dense matrices, we conjecture that direct solvers that minimize factorization fill-in while maintaining sparsity may not lead to the best overall strategy after all. We present a direct solution algorithm for FEM, based on domain decomposition, that generates and operates on a less sparse matrix with a block-wise sparsity pattern, thus bridging the gap between the two extreme matrix structures (the large, fully sparse matrices encountered in FEM and the smaller, fully dense matrices encountered in IE methods). Early results on 3D arbitrary tetrahedral meshes and arbitrary volumetric geometries suggest that the proposed approach significantly outperforms state-of-the-art sparse direct solvers in terms of memory while maintaining competitive run times for serial implementations.
- Research Article
60
- 10.1002/(sici)1097-0207(19991010)46:4<501::aid-nme685>3.0.co;2-7
- Aug 23, 1999
- International Journal for Numerical Methods in Engineering
In this paper, we prove that the Algebraic A-FETI method corresponds to one particular instance of the original one-level FETI method. We also report on performance comparisons on an Origin 2000 between the one- and two-level FETI methods and an optimized sparse solver, for two industrial applications: the stress analysis of a thin shell structure, and that of a three-dimensional structure modelled by solid elements. These comparisons suggest that for topologically two-dimensional problems, sparse solvers are effective when the number of processors is relatively small. They also suggest that for three-dimensional applications, scalable domain decomposition methods such as FETI deliver a superior performance on both sequential and parallel hardware configurations. Copyright © 1999 John Wiley & Sons, Ltd.
- Conference Article
- 10.3997/2214-4609.201600567
- Jan 1, 2016
Frequency-domain seismic and electromagnetic modeling requires solving the linear systems resulting from the discretization of the corresponding time-harmonic equations. Geophysical inversion is typically performed using several discrete frequencies and multiple (up to tens of thousands of) source/receiver combinations. Limitations of classical direct and iterative sparse linear solvers have led to the development of so-called hybrid methods, which can be viewed as an intermediate approach between direct and iterative methods. We present an efficient parallel solver based on the SPIKE algorithm. Several examples in frequency-domain electromagnetic modeling illustrate the computational efficiency of the developed method in terms of memory demand and floating-point operations. Multiple sources can be handled efficiently by employing sparse direct solvers in the factorization of the diagonal blocks of the system matrix. Based on the divide-and-conquer idea, this kind of algorithm exposes several levels of parallelism, making it suitable for exploiting multiple accelerator devices. The SPIKE solver partially overcomes the fill-in problem of direct solvers, allowing much larger domains to be solved on the same system.
- Conference Article
18
- 10.1109/splc.1993.365566
- Oct 6, 1993
On-going research examines parallel direct block-diagonal-bordered sparse linear solvers for irregular sparse matrix problems derived from electrical power system applications. Parallel block-diagonal-bordered sparse linear solvers exhibit distinct advantages when compared to current general parallel direct sparse matrix solvers. Our research shows that actual power system matrices can readily be ordered into block-diagonal-bordered form, although load imbalance becomes excessive beyond 16 processors, limiting the scalability of a single parallel linear solver within an application. Nevertheless, other dimensions exist in electrical power system applications that can be exploited to make efficient use of large-scale multiprocessors.
- Research Article
43
- 10.1190/geo2013-0478.1
- Sep 1, 2014
- GEOPHYSICS
The computational burden of frequency-domain full-waveform inversion (FWI) of wide-aperture fixed-spread data is conventionally reduced by limiting the inversion to a few discrete frequencies. In this framework, frequency-domain seismic modeling is performed efficiently for multiple sources by solving the linear system resulting from the discretization of the time-harmonic wave equation with a massively parallel sparse direct solver. Frequency-domain seismic modeling based on a sparse direct solver (DSFDM) requires specifically designed finite-difference stencils of compact support to minimize the computational cost of the lower-upper decomposition of the impedance matrix in terms of memory demand and floating-point operations. A straightforward adaptation of such a finite-difference stencil, originally developed for the (isotropic) acoustic-wave equation, is proposed to introduce vertical transverse isotropy (VTI) into the modeling without any extra computational cost. The method relies on a fourth-order wave equation, which is decomposed as the sum of a second-order elliptic-wave equation plus an anellipticity correction term. The stiffness matrix of the elliptic-wave equation is easily built from the isotropic stiffness matrix by multiplying its coefficients with factors that depend on Thomsen's parameters, whereas the anelliptic term is discretized with a parsimonious second-order staggered-grid stencil. Validation of DSFDM against finite-difference time-domain modeling performed in various synthetic models shows that a discretization rule of four grid points per minimum wavelength provides accurate DSFDM solutions. Moreover, comparison between real data from the Valhall field and DSFDM solutions computed in a smooth VTI subsurface model shows that the method can be used as a fast and accurate modeling engine to perform multiparameter VTI FWI of fixed-spread data in the viscoacoustic approximation.
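The cost structure exploited here, one expensive LU factorization of the impedance matrix amortized over many sources, can be sketched with scipy on a toy problem. The 1-D Helmholtz matrix and all sizes below are illustrative assumptions, not the paper's 3-D stencil or solver.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy 1-D Helmholtz impedance matrix: -u'' - k^2 u with second-order
# finite differences on n interior points (Dirichlet boundaries).
n, k = 500, 30.0
h = 1.0 / (n + 1)
A = sp.diags([-1.0 / h**2, 2.0 / h**2 - k**2, -1.0 / h**2], [-1, 0, 1],
             shape=(n, n), format="csc")

# Factor once (the expensive step the stencil design tries to minimize)...
lu = spla.splu(A)

# ...then solve cheaply for many sources, one point source per column.
n_sources = 50
B = np.zeros((n, n_sources))
B[np.linspace(10, n - 10, n_sources, dtype=int), np.arange(n_sources)] = 1.0
X = lu.solve(B)

print(np.linalg.norm(A @ X - B))
```

Reusing the factorization across right-hand sides is what makes the direct-solver approach attractive for fixed-spread acquisitions with thousands of shots.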
- Research Article
- 10.1177/10943420241288567
- Sep 30, 2024
- The International Journal of High Performance Computing Applications
We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is made publicly available on github with a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable sized batched dense linear algebra operations such as GEMM, TRSM and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner runs on NVIDIA, AMD and Intel GPUs. Interfaces are available from PETSc, Trilinos and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster compared to the CPU-only solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. 
For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA’s cuDSS solver.
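The block low-rank idea underlying the BLR scheme, replacing an off-diagonal block of a front by truncated low-rank factors, can be illustrated with a plain truncated SVD in numpy. This is a sketch of the concept only: STRUMPACK and KBLAS use adaptive randomized sampling rather than a full SVD, and the kernel block below is a made-up example.

```python
import numpy as np

# A smooth-kernel off-diagonal block, numerically low rank (as arises
# between well-separated index sets in a multifrontal front).
x = np.linspace(0.0, 1.0, 256)
y = np.linspace(2.0, 3.0, 256)
block = 1.0 / np.abs(x[:, None] - y[None, :])   # 1/r kernel, well separated

def blr_compress(A, tol):
    # Truncated SVD: keep singular values above tol relative to the largest.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int((s >= tol * s[0]).sum())
    return U[:, :r] * s[:r], Vt[:r]   # store 2*n*r entries instead of n*n

L, R = blr_compress(block, 1e-8)
rank = L.shape[1]
err = np.linalg.norm(L @ R - block) / np.linalg.norm(block)
print(rank, err, (L.size + R.size) / block.size)
```

The factorization then operates on the compressed factors, which is where the memory and time savings reported for the BLR-enabled solver come from.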