Articles published on Basic Linear Algebra Subroutines
- Research Article
- 10.1145/3727344
- Jun 9, 2025
- ACM Transactions on Parallel Computing
- Hussam Al Daas + 5 more
In this article, we focus on the communication costs of three symmetric matrix computations: (i) multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK); (ii) adding the product of a matrix with the transpose of another matrix to the transpose of that product, known as a symmetric rank-2k update (SYR2K); and (iii) performing matrix multiplication with a symmetric input matrix (SYMM). All three computations appear in the Level 3 Basic Linear Algebra Subroutines (BLAS) and are widely used in applications involving symmetric matrices. We establish communication lower bounds for these kernels using sequential and distributed-memory parallel computational models, and we show that our bounds are tight by presenting communication-optimal algorithms for each setting. Our lower bound proofs rely on applying a geometric inequality for symmetric computations and analytically solving constrained nonlinear optimization problems. In the optimal algorithms, the symmetric matrix is accessed, and the corresponding computations are performed, according to a triangular block partitioning scheme.
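For reference, the three kernels themselves are compact enough to state in a few lines of NumPy; the sketch below shows only the operations (not the paper's communication-optimal algorithms), with sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3
A = rng.standard_normal((n, k))
B = rng.standard_normal((n, k))
S = rng.standard_normal((n, n))
S = (S + S.T) / 2            # symmetric input for SYMM
X = rng.standard_normal((n, n))

C_syrk  = A @ A.T            # SYRK:  C = A A^T   (symmetric rank-k update)
C_syr2k = A @ B.T + B @ A.T  # SYR2K: C = A B^T + B A^T (symmetric rank-2k update)
C_symm  = S @ X              # SYMM:  multiply by a symmetric matrix S

# Both update results are symmetric, so an optimized kernel only needs to
# compute (and communicate) one triangle -- the triangular partitioning idea.
assert np.allclose(C_syrk, C_syrk.T) and np.allclose(C_syr2k, C_syr2k.T)
```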
- Research Article
- 10.3390/math13050787
- Feb 27, 2025
- Mathematics
- Shang Li + 4 more
In deep learning, convolutional layers typically bear the majority of the computational workload and are often the primary contributors to performance bottlenecks. The most widely used convolution algorithm is based on the IM2COL transform, which recasts convolution as a call to the highly optimized GEMM (general matrix multiplication) kernel of a BLAS (Basic Linear Algebra Subroutine) library, but this transform tends to incur additional memory overhead. Recent studies have indicated that direct convolution approaches can outperform traditional convolution implementations without additional memory overhead. In this paper, we propose a high-performance implementation of the direct convolution algorithm for inference that preserves the channel-first data layout of the convolutional layer inputs/outputs. We evaluate the performance of our proposed algorithm on a multi-core ARM CPU platform and compare it with state-of-the-art convolution optimization techniques. Experimental results demonstrate that our new algorithm performs better across the evaluated scenarios and platforms.
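A minimal sketch of the IM2COL-plus-GEMM baseline the paper argues against may make the memory overhead concrete; the function below is illustrative (stride 1, no padding, deep-learning cross-correlation convention) and is not the authors' implementation:

```python
import numpy as np

def im2col_conv2d(x, w):
    """Naive im2col convolution: x is (C, H, W), w is (F, C, KH, KW)."""
    C, H, W = x.shape
    F, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    # IM2COL: copy every receptive field into a column.  This buffer is
    # roughly KH*KW times the input size -- the overhead direct convolution avoids.
    cols = np.empty((C * KH * KW, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i + KH, j:j + KW].ravel()
    # One large GEMM then does all the arithmetic.
    out = w.reshape(F, -1) @ cols
    return out.reshape(F, OH, OW)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
y = im2col_conv2d(x, w)   # shape (4, 6, 6)
```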
- Research Article
- 10.17587/prin.14.329-338
- Jul 27, 2023
- Programmnaya Ingeneria
- V A Egunov + 1 more
This paper considers the problem of increasing software efficiency, in the sense of reducing development and operating costs, in the process of solving production and research tasks. We analyse existing approaches to this problem using the example of parameterized algorithms implementing the mVm (matrix-vector multiplication) and MMM (matrix-matrix multiplication) BLAS (Basic Linear Algebra Subroutines) operations. To this end, we propose a new design method intended to improve data-caching behaviour when developing software for computing systems with a hierarchical memory structure. Using the proposed design procedure, we develop an analytical approach to evaluating software effectiveness from the point of view of its use of a hierarchical memory subsystem. We apply the proposed method to the two-sided Householder transformation for the task of reducing a general matrix to Hessenberg form, and present new algorithms for solving this problem that are optimized variants of the classical Householder transformation: Row-Oriented Householder and Single-Pass Householder. Using these algorithms can significantly reduce execution time. Computational experiments were carried out on a shared-memory parallel system, one of the nodes of the computing cluster of Volgograd State Technical University. We compared the execution times of programs that reduce general matrices to Hessenberg form written with the proposed algorithms against the LAPACKE_dgehrd() function of the Intel MKL library. The conclusions of this work are confirmed by the results of the computational experiments.
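The baseline computation being optimized, reduction of a general matrix to Hessenberg form via Householder transformations, can be reproduced through SciPy's LAPACK-backed wrapper, which calls the same dgehrd routine as the LAPACKE_dgehrd() function used for comparison; a small sketch:

```python
import numpy as np
from scipy.linalg import hessenberg

A = np.random.rand(6, 6)
H, Q = hessenberg(A, calc_q=True)   # wraps LAPACK dgehrd internally

# H is upper Hessenberg (zero below the first subdiagonal),
# and A = Q H Q^T up to rounding error.
assert np.allclose(np.tril(H, -2), 0)
assert np.allclose(Q @ H @ Q.T, A)
```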
- Research Article
- 10.1016/j.cpc.2023.108851
- Jul 11, 2023
- Computer Physics Communications
- Hongwei Chen + 3 more
A high-performance implementation of atomistic spin dynamics simulations on x86 CPUs
- Research Article
- 10.1002/spe.3214
- May 14, 2023
- Software: Practice and Experience
- Braedy Kuzma + 6 more
Abstract The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. The algorithm for the tiling and packing layers is target independent but is parameterized to the memory hierarchy and register-file size. The creation of high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand-optimization task is complicated by the recent introduction of matrix engines, such as IBM's Matrix Multiply Assist (MMA), Intel's Advanced Matrix eXtensions (AMX), and Arm's Matrix Extensions (ME), that deliver high-performance matrix operations. This article presents a compiler-only alternative to the use of high-performance libraries by incorporating, to the best of our knowledge and for the first time, the automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The parameterization of the tiling and packing layers is demonstrated in the generation of code for the MMA unit on IBM's POWER10. This article also describes an algorithm that lowers the matrix-multiply intrinsic to the MMA unit. The use of intrinsics enables a comprehensive performance study. On processors without hardware matrix engines, the tiling and packing deliver performance faster than PLuTo, a widely used polyhedral optimizer, both for small matrices (Intel) and for large matrices (POWER9). The performance also approaches that of high-performance libraries: it is only slightly slower than OpenBLAS and on par with Eigen for large matrices. With MMA in POWER10 this solution is, for large matrices, faster than the vector-extension solution, matches Eigen's performance, and reaches a substantial fraction of BLAS peak performance.
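The layered approach described above is easy to sketch in outline: target-independent tiling and packing loops wrapped around a small micro kernel. The Python sketch below is a schematic with illustrative tile-size parameters, not the code LLVM generates:

```python
import numpy as np

def gemm_tiled(A, B, mc=64, nc=64, kc=64):
    """Layered GEMM: tiling/packing layers around a micro-kernel call."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for jc in range(0, n, nc):
        for pc in range(0, k, kc):
            Bp = np.ascontiguousarray(B[pc:pc + kc, jc:jc + nc])      # pack B tile
            for ic in range(0, m, mc):
                Ap = np.ascontiguousarray(A[ic:ic + mc, pc:pc + kc])  # pack A tile
                # Micro kernel: the only layer that is hand-tuned per target.
                C[ic:ic + mc, jc:jc + nc] += Ap @ Bp
    return C

A = np.random.rand(200, 150)
B = np.random.rand(150, 180)
assert np.allclose(gemm_tiled(A, B), A @ B)
```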
- Research Article
- 10.1007/s42484-021-00048-8
- Jun 18, 2021
- Quantum Machine Intelligence
- Liming Zhao + 3 more
Efficiently processing basic linear algebra subroutines is of great importance for a wide range of computational problems. In this paper, we consider techniques to implement matrix functions on a quantum computer. We embed the given matrices into Hermitian matrices three times their size and assume as input a given set of unitary operators generated by the embedding matrices. With the matrix-embedding formula, we give Trotter-based quantum subroutines for elementary matrix operations, including addition, multiplication, the Kronecker sum, the tensor product, the Hadamard product, and single-matrix functions with arbitrary real eigenvalues. We then discuss composed matrix functions in terms of the estimation of scalar quantities such as inner products, traces, determinants, and Schatten p-norms with bounded errors. We thus provide a framework for compiling instructions for linear algebraic computations into gate sequences on actual quantum computers. The framework for calculating the matrix functions is more efficient than the best classical counterpart for a set of matrices.
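The paper's specific three-fold embedding is not reproduced here, but the underlying trick, dilating an arbitrary matrix into a larger Hermitian one so that it generates unitary dynamics, can be illustrated with the standard two-fold Hermitian dilation; a minimal sketch:

```python
import numpy as np

def hermitian_dilation(A):
    """Embed an arbitrary n x n matrix A into the 2n x 2n
    Hermitian matrix [[0, A], [A^dagger, 0]]."""
    n = A.shape[0]
    Z = np.zeros((n, n), dtype=complex)
    return np.block([[Z, A], [A.conj().T, Z]])

A = np.array([[1, 2j], [3, 4]], dtype=complex)   # not Hermitian
H = hermitian_dilation(A)
assert np.allclose(H, H.conj().T)   # H is Hermitian, so exp(-iHt) is unitary
# The top-right block of H recovers A; the eigenvalues of H are
# plus/minus the singular values of A.
```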
- Research Article
- 10.47037/2020.aces.j.351102
- Feb 3, 2021
- Applied Computational Electromagnetics Society
- John Shaeffer
Basic Linear Algebra Subroutines (BLAS) are well-known low-level workhorse subroutines for linear algebra vector-vector, matrix-vector and matrix-matrix operations on full-rank matrices. With the advent of block low-rank (Rk) full-wave direct solvers, where most blocks of the system matrix are Rk, an extension to the BLAS III matrix-matrix workhorse routine is needed due to the agony of Rk addition. This note outlines the problem that BLAS III poses for Rk LU and solve operations and then outlines an alternative approach, which we will call BLAS IV. This approach utilizes the thrill of Rk matrix-matrix multiplication and uses the Adaptive Cross Approximation (ACA) as a methodology to evaluate sums of Rk terms, thereby circumventing the agony of low-rank addition.
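The "agony of Rk addition" is visible in a few lines: summing two rank-k factorizations exactly doubles the rank, so a recompression step is required to restore a compact representation. In the hypothetical sketch below, a truncated SVD stands in for the ACA-based recompression the note advocates:

```python
import numpy as np

def lowrank_add(U1, V1, U2, V2, tol=1e-10):
    """Add U1 @ V1.T + U2 @ V2.T and recompress to minimal rank."""
    U = np.hstack([U1, U2])            # exact sum: rank grows to k1 + k2
    V = np.hstack([V1, V2])
    Q, R = np.linalg.qr(U)             # recompress the stacked factors
    W, s, Zt = np.linalg.svd(R @ V.T, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))    # numerical rank after truncation
    return Q @ W[:, :r] * s[:r], Zt[:r].T

k, n = 3, 50
U1, V1 = np.random.rand(n, k), np.random.rand(n, k)
U2, V2 = np.random.rand(n, k), np.random.rand(n, k)
U, V = lowrank_add(U1, V1, U2, V2)     # rank 2k here, since the sum is generic
assert np.allclose(U @ V.T, U1 @ V1.T + U2 @ V2.T)
```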
- Research Article
- 10.1103/physreva.102.032410
- Sep 16, 2020
- Physical Review A
- Xi He
Correlation alignment (CORAL), a representative domain adaptation (DA) algorithm, decorrelates and aligns a labelled source-domain dataset to an unlabelled target-domain dataset to minimize the domain shift, such that a classifier trained on the source can be applied to predict the target-domain labels. In this paper, we implement CORAL on quantum devices using two different methods. One method utilizes quantum basic linear algebra subroutines (QBLAS) to implement CORAL with exponential speedup in the number and dimension of the given data samples. The other is achieved through a variational hybrid quantum-classical procedure. In addition, numerical experiments with three different types of data sets, namely synthetic data, synthetic-Iris data, and handwritten digit data, are presented to evaluate the performance of our work. The simulation results show that the variational quantum correlation alignment algorithm (VQCORAL) can achieve competitive performance compared with the classical CORAL.
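Classical CORAL, the starting point for both quantum methods, is itself only a few lines of dense linear algebra, whitening the source covariance and re-coloring with the target covariance, which is what makes it a natural candidate for QBLAS acceleration. A sketch of the classical algorithm (not the quantum implementation), with a small regularization term assumed:

```python
import numpy as np

def _matpow(C, p):
    """Matrix power of a symmetric positive-definite matrix via eigh."""
    w, V = np.linalg.eigh(C)
    return (V * w**p) @ V.T

def coral(Xs, Xt, eps=1e-5):
    """Align source features Xs (ns, d) to target features Xt (nt, d)."""
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    Xs_white = Xs @ _matpow(Cs, -0.5)    # decorrelate the source
    return Xs_white @ _matpow(Ct, 0.5)   # re-color with the target covariance

Xs = np.random.rand(100, 5)
Xt = 2.0 * np.random.rand(80, 5)
Xs_aligned = coral(Xs, Xt)               # ready for training a classifier
```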
- Research Article
- 10.1145/3378671
- May 19, 2020
- ACM Transactions on Mathematical Software
- Gianluca Frison + 3 more
Basic Linear Algebra Subroutines For Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an application programming interface (API) which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open source and proprietary. This article investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this article investigates the benefits in scientific programming languages such as Octave, SciPy, and Julia.
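The article's closing observation, that a standard BLAS API can benefit scientific programming languages, is easy to picture in SciPy, one of the environments named: BLAS routines are reachable through thin wrappers. A generic illustration of the standard API (not of BLASFEO itself):

```python
import numpy as np
from scipy.linalg.blas import dgemm

A = np.asfortranarray(np.random.rand(8, 8))   # column-major, as BLAS expects
B = np.asfortranarray(np.random.rand(8, 8))
C = dgemm(alpha=1.0, a=A, b=B)                # C = alpha * A @ B
assert np.allclose(C, A @ B)
```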
- Research Article
- 10.1109/tpds.2019.2940192
- Mar 1, 2020
- IEEE Transactions on Parallel and Distributed Systems
- Tao Zhang + 3 more
Tensors are the cornerstone data structures in high-performance computing, big data analysis and machine learning. However, tensor computations are compute-intensive and the running time increases rapidly with the tensor size. Therefore, designing high-performance primitives on parallel architectures such as GPUs is critical for the efficiency of ever growing data processing demands. Existing GPU basic linear algebra subroutines (BLAS) libraries (e.g., NVIDIA cuBLAS) do not provide tensor primitives. Researchers have to implement and optimize their own tensor algorithms in a case-by-case manner, which is inefficient and error-prone. In this paper, we develop the cuTensor-tubal library of seven key primitives for the tubal-rank tensor model on GPUs: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. cuTensor-tubal adopts a frequency domain computation scheme to expose the separability in the frequency domain, then maps the tube-wise and slice-wise parallelisms onto the single instruction multiple thread (SIMT) GPU architecture. To achieve good performance, we optimize the data transfer, memory accesses, and design the batched and streamed parallelization schemes for tensor operations with data-independent and data-dependent computation patterns, respectively. In the evaluations of t-product, t-SVD, t-QR, t-inverse and t-normalization, cuTensor-tubal achieves maximum speedups of 16.91×, 27.03×, 38.97×, 22.36×, and 15.43×, respectively, over the CPU implementations running on dual 10-core Xeon CPUs. Two applications, namely t-SVD-based video compression and low-tubal-rank tensor completion, are tested using our library and achieve maximum speedups of 9.80× and 269.26× over multi-core CPU implementations.
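The t-product underlying the tubal-rank model makes the frequency-domain separability mentioned above concrete: an FFT along the tubes, an independent matrix multiply per frequency slice, and an inverse FFT. A serial NumPy sketch (the GPU library parallelizes the per-slice multiplies):

```python
import numpy as np

def t_product(A, B):
    """t-product of tensors A (m, l, n) and B (l, p, n) along the tube axis n."""
    Ah = np.fft.fft(A, axis=2)              # transform tubes to the frequency domain
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('ikn,kjn->ijn', Ah, Bh)  # independent matmul per frequency slice
    return np.real(np.fft.ifft(Ch, axis=2))

A = np.random.rand(4, 3, 5)
B = np.random.rand(3, 2, 5)
C = t_product(A, B)                          # shape (4, 2, 5)
```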
- Research Article
- 10.1007/s10586-019-02927-z
- Apr 2, 2019
- Cluster Computing
- Sandra Catalán + 4 more
We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of the basic linear algebra subroutines (BLAS). The proposed approach is also different from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified at a high level of abstraction, the actual implementation can be easily derived from them, paving the road to a high-performance implementation of a considerable fraction of the linear algebra package (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
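The bottleneck that static look-ahead addresses is visible in the structure of any blocked factorization: a small sequential panel step followed by a large BLAS-rich trailing update. The serial blocked Cholesky below shows that structure only; the paper's technique would overlap the next panel with the current trailing update across threads:

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Blocked lower-triangular Cholesky of an SPD matrix A (serial reference)."""
    n = A.shape[0]
    L = np.tril(A.copy())
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel factorization: the sequential bottleneck that static
        # look-ahead overlaps with the previous trailing update.
        L[k:e, k:e] = np.linalg.cholesky(L[k:e, k:e])
        L[e:, k:e] = L[e:, k:e] @ np.linalg.inv(L[k:e, k:e]).T
        # Trailing update: the bulk of the flops, realized via BLAS (a SYRK).
        L[e:, e:] -= np.tril(L[e:, k:e] @ L[e:, k:e].T)
    return L

A = np.random.rand(8, 8); A = A @ A.T + 8 * np.eye(8)   # make A SPD
assert np.allclose(blocked_cholesky(A), np.linalg.cholesky(A))
```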
- Research Article
- 10.1016/j.jocs.2019.02.007
- Mar 20, 2019
- Journal of Computational Science
- Filip Pawłowski + 2 more
A multi-dimensional Morton-ordered block storage for mode-oblivious tensor computations
- Research Article
- 10.1145/3210754
- Jul 31, 2018
- ACM Transactions on Mathematical Software
- Gianluca Frison + 4 more
Basic Linear Algebra Subroutines for Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and in small-scale high-performance computing in general. A key difference with respect to existing high-performance implementations of BLAS is that the computational performance is optimized for small- to medium-scale matrices, i.e., for sizes up to a few hundred. BLASFEO comes with three different implementations: a high-performance implementation aimed at providing the highest performance for matrices fitting in cache, a reference implementation providing portability and embeddability and optimized for very small matrices, and a wrapper to standard BLAS and LAPACK providing high performance on large matrices. Together, the three implementations of BLASFEO provide high-performance dense linear algebra routines for matrices ranging from very small to large. Compared to both open-source and proprietary highly tuned BLAS libraries, for matrices of size up to about 100, the high-performance implementation of BLASFEO is about 20-30% faster than the corresponding Level 3 BLAS routines and two to three times faster than the corresponding LAPACK routines.
- Research Article
- 10.1007/s11075-018-0500-8
- Mar 1, 2018
- Numerical Algorithms
- Rafael Rodríguez-Sánchez + 4 more
We address the reduction to compact band forms, via unitary similarity transformations, for the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD). Concretely, in the first case, we revisit the reduction to symmetric band form, while, for the second case, we propose a similar alternative, which transforms the original matrix to (unsymmetric) band form, replacing the conventional reduction method that produces a triangular–band output. In both cases, we describe algorithmic variants of the standard Level 3 Basic Linear Algebra Subroutines (BLAS)-based procedures, enhanced with look-ahead, to overcome the performance bottleneck imposed by the panel factorization. Furthermore, our solutions employ an algorithmic block size that differs from the target bandwidth, illustrating the important performance benefits of this decision. Finally, we show that our alternative compact band form for the SVD is key to introduce an effective look-ahead strategy into the corresponding reduction procedure.
- Research Article
- 10.1016/j.jocs.2018.01.007
- Feb 20, 2018
- Journal of Computational Science
- Tingxing Dong + 3 more
Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs
- Research Article
- 10.1016/j.parco.2017.12.006
- Jan 4, 2018
- Parallel Computing
- Hartwig Anzt + 3 more
Variable-size batched Gauss–Jordan elimination for block-Jacobi preconditioning on graphics processors
- Research Article
- 10.18637/jss.v084.i04
- Jan 1, 2018
- Journal of Statistical Software
- Wagner Hugo Bonat
This article describes the R package mcglm implemented for fitting multivariate covariance generalized linear models (McGLMs). McGLMs provide a general statistical modeling framework for normal and non-normal multivariate data analysis, designed to handle multivariate response variables, along with a wide range of temporal and spatial correlation structures defined in terms of a covariance link function and a matrix linear predictor involving known symmetric matrices. The models take non-normality into account in the conventional way by means of a variance function, and the mean structure is modeled by means of a link function and a linear predictor. The models are fitted using an estimating function approach based on second-moment assumptions. This provides a unified approach to a wide variety of different types of response variables and covariance structures, including multivariate extensions of repeated measures, time series, longitudinal, genetic, spatial and spatio-temporal structures. The mcglm package allows a flexible specification of the mean and covariance structures, and explicitly deals with multivariate response variables, through a user friendly formula interface similar to the ordinary glm function. Illustrations in this article cover a wide range of applications from the traditional one response variable Gaussian mixed models to multivariate spatial models for areal data using the multivariate Tweedie distribution. Additional features, such as robust and bias-corrected standard errors for regression parameters, residual analysis, measures of goodness-of-fit and model selection using the score information criterion are discussed through six worked examples. The mcglm package is a full R implementation based on the Matrix package which provides efficient access to BLAS (basic linear algebra subroutines), Lapack (dense matrix), TAUCS (sparse matrix) and UMFPACK (sparse matrix) routines for efficient linear algebra in R.
- Research Article
- 10.1109/access.2018.2823299
- Jan 1, 2018
- IEEE Access
- M Usman Ashraf + 3 more
The emerging high-performance Exascale supercomputing systems, anticipated to be available around 2020, will unravel many scientific mysteries. These extraordinary systems will deliver a thousand-fold increase in computing power compared with the current Petascale systems, and will push development communities and researchers from conventional homogeneous frameworks toward heterogeneous ones that combine energy-efficient GPU devices with traditional CPUs. Current technologies face several challenges in reaching ExaFLOPS performance on such ultrascale systems. Massive parallelism is one of these challenges, and it requires a novel low-power parallel programming approach to attain high performance. This paper introduces a new parallel programming model that achieves massive parallelism by combining coarse-grained and fine-grained parallelism over inter-node and intra-node computation, respectively. The proposed framework, a tri-hybrid of MPI, OpenMP, and the compute unified device architecture (MOC), processes input data over a heterogeneous platform. We implemented the proposed model in a dense matrix multiplication application from linear algebra, and compared the measured metrics with well-known basic linear algebra subroutine libraries, namely the CUDA Basic Linear Algebra Subroutines (cuBLAS) library and the KAUST Basic Linear Algebra Subprograms (KBLAS). MOC outperformed all of the implemented methods and achieved high performance while consuming less power. The proposed MOC approach can be considered an initial, leading model for dealing with emerging Exascale computing systems.
- Research Article
- 10.1117/1.jrs.11.035009
- Aug 17, 2017
- Journal of Applied Remote Sensing
- Jingxiao Cai + 1 more
This study introduces a practical approach to implementing real-time signal processing algorithms for general surveillance radar based on NVIDIA graphics processing units (GPUs). The pulse compression algorithms are implemented using compute unified device architecture (CUDA) libraries such as the CUDA basic linear algebra subroutines (cuBLAS) and the CUDA fast Fourier transform (cuFFT) libraries, which are adopted from open-source libraries and optimized for NVIDIA GPUs. For more advanced, adaptive processing algorithms such as adaptive pulse compression, customized kernel optimization is needed and investigated. A statistical optimization approach is developed for this purpose that requires little knowledge of the physical configuration of the kernels. The kernel optimization approach is found to improve performance significantly. Benchmark performance is compared with CPU performance in terms of processing acceleration. The proposed implementation framework can be used in various radar systems, including ground-based phased array radar, airborne sense-and-avoid radar, and aerospace surveillance radar.
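Pulse compression, the first algorithm mapped onto the CUDA libraries here, is at its core a matched filter evaluated in the frequency domain. A CPU reference sketch in NumPy with an illustrative chirp (a GPU version would route the FFTs through cuFFT):

```python
import numpy as np

def pulse_compress(rx, chirp):
    """Matched-filter pulse compression of received samples against a chirp."""
    n = len(rx) + len(chirp) - 1            # zero-pad to avoid circular wrap
    RX = np.fft.fft(rx, n)
    H = np.conj(np.fft.fft(chirp, n))       # matched filter = conjugate spectrum
    return np.fft.ifft(RX * H)

t = np.linspace(0, 1e-5, 256)
chirp = np.exp(1j * np.pi * 4e11 * t**2)    # linear FM chirp (illustrative parameters)
rx = np.concatenate([np.zeros(100), chirp, np.zeros(100)])  # echo delayed by 100
y = pulse_compress(rx, chirp)
print(int(np.argmax(np.abs(y))))            # prints 100: the echo delay
```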
- Research Article
- 10.21042/amns.2017.1.00017
- Jun 22, 2017
- Applied Mathematics and Nonlinear Sciences
- José I Aliaga + 2 more
Abstract We present a prototype task-parallel algorithm for the solution of hierarchical symmetric positive definite linear systems via the ℋ-Cholesky factorization that builds upon the parallel programming standards and associated runtimes of OpenMP and OmpSs. In contrast with previous efforts, our proposal decouples the numerical aspects of the linear algebra operation from the complexities associated with high-performance computing. Our experiments provide an exhaustive analysis of the efficiency attained by different parallelization approaches that exploit either task-parallelism or loop-parallelism via a runtime. As an alternative, we also evaluate a solution that leverages multi-threaded parallelism via the parallel implementation of the Basic Linear Algebra Subroutines (BLAS) in Intel MKL.