Dense Matrix-matrix Multiplication Research Articles

Abstract Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices, with sizes smaller than 32, are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications such as big data analytics, machine learning, and high-order finite element methods (FEM). The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.
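To make the idea of a batched routine and the quoted V100 figures concrete, here is a minimal sketch in plain Python. It is not the authors' implementation and says nothing about their optimization techniques: the names `gemm_flops` and `batched_gemm` are hypothetical, the flop count uses the conventional 2·m·n·k for a dense GEMM, and the only numbers taken from the abstract are the rate of about 1600 gigaFLOP/s, the 95% fraction, and the matrix size 32.

```python
def gemm_flops(m, n, k):
    """Conventional flop count for C = A * B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def batched_gemm(a_batch, b_batch):
    """Reference batched GEMM (hypothetical helper): multiplies the i-th
    matrix of a_batch by the i-th matrix of b_batch in a single call.
    Matrices are lists of rows; no tuning is attempted."""
    if len(a_batch) != len(b_batch):
        raise ValueError("both batches must hold the same number of matrices")
    results = []
    for a, b in zip(a_batch, b_batch):
        if len(a[0]) != len(b):
            raise ValueError("inner dimensions do not match")
        rows, inner, cols = len(a), len(b), len(b[0])
        c = [[sum(a[i][p] * b[p][j] for p in range(inner))
              for j in range(cols)]
             for i in range(rows)]
        results.append(c)
    return results


# Arithmetic on the figures quoted in the abstract (V100, size 32, double precision).
achieved_gflops = 1600.0      # "about 1600 gigaFLOP/s"
fraction_of_peak = 0.95       # "95% of the theoretically derived peak"
implied_peak_gflops = achieved_gflops / fraction_of_peak

flops_per_gemm = gemm_flops(32, 32, 32)
gemms_per_second = achieved_gflops * 1e9 / flops_per_gemm

if __name__ == "__main__":
    print(f"implied derived peak: {implied_peak_gflops:.0f} gigaFLOP/s")
    print(f"flops per 32x32 GEMM: {flops_per_gemm}")
    print(f"32x32 GEMMs per second at the quoted rate: {gemms_per_second:.3g}")
```

Under these assumptions, the quoted rate implies a derived peak of roughly 1684 gigaFLOP/s and about 24 million 32×32 multiplications per second. That volume suggests why grouping the multiplications into one batched call, rather than issuing them one at a time, matters at these sizes.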


Abstract The use of commodity clusters in academic institutions as a cost-effective solution for the study of Parallel and Distributed Computing has been a well-accepted development since the success of the Beowulf Project at NASA. This paper aims to explore the effects of parallel computing on some programs in a Linux-based Beowulf cluster. The research project analyses the performance of selected parallel programs on the cluster in an effort to provide a parallel computing system for the practical study of Parallel and Distributed Computing. Assembling the cluster involved setting up a Fast Ethernet-based LAN of five (5) system units and installing Ubuntu Server on them. Compilers were installed for program execution, MPICH for distributed processing, Secure Shell (OpenSSH) for remote execution, and Network File System (NFS) for file system sharing. For performance analysis, two sets of parallel programs were executed on the cluster with a varying number of nodes, and their respective performance was documented. The first was a dense matrix-matrix multiplication program and the second was a program for finding the number of prime numbers in a given range. For both programs, it was observed that parallel speedup grows faster as the problem size increases (parallelism is more pronounced for larger problem sizes). It was also observed, in both programs, that for too small a problem size, parallelism comes with a penalty.
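The prime-counting experiment and the notion of parallel speedup can be illustrated with a small sketch. This is not the MPICH program used in the paper: it runs serially in plain Python and only shows how a range might be divided among five nodes and how speedup and efficiency are conventionally computed. All function names (`split_range`, `count_primes`, `speedup`, `efficiency`) and the example range are assumptions made for illustration.

```python
def is_prime(n):
    """Trial division; adequate for a small illustrative range."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def split_range(lo, hi, parts):
    """Split the half-open range [lo, hi) into `parts` contiguous chunks
    whose sizes differ by at most one, one chunk per (hypothetical) node."""
    base, extra = divmod(hi - lo, parts)
    chunks, start = [], lo
    for rank in range(parts):
        size = base + (1 if rank < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def count_primes(lo, hi):
    """Number of primes in [lo, hi); the work one node would do."""
    return sum(1 for n in range(lo, hi) if is_prime(n))


def speedup(t_serial, t_parallel):
    """Conventional definition: serial time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, nodes):
    """Speedup per node; 1.0 would be ideal scaling."""
    return speedup(t_serial, t_parallel) / nodes


# Five chunks, mirroring the five system units mentioned in the abstract.
chunks = split_range(0, 100, 5)
partial_counts = [count_primes(lo, hi) for lo, hi in chunks]
total = sum(partial_counts)  # in an MPI program this sum would be a reduction

if __name__ == "__main__":
    print(chunks)
    print(partial_counts, total)
```

In a distributed version, each node would compute one partial count and the results would be combined at the end. With a range this small, the fixed cost of starting processes and exchanging messages would likely outweigh the computation, which is one plausible reading of the penalty the abstract reports for small problem sizes.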


Articles published on Dense Matrix-matrix Multiplication

Matrix-matrix multiplication on graphics processing unit platform using tiling technique

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices

Accelerating Dense Matrix Computations with Effective Workload Partitioning on Heterogeneous Architectures

A Beowulf Cluster for Teaching and Learning
