Convolution is computationally intensive and demands substantial compute capability. Among hardware platforms, the field-programmable gate array (FPGA) emerges as a promising solution owing to its abundant parallelism and energy efficiency. Moreover, convolution can be implemented with different algorithms, including the conventional, general matrix–matrix multiplication (GEMM), Winograd, and fast Fourier transform (FFT) algorithms, which differ in arithmetic complexity, resource requirements, and other characteristics. Different convolutional neural network (CNN) models have different topologies and structures and thus favor different convolution algorithms. In response, software libraries such as cuDNN provide a variety of computational primitives to support these algorithms. However, supporting such libraries on FPGAs is challenging. First, multiple algorithms can share the FPGA resources spatially as well as temporally, introducing either reconfiguration overhead or resource underutilization. Second, FPGA implementation remains a major challenge for library developers, as it typically requires specialized hardware knowledge. In this article, we propose <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace>, an efficient and scalable convolution algorithm library on FPGAs. To coordinate multiple convolution algorithms on an FPGA, we develop three scheduling schemes: 1) spatial; 2) temporal; and 3) hybrid, which exhibit different tradeoffs between latency and throughput. We explore these schedulings by balancing the reconfiguration overhead, resource utilization, and optimization objectives of the CNNs. We further provide efficient and tunable algorithm templates that enable performance tuning through performance and resource models.
To aid users, <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace> exposes a set of interfaces that support high-level application design. We demonstrate the usability of <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace> with state-of-the-art CNNs. <monospace xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">FCNNLib</monospace> achieves up to <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$44.6\times $ </tex-math></inline-formula> and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.76\times $ </tex-math></inline-formula> higher energy efficiency in various scenarios than software libraries for CPUs and GPUs, respectively.
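As a concrete illustration of the complexity differences mentioned above (this sketch is not part of FCNNLib and uses hypothetical helper names), the well-known Winograd F(2×2, 3×3) transform computes a 2×2 output tile with a 4×4 element-wise product, i.e., 16 multiplications per tile versus 36 for direct 3×3 convolution:

```python
def direct_mults(out_h, out_w, k=3):
    # Direct convolution: one k*k dot product per output pixel.
    return out_h * out_w * k * k

def winograd_f2x2_3x3_mults(out_h, out_w):
    # Winograd F(2x2, 3x3): each 2x2 output tile costs a 4x4
    # element-wise product, i.e., 16 multiplications per tile.
    tiles = (out_h // 2) * (out_w // 2)
    return tiles * 16

if __name__ == "__main__":
    h = w = 56  # a typical CNN feature-map size
    d = direct_mults(h, w)
    wg = winograd_f2x2_3x3_mults(h, w)
    print(d, wg, d / wg)  # Winograd needs 36/16 = 2.25x fewer multiplications
```

This 2.25× reduction in multiplications trades against larger transform logic and higher on-chip buffering, which is one reason different CNN layers and resource budgets favor different algorithms.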