METHOD OF FAST MATRIX MULTIPLICATION UNDER ARM ARCHITECTURE USING SIMD INSTRUCTIONS

Ivan A Dychka, Vasyl Ya Yurchyshyn, Denys A Vinnyk, Yuriy V Bukhtiyarov

doi:10.20535/kpi-sn.2020.2.205115

Abstract

Background. Matrix multiplication is a computationally expensive operation that involves a large number of arithmetic operations. An additional problem is the non-sequential memory traversal of the matrices. Matrix multiplication is widely used in various fields, such as neural networks, solving systems of linear equations, and matrix transformations. It is therefore important to develop a matrix multiplication method that takes into account how the matrices are laid out in memory and manages data effectively when it is reused.

Objective. The purpose of the paper is to develop a method for fast multiplication of two matrices, for multiplying a matrix by a transposed matrix, and for multiplying a matrix by a list of vectors (including the special case of a single vector), and to implement it as a function optimized for ARM architecture processors. The function must be able to handle different data types and submatrices. The integer result can be scaled.

Methods. The main ideas of the developed method are working with several rows/columns of the input matrices simultaneously and splitting the matrices into blocks, which allows the algorithm to keep operating on the same region of memory for a while. The C programming language was chosen for the implementation. SIMD instructions were used to increase performance. Memory preloading also has to be organized properly for an effective implementation on the ARM architecture.

Results. As a result of the study, a function that performs matrix multiplication by the developed method with the necessary parameters was implemented. Tests on various matrix sizes and data types have shown that the implemented function is faster than its analogues from the OpenCV2 and Eigen 3 libraries. Testing was done with the vipmed utility for running and measuring functions, which was developed for enterprise use at VIT.

Conclusions. The proposed matrix multiplication method gives the expected acceleration of matrix multiplication operations, has passed the evaluation tests for use, and meets the target requirements. Further work should study the influence of the different cache levels in more detail and compare the method with other existing libraries.

Highlights

In everyday life, matrices are used far more widely than people tend to think
Matrix multiplication is widely used when working with neural networks
We describe the main problems of implementing matrix multiplication efficiently

Summary

Background

Matrix multiplication is a computationally expensive operation that involves a large number of arithmetic operations. It is important to develop a matrix multiplication method that takes into account how the matrices are laid out in memory and manages data effectively when it is reused. The purpose of the paper is to develop a method for fast multiplication of two matrices, for multiplying a matrix by a transposed matrix, and for multiplying a matrix by a list of vectors (including the special case of a single vector), and to implement it as a function optimized for ARM architecture processors. The main ideas of the developed method are working with several rows/columns of the input matrices simultaneously and splitting the matrices into blocks, which allows the algorithm to keep operating on the same region of memory for a while. As a result of the study, a function that performs matrix multiplication by the developed method with the necessary parameters was implemented.
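
The two ideas named above, splitting the matrices into blocks and working with several rows at once, can be illustrated with a minimal scalar sketch in C. It is not the authors' implementation: the function name matmul_blocked, the block size BLOCK, the choice of two rows per pass, and the float row-major layout are all assumptions made for illustration, and the SIMD instructions and memory preloading described in the paper are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block size; the paper does not state the value it uses. */
#define BLOCK 32

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/*
 * Illustrative sketch only: C (m x n) = A (m x k) * B (k x n).
 * All matrices are assumed to be contiguous, row-major float arrays.
 */
static void matmul_blocked(const float *a, const float *b, float *c,
                           size_t m, size_t k, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    memset(c, 0, m * n * sizeof *c);

    /* Walk the matrices block by block so that the same small region
       of A, B and C is reused before moving on. */
    for (size_t i0 = 0; i0 < m; i0 += BLOCK) {
        size_t i1 = min_size(i0 + BLOCK, m);
        for (size_t p0 = 0; p0 < k; p0 += BLOCK) {
            size_t p1 = min_size(p0 + BLOCK, k);
            for (size_t j0 = 0; j0 < n; j0 += BLOCK) {
                size_t j1 = min_size(j0 + BLOCK, n);
                size_t i = i0;

                /* Two rows of A per pass: every element loaded from B
                   is used for two output rows. */
                for (; i + 1 < i1; i += 2) {
                    for (size_t p = p0; p < p1; ++p) {
                        float a0 = a[i * k + p];
                        float a1 = a[(i + 1) * k + p];
                        const float *brow = b + p * n;
                        for (size_t j = j0; j < j1; ++j) {
                            float bv = brow[j];
                            c[i * n + j] += a0 * bv;
                            c[(i + 1) * n + j] += a1 * bv;
                        }
                    }
                }

                /* Leftover single row when the block height is odd. */
                for (; i < i1; ++i) {
                    for (size_t p = p0; p < p1; ++p) {
                        float a0 = a[i * k + p];
                        const float *brow = b + p * n;
                        for (size_t j = j0; j < j1; ++j)
                            c[i * n + j] += a0 * brow[j];
                    }
                }
            }
        }
    }
}
```

In this sketch the innermost loop reads B and writes C sequentially, which is the kind of loop that SIMD instructions can later process several elements at a time. The block size would normally be tuned to the cache sizes of the target processor.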

Introduction

Overview of the existing solutions

Description of the proposed method

Implementation using SIMD instructions

The results of the comparison of the proposed method with others

Conclusions