Today’s hardware platforms have parallel processing capabilities and many parallel programming models have been developed. It is necessary to research an efficient implementation of compute-intensive applications using available platforms. Dense matrix-matrix multiplication is an important kernel that is used in many applications, while it is computationally intensive, especially for large matrix sizes. To improve the performance of this kernel, we implement it on the graphics processing unit (GPU) platform using the tiling technique with different tile sizes. Our experimental results show the tiling approach improves speed by 56.89% (2.32× faster) against straightforward (STF). And tile size of 32 has the highest speed compared to other tile sizes of 8 and 16.
Read full abstract