A resource-aware workload scheduling method for unbalanced GEMMs on GPUs

Hangda Liu, Boyu Diao, Wenxin Chen, Yongjun Xu

doi:10.1093/comjnl/bxae110

Abstract

GEMM (General Matrix Multiplication) serves as a fundamental operator for deep learning computations. In attention-based deep learning models such as BERT, GPT, and SAM in particular, the sizes of the matrices involved in GEMMs are unevenly distributed because input sizes vary, which leads to low utilization of hardware resources. To address this issue, this paper proposes inserting a novel GEMM processing layer into the deep learning inference stack and using an adaptive load-balancing method to partition and schedule GEMM computation tasks. The method draws on hardware runtime resource information, such as the occupancy of computing units. Experimental results show that the method performs well in unbalanced-input GEMM scenarios, achieving an average performance improvement of 2.3x. It also performs well in attention-based models (GPT-2 and SAM), achieving an average inference speed improvement of 1.1x. These findings highlight the effectiveness of resource-aware algorithm optimization, especially for computation task scheduling.
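To make the idea of partitioning and scheduling unbalanced GEMMs concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a static FLOP-count cost model in place of the runtime occupancy information the abstract mentions, and it uses a generic greedy longest-processing-time heuristic. The names `tile_tasks`, `schedule`, and `naive_loads`, the 128x128 tile size, and the unit count are all hypothetical choices made for illustration.

```python
import heapq
from math import ceil


def tile_tasks(gemms, tile_m=128, tile_n=128):
    """Split each (M, N, K) GEMM into output tiles.

    Each task is (cost, gemm_id, tile_row, tile_col), where cost is the
    approximate FLOP count of that tile (an assumed static cost model).
    """
    tasks = []
    for gid, (m, n, k) in enumerate(gemms):
        for i in range(ceil(m / tile_m)):
            for j in range(ceil(n / tile_n)):
                rows = min(tile_m, m - i * tile_m)
                cols = min(tile_n, n - j * tile_n)
                tasks.append((2 * rows * cols * k, gid, i, j))
    return tasks


def schedule(tasks, num_units):
    """Greedy assignment: largest task first, always to the least-loaded unit."""
    heap = [(0, u) for u in range(num_units)]  # (current load, unit id)
    loads = [0] * num_units
    plan = [[] for _ in range(num_units)]
    for cost, gid, i, j in sorted(tasks, reverse=True):
        load, u = heapq.heappop(heap)
        plan[u].append((gid, i, j))
        loads[u] = load + cost
        heapq.heappush(heap, (loads[u], u))
    return plan, loads


def naive_loads(gemms, num_units):
    """Baseline: assign each whole GEMM to a unit in round-robin order."""
    loads = [0] * num_units
    for idx, (m, n, k) in enumerate(gemms):
        loads[idx % num_units] += 2 * m * n * k
    return loads


if __name__ == "__main__":
    # One large GEMM and three small ones: an unbalanced batch.
    gemms = [(1024, 1024, 64), (128, 128, 64), (128, 128, 64), (128, 128, 64)]
    _, balanced = schedule(tile_tasks(gemms), num_units=4)
    print("naive max load:   ", max(naive_loads(gemms, 4)))
    print("balanced max load:", max(balanced))
```

In this toy batch the naive baseline leaves one unit holding the entire large GEMM while the others finish almost immediately, whereas tiling first lets the greedy scheduler spread the work nearly evenly. A resource-aware scheduler of the kind the abstract describes would presumably replace the static cost with measured runtime information.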

More From: The Computer Journal
