Impact study of data locality on task-based applications through the Heteroprio scheduler.

Bérenger Bramas

doi:10.7717/peerj-cs.190

Abstract

The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component: it solves the DAG scheduling problem in order to select the right processing unit for the computation of each task. In this work, we extend our Heteroprio scheduler, which was originally created to execute the fast multipole method on multi-GPU nodes. We improve Heteroprio by taking data locality into account during task distribution. The main principle is to use different task-lists for the different memory nodes and to investigate how locality affinity between the tasks and the different memory nodes can be evaluated without looking at the tasks’ dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvement and cut the total memory transfer of an execution by more than half.

Highlights

High-performance computing (HPC) is crucial for making advances and discoveries in numerous domains.
Automatic DLAF selection: we propose several data locality affinity formulas (DLAF), but only one of them is used to find the best memory node when a newly ready task is pushed into the scheduler.
We have created different formulas to evaluate the locality of a task with respect to a memory node, and we found that formulas that omit many parameters perform poorly; this is probably because they neglect the type of access the tasks make to the data.

Summary

INTRODUCTION

High-performance computing (HPC) is crucial for making advances and discoveries in numerous domains. The contributions of this paper are as follows:
We summarize the main ideas of the Heteroprio scheduler and explain how it can be implemented in a simple and efficient manner;
We propose new mechanisms to include data locality in the Heteroprio scheduler’s decision model;
We define different formulas to express the locality affinity of a given task relative to the different memory nodes. Those formulas are based on general information regarding the hardware or the data accesses;
We evaluate our approach on two linear algebra applications, QrMumps and SpLDLT, and a stencil application, and analyze the effect of the different parameters.
We evaluate our approach in the section “Performance Study” by plugging LAHeteroprio into StarPU to execute two different linear algebra applications using up to four GPUs.
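
To make the idea of a locality affinity formula more concrete, the following minimal Python sketch shows one possible way to score a ready task against each memory node and push it into that node's task list. It is not one of the DLAF defined in the paper: the scoring rule (bytes already present on the node, with written data weighted more heavily), the `write_weight` parameter, and all names (`Access`, `Task`, `affinity`, `best_node`, `push_ready`) are assumptions made for illustration. It also ignores Heteroprio's per-task-type buckets and priorities, and it only uses information about the task's own data accesses, not its dependencies.

```python
from collections import deque
from dataclasses import dataclass

READ, WRITE = "R", "W"


@dataclass
class Access:
    handle: str  # hypothetical identifier of a piece of data
    size: int    # size in bytes
    mode: str    # READ or WRITE


@dataclass
class Task:
    name: str
    accesses: list


def affinity(task, node, location, write_weight=2.0):
    """Illustrative score (an assumption, not the paper's formula):
    bytes of the task's data already on `node`, writes weighted more."""
    score = 0.0
    for a in task.accesses:
        if node in location.get(a.handle, set()):
            score += a.size * (write_weight if a.mode == WRITE else 1.0)
    return score


def best_node(task, nodes, location):
    # max() keeps the first maximal element, so ties go to the
    # earliest node in `nodes`.
    return max(nodes, key=lambda n: affinity(task, n, location))


def push_ready(task, queues, location):
    """Push a newly ready task into the list of its best memory node."""
    node = best_node(task, list(queues), location)
    queues[node].append(task)
    return node


# Hypothetical state: which memory nodes hold a valid copy of each datum.
location = {"A": {"gpu0"}, "B": {"cpu", "gpu1"}, "C": {"gpu1"}}
# One task list per memory node.
queues = {"cpu": deque(), "gpu0": deque(), "gpu1": deque()}

t = Task("gemm", [Access("A", 100, READ),
                  Access("B", 100, READ),
                  Access("C", 100, WRITE)])
chosen = push_ready(t, queues, location)
print(chosen)  # gpu1: it holds B (read) and C (written)
```

In this example the scores are 100 for `cpu`, 100 for `gpu0` and 300 for `gpu1`, so the task lands in the `gpu1` list. Setting `write_weight=1.0` gives a formula that ignores the access type, which is the kind of simplification the highlights above report as performing poorly.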

BACKGROUND

Findings

CONCLUSION