Taming data locality for task scheduling under memory constraint in runtime systems

Maxime Gonthier,Loris Marchal,Samuel Thibault

doi:10.1016/j.future.2023.01.024

Abstract

A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a tasks processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D, 3D matrix products, Cholesky factorization, randomized task order, randomized data pairs from the 2D matrix product as well as a sparse matrix product. We introduce a visual way to understand these performance and lower bounds on the number of data loads for the 2D and 3D matrix products. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.

Full Text