Design of self‐adaptable data parallel applications on multicore clusters automatically optimized for performance and energy through load distribution

Ravi Reddy Manumachu, Alexey L Lastovetsky

doi:10.1002/cpe.4958

Abstract

Self‐adaptability is a highly preferred feature in HPC applications. A crucial building block of a self‐adaptable application is a data partitioning algorithm that must possess several essential qualities apart from low runtime and memory costs. On modern platforms composed of multicore CPUs, data partitioning algorithms striving to solve the bi‐objective optimization problem for performance and energy (BOPPE) face a formidable challenge. They must take into account the new complexities inherent in these platforms, such as severe resource contention and non‐uniform memory access (NUMA). Novel model‐based methods and data partitioning algorithms have been proposed that address the challenge. However, these methods take as input full functional performance and energy models (FPM and FEM), which have prohibitively high model construction costs. Therefore, they are not suitable for employment in self‐adaptable applications. In this paper, we present a self‐adaptable data partitioning algorithm called ADAPTALEPH, which solves BOPPE on homogeneous clusters of multicore CPUs. Unlike state‐of‐the‐art methods for solving BOPPE, which take full FPM and FEM as inputs, it constructs partial FPM and FEM during its execution using all the available processors. It returns a locally Pareto‐optimal set of solutions, which are the heterogeneous workload distributions that achieve inter‐node optimization of data‐parallel applications for performance and energy. We experimentally study the efficiency of ADAPTALEPH for three data‐parallel applications, i.e., matrix‐vector multiplication, matrix‐matrix multiplication, and fast Fourier transform, on a modern multicore CPU and in simulations of homogeneous clusters of such CPUs. We demonstrate that the locally Pareto‐optimal front approaches the globally Pareto‐optimal front as the number of points in the partial discrete FPM and FEM functions is increased. The number of points in the partial FPM/FEM at which the locally Pareto‐optimal front becomes the globally Pareto‐optimal front is considerably smaller than the number of points in the full FPM/FEM, which suggests developing methods that leverage this finding to drastically reduce model construction times.
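To make the notion of a Pareto‐optimal set of workload distributions concrete, the following is a minimal Python sketch. It is not ADAPTALEPH: it is a brute‐force enumeration over a small, already measured partial discrete model, assuming identical processors, parallel execution time equal to that of the slowest processor, and total energy equal to the sum over processors. The function names (`pareto_front`, `solve`) and the numbers in the example models are hypothetical and chosen only for illustration.

```python
from itertools import combinations_with_replacement


def pareto_front(points):
    """Keep points not dominated in (time, energy); both objectives are minimized.

    Each point is a tuple whose first two entries are time and energy.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def solve(n, p, time_model, energy_model):
    """Enumerate distributions of workload n over p identical processors.

    time_model and energy_model are dicts mapping a sampled workload size to a
    measured value (a partial discrete model). Only sampled sizes are used.
    """
    sizes = sorted(time_model)
    candidates = []
    # Processors are identical, so permutations of a distribution are equivalent.
    for dist in combinations_with_replacement(sizes, p):
        if sum(dist) != n:
            continue
        t = max(time_model[x] for x in dist)    # slowest processor sets the time
        e = sum(energy_model[x] for x in dist)  # energies add up
        candidates.append((t, e, dist))
    return pareto_front(candidates)


# Hypothetical partial models sampled at workload sizes 1, 2, and 3.
time_model = {1: 1.0, 2: 2.0, 3: 3.0}
energy_model = {1: 5.0, 2: 12.0, 3: 14.0}
front = solve(4, 2, time_model, energy_model)
```

With these made‐up models, both the even split (2, 2) and the uneven split (1, 3) survive: the first is faster, the second consumes less energy. This illustrates why a heterogeneous distribution can be Pareto‐optimal even on homogeneous processors when the energy function is not linear in workload size.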

Full Text