This article explores and evaluates variants of state-of-the-art work distribution schemes adapted for scientific applications running on hybrid systems. A hybrid implementation (multi-GPU and multi-CPU) of the NASA Parallel Benchmarks - MutiZone (NPB-MZ) benchmarks is described to study the different elements that condition the execution of this suite of applications when parallelism is spread over a set of computing units (CUs) of different computational power (e.g., GPUs and CPUs). This article studies the influence of the work distribution schemes on the data placement across the devices and the host, which in turn determine the communications between the CUs, and evaluates how the schedulers are affected by the relationship between data placement and communications. We show that only the schedulers aware of the different computational power of the CUs and minimize communications are able to achieve an appropriate work balance and high performance levels. Only then does the combination of GPUs and CPUs result in an effective parallel implementation that boosts the performance of a non-hybrid multi-GPU implementation. The article describes and evaluates the schedulers static-pcf , Guided, and Clustered Guided to solve the previously mentioned limitations that appear in hybrid systems. We compare them against state-of-the-art static and memorizing dynamic schedulers. Finally, on a system with an AMD EPYC 7742 at 2.250GHz (64 cores, 2 threads per core, 128 threads) and two AMD Radeon Instinct MI50 GPUs with 32GB, we have observed that hybrid executions speed up from 1.1 × up to 3.5 × with respect to a non-hybrid GPU implementation.
Read full abstract