Abstract

As core counts in processors increase, it becomes harder to schedule and distribute work in a timely and scalable manner. This article enhances the scalability of parallel loop schedulers by specializing schedulers for fine‐grain loops. We propose a low‐overhead work distribution mechanism for a static scheduler that uses no atomic operations. We integrate our static scheduler with the Intel OpenMP and Cilk Plus parallel task schedulers to build hybrid schedulers. Compiler support enables efficient reductions for Cilk without changing the programming interface of Cilk reducers. Detailed, quantitative measurements demonstrate that our techniques achieve scalable performance on a 48‐core machine, with scheduling overhead 43% lower than Intel OpenMP and 12.1× lower than Cilk. We demonstrate consistent performance improvements on a range of HPC and data analytics codes. Performance gains grow as loops become finer‐grain and thread counts increase. We consistently observe 16%–30% speedup on 48 threads, with a peak of 2.8× speedup.

Highlights

  • While Moore’s Law remains active, every new processor generation has an increasing number of CPU cores

  • Workers initialize local copies of reduction variables and execute work sent by the master

  • The master thread waits for the workers to complete, and partial results are reduced for reduction variables

  • For the parallel loop model, the worker threads are associated with a specific master, making some synchronization steps redundant


Introduction

While Moore’s Law remains active, every new processor generation has an increasing number of CPU cores. Scheduling and distributing workload on large-scale shared-memory machines becomes increasingly important for making efficient use of the hardware. The runtime overhead caused by scheduling, work distribution and synchronization [1] can make some parallel codes too fine-grain for parallel execution to be worthwhile. This overhead, which grows with the degree of parallelism, can limit the scalability of schedulers. This work focuses on fine-grain, microsecond-scale parallel loops, comparable in duration to the overhead of state-of-the-art schedulers on current hardware. We analyze commonly used loop scheduling techniques and propose a “half-barrier” pattern to remove redundant synchronization.

