OpenMP task scheduling strategies for multicore NUMA systems

Stephen L Olivier, Jan F Prins, Allan K Porterfield, Michael Spiegel, Kyle B Wheeler

doi:10.1177/1094342011434065

Abstract

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In order to evaluate scheduling strategies, we extended the open source Qthreads threading library to implement different scheduler designs, accepting OpenMP programs through the ROSE compiler. Our comprehensive performance study of diverse OpenMP task-parallel benchmarks compares seven different task-parallel run-time scheduler implementations on an Intel Nehalem multi-socket multicore system: our proposed hierarchical work-stealing scheduler, a per-core work-stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads round-robin scheduler. In addition, we compare our results against the Intel and GNU OpenMP implementations. Our hierarchical scheduling strategy leverages different scheduling methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, the scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In the performance evaluation, our Qthreads hierarchical scheduler is competitive on all benchmarks tested. On five of the seven benchmarks, it demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP run-time systems. Our run-time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.
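The hierarchical strategy described in the abstract can be illustrated with a small sequential simulation. This is a minimal sketch, not the Qthreads implementation: the names `Chip` and `run`, the binary task tree workload, the choice to steal from the oldest end of the fullest queue, and the steal size of one task per core are all assumptions made for illustration. It shows only the two ideas the abstract names: cores on a chip share one LIFO queue, and a chip with no work performs a single steal on behalf of all of its cores.

```python
from collections import deque


class Chip:
    """Cores that share a cache and, in this sketch, one LIFO task queue."""

    def __init__(self, cores):
        self.cores = cores
        self.queue = deque()  # right end holds the newest task

    def pop_newest(self):
        # LIFO order: a newly created child runs soon after its parent.
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        """Move up to one task per core from the victim's oldest end (assumed policy)."""
        moved = 0
        while victim.queue and moved < self.cores:
            self.queue.append(victim.queue.popleft())
            moved += 1
        return moved


def run(depth, n_chips=2, cores_per_chip=4):
    """Execute a binary task tree; return (tasks executed, remote steals)."""
    chips = [Chip(cores_per_chip) for _ in range(n_chips)]
    chips[0].queue.append(depth)  # the root task starts on chip 0
    executed = remote_steals = 0
    while any(c.queue for c in chips):
        for chip in chips:
            if not chip.queue:
                # One steal serves every core on this chip.
                others = [c for c in chips if c is not chip and c.queue]
                if others:
                    victim = max(others, key=lambda c: len(c.queue))
                    chip.steal_from(victim)
                    remote_steals += 1
            # Each core takes a turn; the turns are simulated one after another.
            for _ in range(chip.cores):
                task = chip.pop_newest()
                if task is None:
                    break
                executed += 1
                if task > 0:  # an inner node spawns two child tasks
                    chip.queue.append(task - 1)
                    chip.queue.append(task - 1)
    return executed, remote_steals
```

Calling `run(10)` executes all 2047 tasks of a depth-10 binary tree and reports how many chip-level steals were needed. Because each steal refills a whole chip rather than a single core, the number of remote steals stays well below the number of tasks executed, which is the effect the abstract attributes to stealing on behalf of all threads that share a cache.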

Journal: The International Journal of High Performance Computing Applications
Publication Date: Feb 7, 2012
Citations: 125

Similar Papers

Scheduling task parallelism on multi-socket multicore systems
Stephen L. Olivier ... Kyle B. Wheeler
31 May 2011

Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree
Iulia Știrb
Computers | VOL. 7
03 Dec 2018

Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms
João Vicente Ferreira Lima ... Issam Raïs
The International Journal of High Performance Computing Applications | VOL. 33
09 Aug 2018

Scalable Task Parallelism for NUMA
Andi Drebes ... Albert Cohen
11 Sep 2016
