Abstract

Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.

Highlights

  • Multicore computing has led to a renaissance for shared memory parallel programming models

  • Our results demonstrate that locality-aware task scheduling significantly improves the performance of task parallel programming on Non-Uniform Memory Access (NUMA) systems: up to 2X over locality-oblivious scheduling within the same OpenMP implementation and up to 3X over Intel’s commercial implementation

  • We use HPCToolkit [3] to measure the time spent by all threads in executions of task parallel programs from the Barcelona OpenMP Tasks Suite (BOTS) [13]


Summary

Introduction

Multicore computing has led to a renaissance for shared memory parallel programming models. In a task parallel execution, tasks perform useful computation, while idle time results from load imbalance and overhead time includes task creation, scheduling, and synchronization. Coarsening the granularity of tasks can decrease overhead time, while using finer-grained tasks can decrease idle time. However, load imbalance and overhead do not account for all observed performance loss. This paper explores another major cause, work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work sequentially – which can dominate performance loss in some applications. We characterize the efficiency lost in the execution of task parallel computations to thread idleness, scheduler overhead, and work time inflation, and diagnose the sources of work time inflation. Our results demonstrate that locality-aware task scheduling significantly improves the performance of task parallel programming on NUMA systems: up to 2X over locality-oblivious scheduling within the same OpenMP implementation and up to 3X over Intel’s commercial implementation.
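The efficiency decomposition described above can be sketched as follows. This is a minimal illustration of the accounting, not the paper's measurement code (which uses HPCToolkit); the function name and all timing values are hypothetical. Work time inflation is the total work time across threads minus the time the same work takes sequentially.

```python
def decompose_lost_efficiency(serial_work, per_thread_work,
                              per_thread_idle, per_thread_overhead):
    """Break total thread time in a parallel run into work, idle, and
    overhead components, and compute work time inflation as the extra
    work time spent beyond the sequential execution of the same work.
    All times are in the same unit (e.g., seconds)."""
    work = sum(per_thread_work)          # total useful computation time
    idle = sum(per_thread_idle)          # time lost to load imbalance
    overhead = sum(per_thread_overhead)  # task creation/scheduling/sync
    inflation = work - serial_work       # work time inflation
    return {
        "work": work,
        "idle": idle,
        "overhead": overhead,
        "work_time_inflation": inflation,
    }

# Hypothetical 4-thread run: the same work that takes 100s sequentially
# takes 30s on each of 4 threads (120s total), so inflation is 20s.
breakdown = decompose_lost_efficiency(
    serial_work=100.0,
    per_thread_work=[30.0, 30.0, 30.0, 30.0],
    per_thread_idle=[2.0, 1.0, 0.0, 3.0],
    per_thread_overhead=[1.0, 1.0, 1.0, 1.0],
)
```

On a NUMA system, remote memory accesses are one concrete way this inflation arises: the same loads and stores take longer when the data lives on another socket's memory, so work time grows even though the amount of work is unchanged.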

Diagnosing sources of lost efficiency
Work time inflation and the impact of NUMA
First touch and scheduling
A framework for locality-based scheduling
A concise API for programmer-specified scheduling
Run time scheduling policy and implementation
Evaluation
Detailed performance measurement
Visualizing observed task schedules
Related work
Conclusions and future work
