Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree

Iulia Știrb

doi:10.3390/computers7040066

Abstract

The paper presents a Non-Uniform Memory Access (NUMA)-aware compiler optimization for task-level parallel code. The optimization is based on Non-Uniform Memory Access—Balanced Task and Loop Parallelism (NUMA-BTLP) algorithm Ştirb, 2018. The algorithm gets the type of each thread in the source code based on a static analysis of the code. After assigning a type to each thread, NUMA-BTLP Ştirb, 2018 calls NUMA-BTDM mapping algorithm Ştirb, 2016 which uses PThreads routine pthread_setaffinity_np to set the CPU affinities of the threads (i.e., thread-to-core associations) based on their type. The algorithms perform an improve thread mapping for NUMA systems by mapping threads that share data on the same core(s), allowing fast access to L1 cache data. The paper proves that PThreads based task-level parallel code which is optimized by NUMA-BTLP Ştirb, 2018 and NUMA-BTDM Ştirb, 2016 at compile-time, is running time and energy efficiently on NUMA systems. The results show that the energy is optimized with up to 5% at the same execution time for one of the tested real benchmarks and up to 15% for another benchmark running in infinite loop. The algorithms can be used on real-time control systems such as client/server based applications which require efficient access to shared resources. Most often, task parallelism is used in the implementation of the server and loop parallelism is used for the client.

Highlights

Non-Uniform Memory Access (NUMA) systems overcome the drawbacks of the Uniform MemoryAccess (UMA) systems such as lack of performance with increasing the number of processing units and need for synchronization barriers to ensure the correctness of the shared memory accesses [1].NUMA systems have a memory subsystem for each CPU, considering that the CPU is placed on the last level in the tree-based memory hierarchy [2]
After applying a prototype of NUMA-BTLP [5] on the benchmark, threads are all set as side-by-side and mapped on the same core
The benchmark has two main programs. Both main programs create a thread which is considered of type autonomous, according to NUMA-BTLP [5]

Summary

Introduction

Non-Uniform Memory Access (NUMA) systems overcome the drawbacks of the Uniform MemoryAccess (UMA) systems such as lack of performance with increasing the number of processing units and need for synchronization barriers to ensure the correctness of the shared memory accesses [1].NUMA systems have a memory subsystem for each CPU, considering that the CPU is placed on the last level in the tree-based memory hierarchy [2]. Non-Uniform Memory Access (NUMA) systems overcome the drawbacks of the Uniform Memory. Access (UMA) systems such as lack of performance with increasing the number of processing units and need for synchronization barriers to ensure the correctness of the shared memory accesses [1]. While the number of processing units is continuously growing, NUMA systems avoid congested data traffic, which is one of the main factors that influence performance. Thread mapping algorithms have been developed to benefit from the advantage of NUMA systems of allowing the more cost effective local accesses, instead of the remote accesses [3]. Improved thread mapping requires several threads to access efficiently the same data [4]. Efficient access requires using the cache and the interconnection so that performance and energy consumption is improved [3]

Methods

Results

Discussion

Conclusion