Abstract

Task-parallel systems have been widely used to parallelize programs. They provide automatic load balancing, and programmers can easily parallelize sequential programs, including irregular ones, without considering the placement of tasks on physical processors. Despite the success of shared-memory task parallelism, task parallelism in large-scale distributed memory environments remains challenging. Our work focuses on the flexibility of the task model and the scalability of inter-node load balancing. General task models allow tasks to be suspended and resumed at any program point, and such a model enables flexible task scheduling for higher processor utilization, locality-aware task placement, and so on. To realize such a task model, a task must be represented as a thread---an execution context containing register values and stack frames---and thread migration must be implemented for inter-node load balancing. However, an existing thread migration scheme, iso-address, has a scalability limitation: each node requires virtual memory proportional to the total number of processors. In large-scale distributed memory environments, this results in virtual memory usage beyond the virtual address space limit of current 64-bit CPUs. Furthermore, this huge virtual memory consumption makes it impossible to implement one-sided work stealing with Remote Direct Memory Access (RDMA) operations. One-sided work stealing is a popular approach to highly efficient load balancing; therefore this limitation also constrains the scalability of distributed memory task parallelism. In this paper, we propose uni-address, a new thread management scheme for distributed memory task parallelism. It significantly reduces the virtual memory required for thread migration and enables RDMA-based work stealing. We implement a lightweight multithread library supporting RDMA-based work stealing based on the uni-address scheme, and demonstrate its lightweight thread operations and scalable work stealing on the Fujitsu FX10 supercomputing system with three benchmarks: Binary Task Creation, Unbalanced Tree Search, and an NQueens solver. As a result, we confirmed that all the benchmarks run with less than 144 KB of virtual memory for thread migration on each processor and achieve more than 95% parallel efficiency on 3840 processing cores, relative to the results on 480 processing cores.
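To make the scaling argument above concrete, the following minimal C sketch contrasts the virtual address space reserved for migratable thread stacks under the two schemes. It is not part of the paper: the 16 MiB per-stack reservation and the large-scale core count are illustrative assumptions; only the 3840-core figure and the under-144 KB uni-address figure come from the abstract.

/* Illustrative sketch (not from the paper): back-of-envelope comparison of
 * the virtual address space reserved for migratable thread stacks under
 * iso-address versus uni-address. The per-stack region size and the
 * large-scale core count are assumptions; the 3840-core and <144 KB
 * figures come from the abstract above. */
#include <stdio.h>
#include <stdint.h>

static void report(const char *label, uint64_t cores, uint64_t region_bytes) {
    /* iso-address: each node reserves a distinct region for every worker in
     * the system, so the per-node reservation grows linearly with system size. */
    uint64_t iso_per_node = cores * region_bytes;
    /* uni-address: the per-worker reservation is a small constant
     * (< 144 KB in the paper's experiments), independent of system size. */
    uint64_t uni_per_worker = 144 * 1024;

    printf("%s: iso-address ~ %.1f GiB per node, uni-address < %.0f KB per worker\n",
           label,
           (double)iso_per_node / (double)(1ULL << 30),
           (double)uni_per_worker / 1024.0);
}

int main(void) {
    const uint64_t region = 16ULL << 20;  /* assumed 16 MiB reserved per migratable stack */
    report("3840 cores (benchmark scale)", 3840ULL, region);
    report("10M cores (assumed larger system)", 10000000ULL, region);
    return 0;
}

Under these assumptions the iso-address reservation already reaches tens of gigabytes per node at the benchmark scale and grows into the hundreds-of-terabytes range at larger scales, while the uni-address reservation stays constant per worker.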
