Abstract

Task-parallel systems are widely used to parallelize programs. They provide automatic load balancing, and programmers can easily parallelize sequential programs, including irregular ones, without considering the placement of tasks on physical processors. Despite the success of shared memory task parallelism, task parallelism in large-scale distributed memory environments remains challenging. Our work focuses on the flexibility of the task model and the scalability of inter-node load balancing. A general task model provides the ability to suspend and resume tasks at any program point, which enables flexible task scheduling for higher processor utilization, locality-aware task placement, and so on. To realize such a task model, a task must be represented as a thread---an execution context containing register values and stack frames---and thread migration must be implemented for inter-node load balancing. However, the existing thread migration scheme, iso-address, has a scalability limitation: it requires, on each node, virtual memory proportional to the number of processors. In large-scale distributed memory environments, this results in huge virtual memory usage beyond the virtual address space limit of current 64-bit CPUs. Furthermore, this huge virtual memory consumption makes it impossible to implement one-sided work stealing with Remote Direct Memory Access (RDMA) operations. One-sided work stealing is a popular approach to efficient load balancing, so this also limits the scalability of distributed memory task parallelism. In this paper, we propose uni-address, a new thread management scheme for distributed memory task parallelism. It significantly reduces the virtual memory required for thread migration and enables RDMA-based work stealing. We implement a lightweight multithread library supporting RDMA-based work stealing based on the uni-address scheme, and demonstrate its lightweight thread operations and scalable work stealing on the Fujitsu FX10 supercomputing system with three benchmarks: Binary Task Creation, Unbalanced Tree Search, and an N-Queens solver. All benchmarks run with less than 144 KB of virtual memory per processor for thread migration, and achieve more than 95% parallel efficiency on 3840 processing cores relative to the results on 480 processing cores.
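
As a rough illustration of the scaling argument above, the following C sketch compares the virtual address space a node would have to reserve under iso-address (which grows with the total number of workers) with the fixed per-processor footprint reported for uni-address. The worker count, per-worker region size, and usable address-space size are illustrative assumptions, not figures from the paper; only the 144 KB per-processor value comes from the abstract.

/*
 * Back-of-the-envelope comparison of the virtual address space reserved
 * for migratable thread stacks under iso-address vs. uni-address.
 * All parameters are illustrative assumptions, except the 144 KB
 * per-processor figure, which is taken from the abstract.
 */
#include <stdio.h>

int main(void)
{
    const double total_workers     = 80000.0;             /* assumed system-wide worker count */
    const double region_per_worker = 4.0 * (1ULL << 30);  /* assumed stack region reserved per worker: 4 GiB */
    const double usable_va         = (double)(1ULL << 47);/* ~128 TiB of user address space on a typical 48-bit CPU */

    /* iso-address: each node reserves address ranges for the stacks of all
       workers in the system, so the reservation grows with the system size. */
    double iso_per_node = total_workers * region_per_worker;

    /* uni-address: each processor needs only a small fixed region
       (reported as under 144 KB in this paper). */
    double uni_per_proc = 144.0 * 1024.0;

    printf("iso-address reservation per node  : %.1f TiB\n", iso_per_node / (1ULL << 40));
    printf("uni-address reservation per core  : %.0f KiB\n", uni_per_proc / 1024.0);
    printf("iso-address vs. usable addr space : %.0f%%\n", 100.0 * iso_per_node / usable_va);
    return 0;
}

Under these assumed parameters the iso-address reservation alone already exceeds the usable virtual address space, while the uni-address footprint stays constant per processor; this is the gap that makes RDMA-exposable, fixed-size regions (and hence one-sided work stealing) feasible.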
