User-level Threads Research Articles

The efficient solution of sparse linear systems that arise from the discretization of partial differential equations remains a key challenge for a range of high-performance scientific simulations. One approach to reducing data movement and improving performance is to expose and exploit structure in a problem through robust structured multilevel solvers. By choosing a coarsening that preserves the structure of the problem, these methods maintain efficient structured computation and communication throughout the multigrid hierarchy. However, when coarsening is not permitted to depend on the operator, anisotropy must be addressed by the smoother, which must produce error that is compatible with coarse-grid correction under structured coarsening. In this paper, the components required in a scalable parallel structured solver are described, with a focus on the memory and communication efficiency of robust smoothers. While implementing communication and memory reduction techniques in smoothers integrated into a complete 3D solver presents a significant engineering challenge, a novel approach is proposed that addresses these challenges systematically through a change to the solver's execution model. Enabled by user-level threading paired with a set of data and communication abstractions, this approach permits seamless aggregation of communication in plane smoothers while directly reusing code for a 2D distributed multilevel cycle. Results show an effective reduction in communication costs for coarse-grid problems and a speedup of 8.7× in smoothing routines (shown in Fig. 12) using this approach. This produces a significant improvement in strong scalability while maintaining favorable weak scaling behavior. Finally, a parallel scaling study using a series of refined meshes is included that demonstrates the effectiveness of this approach in an application of interest.
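The aggregation idea in the abstract above can be illustrated with a toy sketch. It is not the authors' implementation: Python generators stand in for user-level threads, and `plane_smoother`, `run_aggregated`, and the averaging "smoother" are hypothetical names and arithmetic chosen only for illustration. Each per-plane routine is written as ordinary sequential code that yields where it would communicate, and a scheduler services all suspended planes with one batched exchange.

```python
from typing import Dict, Generator, Tuple

# Hypothetical message type: (plane index, value to exchange).
Message = Tuple[int, float]


def plane_smoother(plane: int, sweeps: int) -> Generator[Message, float, float]:
    """Toy stand-in for a 2D cycle on one plane (assumes sweeps >= 1).

    The routine suspends at every point where it would communicate,
    instead of sending a message on its own.
    """
    value = float(plane)
    for _ in range(sweeps):
        others = yield (plane, value)    # suspend until the exchange completes
        value = 0.5 * (value + others)   # placeholder for real smoothing work
    return value


def run_aggregated(num_planes: int, sweeps: int) -> Tuple[Dict[int, float], int]:
    """Run all plane routines cooperatively, batching their exchanges."""
    threads = {p: plane_smoother(p, sweeps) for p in range(num_planes)}
    # Advance every routine to its first communication point.
    pending = {p: next(t) for p, t in threads.items()}
    results: Dict[int, float] = {}
    exchanges = 0
    while pending:
        exchanges += 1                   # one aggregated exchange per round
        batch = dict(pending.values())   # plane index -> value
        total = sum(batch.values())
        next_pending = {}
        for p in pending:
            try:
                # Resume the routine with the data it was waiting for.
                next_pending[p] = threads[p].send(total - batch[p])
            except StopIteration as stop:
                results[p] = stop.value  # routine finished its sweeps
        pending = next_pending
    return results, exchanges


if __name__ == "__main__":
    results, exchanges = run_aggregated(num_planes=4, sweeps=3)
    print(exchanges)  # 3 batched exchanges instead of 4 * 3 individual ones
```

In this sketch the number of exchanges equals the number of sweeps rather than the number of sweeps times the number of planes, while the per-plane routine is left unchanged.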

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority. More precisely, suppose that a job has work $T_1$ and span $T_\infty$. On a machine with $P$ processors, A-STEAL completes the job in an expected duration of $O(T_1/\tilde{P} + T_\infty + L \lg P)$ time steps, where $L$ is the length of a scheduling quantum and $\tilde{P}$ denotes the $O(T_\infty + L \lg P)$-trimmed availability. This quantity is the average of the processor availability over all time steps except the $O(T_\infty + L \lg P)$ time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, $\tilde{P} < T_1/T_\infty$, the job achieves nearly perfect linear speedup. Conversely, when the trimmed mean dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal. We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998] which does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP.
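The trimmed availability defined in the abstract above can be made concrete with a small sketch. The function names and sample numbers are hypothetical. Because the abstract gives the trim count and the running time only up to constant factors, the trim count `r` is passed in directly and the second function evaluates the bare expression inside the big-O with all hidden constants taken as 1.

```python
import math
from typing import Sequence


def trimmed_availability(availability: Sequence[int], r: int) -> float:
    """Mean processor availability after discarding the r time steps
    that have the highest availability."""
    if r < 0:
        raise ValueError("r must be non-negative")
    kept = sorted(availability)[: max(len(availability) - r, 0)]
    if not kept:
        raise ValueError("trimming removed every time step")
    return sum(kept) / len(kept)


def bound_expression(work: float, span: float, quantum: float,
                     processors: int, p_trimmed: float) -> float:
    """Expression inside the big-O bound: work / trimmed availability
    + span + quantum * lg(processors). Hidden constants are assumed to be 1."""
    return work / p_trimmed + span + quantum * math.log2(processors)


if __name__ == "__main__":
    # An adversarial profile: availability spikes on two steps only.
    profile = [2, 2, 16, 2, 2, 16]
    print(sum(profile) / len(profile))       # plain mean, inflated by the spikes
    print(trimmed_availability(profile, 2))  # 2.0 once the two spikes are trimmed
    print(bound_expression(work=1000, span=10, quantum=5,
                           processors=16, p_trimmed=2.0))  # 530.0
```

Trimming keeps a few high-availability steps, which the job may be unable to use, from inflating the average against which the scheduler is judged.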

Articles published on User-level Threads

Taking the MPI standard and the open MPI library to exascale

ComposableThreads: Rethinking User-level Threads with Composability and Parametricity in C++

Synch: A framework for concurrent data-structures and benchmarks

Scalable line and plane relaxation in a parallel structured multigrid solver

Analyzing the Performance Trade-Off in Implementing User-Level Threads

User-level Threading

User-level Threading

Measuring Overhead of Concurrency and Virtual Memory

Argobots: A Lightweight Low-Level Threading and Tasking Framework

Fiber-based architecture for NFV cloud databases

Rethink Scalable M:N Threading on Modern Operating Systems

Molecule

Preserving the original MPI semantics in a virtualized processor environment

Performance analysis of N-computing device under various load conditions

Application‐specific thread schedulers for internet server applications

Application‐specific thread schedulers for distributed applications

Exploitation of the EDF Scheduling in the Wireless Sensors Networks

Provably Efficient Online Nonclairvoyant Adaptive Scheduling

Adaptive work-stealing with parallelism feedback

Implementation of threads as an operating systems project
