We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS), as implemented in Intel's Threading Building Blocks (TBB), but improves both performance and ease of programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting threshold (sst) for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to a particular dataset, platform, and calling context of the do-all loop, resulting in poor performance portability. LBS overcomes both the performance-portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which also requires no manual tuning but lacks context portability; LBS outperforms AP by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst on a training dataset and then running both on a different execution dataset: LBS outperforms SP by 19.5% on average, while offering improved performance portability without tedious manual tuning. LBS also outperforms SP with sst=1, its default value when left undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealing scheduler, by 54.7%. Finally, compared to serializing inner parallelism (SI), as used by OpenMP, LBS is 54.2% faster.
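To illustrate the manual tuning that SP entails, the minimal sketch below (with a hypothetical array `a`, kernel `f`, and threshold value, none of which come from the paper) shows a TBB do-all loop written with a `simple_partitioner` and a hand-picked grainsize playing the role of the sst, next to the same loop under `auto_partitioner`, which needs no grainsize; LBS removes the need for either tuning knob.

```cpp
#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

// Hypothetical element-wise kernel, used only for illustration.
static double f(double x) { return x * x + 1.0; }

void do_all_sp(std::vector<double>& a, std::size_t sst) {
    // EBS/SP: the blocked_range carries a manually tuned grainsize (sst);
    // simple_partitioner splits the range until chunks are no larger than sst.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, a.size(), /*grainsize=*/sst),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] = f(a[i]);
        },
        tbb::simple_partitioner());
}

void do_all_ap(std::vector<double>& a) {
    // AP: auto_partitioner chooses chunk sizes heuristically, so no sst is
    // required, but its decisions do not adapt to the loop's calling context.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, a.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] = f(a[i]);
        },
        tbb::auto_partitioner());
}
```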