We present Lazy Binary Splitting (LBS), a user-level scheduler of nested parallelism for shared-memory multiprocessors that builds on existing Eager Binary Splitting work-stealing (EBS), as implemented in Intel's Threading Building Blocks (TBB), but improves both performance and ease of programming. In its simplest form (SP), EBS requires manual tuning by repeatedly running the application under carefully controlled conditions to determine a stop-splitting threshold (sst) for every do-all loop in the code. This threshold limits the parallelism and prevents excessive overheads for fine-grain parallelism. Besides being tedious, this tuning also over-fits the code to a particular dataset, platform, and calling context of the do-all loop, resulting in poor performance portability. LBS overcomes both the performance-portability and ease-of-programming pitfalls of a manually fixed threshold by adapting dynamically to run-time conditions without requiring tuning. We compare LBS to Auto-Partitioner (AP), the latest default scheduler of TBB, which also requires no manual tuning but lacks context portability; LBS outperforms AP by 38.9% using TBB's default AP configuration, and by 16.2% after we tuned AP to our experimental platform. We also compare LBS to SP by manually finding SP's sst on a training dataset and then running both on a different execution dataset: LBS outperforms SP by 19.5% on average, while offering improved performance portability without tedious manual tuning. LBS also outperforms SP with sst=1, its default value when left undefined, by 56.7%, and serializing work-stealing (SWS), another work-stealing scheduler, by 54.7%. Finally, compared to serializing inner parallelism (SI), as used by OpenMP, LBS is 54.2% faster.
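To illustrate the manual tuning that SP entails, the minimal sketch below (with a hypothetical array `a`, kernel `f`, and threshold value, none of which come from the paper) shows a TBB do-all loop written with a `simple_partitioner` and a hand-picked grainsize playing the role of the sst, next to the same loop under `auto_partitioner`, which needs no grainsize; LBS removes the need for either tuning knob.

```cpp
#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

// Hypothetical element-wise kernel, used only for illustration.
static double f(double x) { return x * x + 1.0; }

void do_all_sp(std::vector<double>& a, std::size_t sst) {
    // EBS/SP: the blocked_range carries a manually tuned grainsize (sst);
    // simple_partitioner splits the range until chunks are no larger than sst.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, a.size(), /*grainsize=*/sst),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] = f(a[i]);
        },
        tbb::simple_partitioner());
}

void do_all_ap(std::vector<double>& a) {
    // AP: auto_partitioner chooses chunk sizes heuristically, so no sst is
    // required, but its decisions do not adapt to the loop's calling context.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, a.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                a[i] = f(a[i]);
        },
        tbb::auto_partitioner());
}
```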