Multi-threaded Workloads Research Articles

Design of an efficient thread-safe concurrent data structure is a balancing act between its implementation complexity and performance. Lock-based concurrent data structures, which are relatively easy to derive from their sequential counterparts and to prove thread-safe, suffer from poor throughput under even light multi-threaded workload. At the same time, lock-free concurrent structures allow for high throughput, but are notoriously difficult to get right and require careful reasoning to formally establish their correctness. In this work, we explore a solution to this conundrum based on a relatively old idea of batch parallelism, an approach for designing high-throughput concurrent data structures via a simple insight: efficiently processing a batch of a priori known operations in parallel is easier than optimising performance for a stream of arbitrary asynchronous requests. Alas, batch-parallel structures have not seen wide practical adoption due to (i) the inconvenience of having to structure multi-threaded programs to explicitly group operations and (ii) the lack of a systematic methodology to implement batch-parallel structures as simply as lock-based ones. We present OBatcher, a Multicore OCaml library that streamlines the design, implementation, and usage of batch-parallel structures. OBatcher solves the first challenge (how to use) by suggesting a new lightweight implicit batching design pattern that is built on top of generic asynchronous programming mechanisms. The second challenge (how to implement) is addressed by identifying a family of strategies for converting common sequential structures into the corresponding efficient batch-parallel versions, and by providing a library of functors that embody those strategies. We showcase OBatcher with a diverse set of benchmarks ranging from Red-Black and AVL trees to van Emde Boas trees, skip lists, and a thread-safe implementation of a Datalog solver. Our evaluation of all the implementations on large asynchronous workloads shows that (a) they consistently outperform the corresponding coarse-grained lock-based implementations (the only ones available in OCaml to date), and that (b) their throughput scales reasonably with the number of processors.

To fully exploit the scaling performance in Chip Multiprocessors, applications must be divided into semi-independent processes that can run concurrently on multiple cores within a system. One major class of such applications, shared-memory, multi-threaded applications, requires programmers to insert thread synchronization primitives (i.e., locks, barriers, and condition variables) in their critical sections to synchronize data access between processes. For this class of applications, scaling performance requires balanced per-thread workloads with little time spent in critical sections. In practice, however, threads often waste significant time waiting to acquire locks/barriers in their critical sections, leading to thread imbalance and poor performance scaling. Moreover, critical sections often stall data prefetchers that mitigate the effects of long critical section stalls by ensuring data is preloaded in the core caches when the critical section is complete. In this paper, we examine a pure hardware technique to enable safe data prefetching beyond synchronization points in CMPs. We show that successful prefetching beyond synchronization points requires overcoming two significant challenges in existing prefetching techniques. First, we find that typical data prefetchers are designed to trigger prefetches based on current misses. This approach works well for traditional, continuously executing, single-threaded applications. However, when a thread stalls on a synchronization point, it typically does not produce any new memory references to trigger a prefetcher. Second, even in the event that a prefetch were to be correctly directed to read beyond a synchronization point, it will likely prefetch shared data from another core before this data has been written. While this prefetch would be considered “accurate”, it is highly undesirable, because such a prefetch would lead to three extra “ping-pong” movements back and forth between private caches in the producing and consuming cores, incurring more latency and energy overhead than without prefetching. We develop a new data prefetcher, Multi-Thread B-Fetch (MTB-Fetch), built as an extension to a previous single-threaded data prefetcher. MTB-Fetch addresses both issues in prefetching for shared memory multi-threaded workloads. MTB-Fetch achieves a speedup of 9.3 percent for multi-threaded applications with little additional hardware.

Articles published on Multi-threaded Workloads

Concurrent Data Structures Made Easy

Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling

Dynamic Thermal Management of 3D Memory through Rotating Low Power States and Partial Channel Closure

Learning-based Phase-aware Multi-core CPU Workload Forecasting

A Pressure-Aware Policy for Contention Minimization on Multicore Systems

Exploiting Long-Term Temporal Cache Access Patterns for LRU Insertion Prioritization

Toward a general framework for jointly processor-workload empirical modeling

On the use of many-core Marvell ThunderX2 processor for HPC workloads

Architecture-Aware Approximate Computing

An efficient cache flat storage organization for multithreaded workloads for low power processors

Custard: ASIC Workload-Aware Reliable Design for Multicore IoT Processors

A novel power model for future heterogeneous 3D chip-multiprocessors in the dark silicon age

MTB-Fetch: Multithreading Aware Hardware Prefetching for Chip Multiprocessors

A DVFS-based Memory Contention-Aware Scheduling Technique for Multi-threaded Workloads

Exploring System Availability During Software-Based Self-Testing of Multi-core CPUs

TLB Shootdown Mitigation for Low-Power Many-Core Servers with L1 Virtual Caches

VarCatcher: A Framework for Tackling Performance Variability of Parallel Workloads on Multi-Core

Thread Data Sharing in Cache

Scale & Cap
