Cache Misses Research Articles

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently, and it removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.
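
The core idea in the abstract above, handing each critical section to one dedicated server thread instead of acquiring a lock, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract describes optimized remote procedure calls to a dedicated hardware thread, whereas the names `CriticalSectionServer`, `execute`, and `run_demo` are hypothetical, and Python's `queue.Queue` is itself lock-based. The sketch therefore shows only the structure of the delegation, not its performance benefit.

```python
import queue
import threading


class CriticalSectionServer:
    """Runs every submitted critical section on one dedicated thread.

    Illustrative only: all names here are assumptions, not the RCL API.
    """

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # The server thread is the only one that executes critical sections,
        # so the shared data they touch is never accessed concurrently.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            func, args, reply = item
            try:
                reply["result"] = func(*args)
            except Exception as exc:  # hand the error back to the caller
                reply["error"] = exc
            reply["done"].set()

    def execute(self, func, *args):
        """Ask the server to run func(*args); block until it has finished."""
        reply = {"done": threading.Event()}
        self._requests.put((func, args, reply))
        reply["done"].wait()
        if "error" in reply:
            raise reply["error"]
        return reply["result"]

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()


def run_demo(num_clients=8, increments=1000):
    """Increment a shared counter from several clients via the server."""
    server = CriticalSectionServer()
    state = {"counter": 0}  # modified only by the server thread

    def increment():
        state["counter"] += 1
        return state["counter"]

    def client():
        for _ in range(increments):
            server.execute(increment)

    clients = [threading.Thread(target=client) for _ in range(num_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    server.shutdown()
    return state["counter"]
```

Calling `run_demo()` returns `num_clients * increments`, because no increment is lost even though the client threads never take a lock around the counter themselves.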

Power and energy have become increasingly important concerns in the design and implementation of today's multicore/manycore chips. In this paper, we present two priority-based CPU scheduling algorithms, Algorithm Cache Miss Priority CPU Scheduler ($\mathcal{CM}$-PCS) and Algorithm Context Switch Priority CPU Scheduler ($\mathcal{CS}$-PCS), which take advantage of often ignored dynamic performance data, in order to reduce power consumption by over 20 percent with a significant increase in performance. Our algorithms utilize Linux cpusets and cores operating at different fixed frequencies. Many other techniques, including dynamic frequency scaling, can lower a core's frequency during the execution of a non-CPU intensive task, thus lowering performance. Our algorithms match processes to cores better suited to execute those processes in an effort to lower the average completion time of all processes in an entire task, thus improving performance. They also consider a process's cache miss/cache reference ratio, number of context switches and CPU migrations, and system load. Finally, our algorithms use dynamic process priorities as scheduling criteria. We have tested our algorithms using a real AMD Opteron 6134 multicore chip and measured results directly using the “KillAWatt” meter, which samples power periodically during execution. Our results show not only a power (energy/execution time) savings of 39 watts (21.43 percent) and 38 watts (20.88 percent), but also a significant improvement in the performance, performance per watt, and execution time $\cdot$ watt (energy) for a task consisting of 24 concurrently executing benchmarks, when compared to the default Linux scheduler and CPU frequency scaling governor.
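
The cache miss/cache reference ratio mentioned in the abstract above can be made concrete with a small Python sketch. The abstract does not state the rule the schedulers apply to it, so everything beyond the ratio itself is an assumption made for illustration: the `ProcessStats` record, the `threshold` value, the group names, and the choice to treat processes with a high ratio as memory-bound and place them on lower-frequency cores. The sketch also ignores context switches, CPU migrations, system load, and Linux cpusets, all of which the abstract says the real algorithms take into account.

```python
from dataclasses import dataclass


@dataclass
class ProcessStats:
    """Hypothetical per-process counters, e.g. sampled from hardware events."""
    name: str
    cache_misses: int
    cache_references: int


def miss_ratio(stats):
    """Cache misses divided by cache references (0.0 if nothing was sampled)."""
    if stats.cache_references == 0:
        return 0.0
    return stats.cache_misses / stats.cache_references


def assign_core_groups(processes, threshold=0.5):
    """Split processes between two groups of fixed-frequency cores.

    ASSUMPTION (not from the abstract): a ratio above `threshold` marks a
    process as memory-bound, and memory-bound processes go to the
    low-frequency group.
    """
    groups = {"high_freq": [], "low_freq": []}
    for proc in sorted(processes, key=miss_ratio):
        group = "low_freq" if miss_ratio(proc) > threshold else "high_freq"
        groups[group].append(proc.name)
    return groups
```

For example, with counters of 10/100, 90/100, and 0/0 for three processes, the first and third land in the `high_freq` group and the second in `low_freq` under the default threshold.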

Articles published on Cache Misses

Fast and Portable Locking for Multicore Architectures

Windowed multipole for cross section Doppler broadening

Improving Cache Power and Performance Using Deterministic Naps and Early Miss Detection

Accelerating asynchronous programs through event sneak peek

Designing a time predictable memory hierarchy for single-path code

Efficient FIB caching using minimal non-overlapping prefixes

Tuning compilations by multi-objective optimization: Application to Apache web server

Engineering Efficient Paging Algorithms

CASA: Contention-Aware Scratchpad Memory Allocation for Online Hybrid On-Chip Memory Management

Software-Based Self-Test for Small Caches in Microprocessors

In-cache query co-processing on coupled CPU-GPU architectures

Retention Benefit Based Intelligent Cache Replacement

A formal approach to the WCRT analysis of multicore systems with memory contention under phase-structured task sets

Limiting CPU Frequency Scaling Considering Main Memory Accesses

Improving multiprocessor performance with fine-grain coherence bypass

An approach to multicore parallelism using functional programming: A case study based on Presburger Arithmetic

The Direct-to-Data (D2D) cache

Cache-related preemption delay analysis for FIFO caches

CPU Scheduling for Power/Energy Management on Multicore Processors Using Cache Miss and Context Switch Data

Midpoint cell method for hybrid (MPI+OpenMP) parallelization of molecular dynamics simulations
