Fair and adaptive online set-based cache partitioning

Abstract

In a multiprocessing environment, the shared last-level cache suffers extra misses because each process tries to use the whole cache space, which causes heavy interference between working sets and degrades performance. In this paper, we propose a new cache partitioning scheme which, in contrast to prior partitioning schemes in the literature, offers both high performance and fair allocation even as the number of simultaneously running processes increases and cache associativity decreases. The proposed scheme takes advantage of the large number of sets in the cache by partitioning it dynamically set-wise. The partitioning adapts to the changing requirements of the mix of running processes without any offline information and with an eye on fair allocation of cache resources. Performance is evaluated by experiments on different cache sizes and associativities, and by comparison with the Least Recently Used (LRU) baseline cache and with other partitioning schemes. We experimented on mixes from the Splash Benchmark Suite. The results show a performance speedup of up to 44%, with an average speedup of 15% over all mixes, and a fairness improvement of up to 46% in terms of the weighted harmonic mean, with an average of 14% over all mixes.
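
As a rough illustration of the two ideas named in the abstract, the sketch below shows one possible way to index a set-partitioned cache and one common form of a harmonic-mean fairness metric. The abstract does not give the actual indexing function or the exact metric, so `SetPartition`, `set_index`, `harmonic_mean_speedup`, the contiguous-range layout, and the 64-byte line size are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetPartition:
    """Hypothetical descriptor: a contiguous range of cache sets owned by one process."""
    first_set: int  # index of the first set in the range
    num_sets: int   # number of consecutive sets in the range


def set_index(address: int, part: SetPartition, line_bytes: int = 64) -> int:
    """Map an address to a set inside the owning process's range (assumed scheme)."""
    block = address // line_bytes
    return part.first_set + block % part.num_sets


def harmonic_mean_speedup(ipc_shared, ipc_alone):
    """Harmonic mean of per-process relative IPC (shared / alone).

    This is one common fairness-oriented metric; the paper's exact
    weighting may differ.
    """
    relative = [s / a for s, a in zip(ipc_shared, ipc_alone)]
    return len(relative) / sum(1.0 / r for r in relative)
```

Under this assumed layout, growing or shrinking a process's share amounts to changing `num_sets`, which is why a cache with many sets offers finer granularity than way-based partitioning at low associativity.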

Similar Papers
  • Research Article
  • Cited by 70
  • 10.1145/2086696.2086732
Writeback-aware partitioning and replacement for last-level caches in phase change main memory systems
  • Jan 1, 2012
  • ACM Transactions on Architecture and Code Optimization
  • Miao Zhou + 4 more

Phase-Change Memory (PCM) has emerged as a promising low-power main memory candidate to replace DRAM. The main problems of PCM are that writes are much slower and more power hungry than reads, write bandwidth is much lower than read bandwidth, and write endurance is limited. Adding an extra layer of cache, which is logically the last-level cache (LLC), can mitigate the drawbacks of PCM. However, writebacks from the LLC might (a) overwhelm the limited PCM write bandwidth and stall the application, (b) shorten lifetime, and (c) increase energy consumption. Cache partitioning and replacement schemes are important to achieve high throughput for multi-core systems. However, we noted that no existing partitioning and replacement policy takes into account the writeback information. This paper proposes two writeback-aware schemes to manage the LLC for PCM main memory systems. Writeback-aware Cache Partitioning (WCP) is a runtime mechanism that partitions a shared LLC among multiple applications. Unlike past partitioning schemes, our scheme considers the reduction in cache misses as well as writebacks. The Write Queue Balancing (WQB) replacement policy manages the cache partition of each application intelligently so that the writebacks are distributed evenly among PCM write queues. In this way, applications rarely stall due to unbalanced PCM write traffic among write queues. Our evaluation shows that WCP and WQB result in, on average, a 21% improvement in throughput, a 49% reduction in PCM writes, and a 14% reduction in energy over a state-of-the-art cache partitioning scheme.

  • Research Article
  • Cited by 3
  • 10.3390/electronics7090172
Access Adaptive and Thread-Aware Cache Partitioning in Multicore Systems
  • Sep 1, 2018
  • Electronics
  • Kai Huang + 4 more

Cache partitioning is a successful technique for saving energy for a shared cache and all the existing studies focus on multi-program workloads running in multicore systems. In this paper, we are motivated by the fact that a multi-thread application generally executes faster than its single-thread counterpart and its cache accessing behavior is quite different. Based on this observation, we study applications running in multi-thread mode and classify data of the multi-thread applications into shared and private categories, which helps reduce the interference between shared and private data and contributes to constructing a more efficient cache partitioning scheme. We also propose a hardware structure to support these operations. Then, an access adaptive and thread-aware cache partitioning (ATCP) scheme is proposed, which assigns separate cache portions to shared and private data to avoid the evictions caused by conflicts between data of different categories in the shared cache. The proposed ATCP achieves lower energy consumption while improving the performance of applications compared with the least recently used (LRU) managed, core-based even partitioning (EVEN) and utility-based cache partitioning (UCP) schemes. The experimental results show that ATCP can achieve 29.6% and 19.9% average energy savings compared with the LRU and UCP schemes in a quad-core system. Moreover, the average speedup of multi-thread ATCP with respect to single-thread LRU is 1.89.

  • Conference Article
  • Cited by 6
  • 10.1145/3337821.3337895
CPpf
  • Aug 5, 2019
  • Jun Xiao + 2 more

Hardware cache prefetching is deployed in modern multicore processors to reduce memory latencies, addressing the memory wall problem. However, it tends to increase the Last Level Cache (LLC) contention among applications in multiprogrammed workloads, leading to a performance degradation for the overall system. To study the interaction between hardware prefetching and LLC cache management, we first analyze the variation of application performance when varying the effective LLC space in the presence and absence of hardware prefetching. We observe that hardware prefetching can compensate for the application performance loss due to the reduced effective cache space. Motivated by this observation, we classify applications into two categories, prefetching sensitive (PS) and non prefetching sensitive (NPS) applications, by the degree of performance benefit they experience from hardware prefetchers. To address the cache contention and also to mitigate the potential prefetch-related cache interference, we propose CPpf, a cache partitioning approach for improving the shared cache management in the presence of hardware prefetching. CPpf consists of a method using Precise Event-Based Sampling techniques for the online classification of PS and NPS applications and a cache partitioning scheme using Cache Allocation Technology to distribute the cache space among PS and NPS applications. We implemented CPpf as a user-level runtime system on Linux. Compared with a non-partitioning approach, CPpf achieves speedups of up to 1.20, 1.08 and 1.06 for workloads with 2, 4 and 8 single-threaded applications, respectively. Moreover, it achieves speedups of up to 1.22 and 1.11 for workloads composed of two applications with 4 threads and 8 threads, respectively.

  • Conference Article
  • Cited by 4
  • 10.1145/2554850.2554992
Creating heterogeneity at run time by dynamic cache and bandwidth partitioning schemes
  • Mar 24, 2014
  • Aryabartta Sahu + 1 more

A heterogeneous chip multiprocessor (CMP) architecture consists of processor cores and caches of varying size and complexity. In a multi-programmed computing environment, threads of execution exhibit different run-time characteristics and hardware resource requirements, so a heterogeneous multiprocessor significantly outperforms a homogeneous multiprocessor system. Issues in designing and managing heterogeneity in multiprocessors have a significant impact on overall system cost and performance. These issues are: (a) replicating standard cores is an efficient strategy in homogeneous CMP design, but a heterogeneous CMP architecture, particularly a fully custom heterogeneous processor not necessarily composed of pre-existing cores, incurs additional costs in design, verification, and testing; (b) to take advantage of a heterogeneous architecture, an appropriate policy for mapping running tasks to processor cores must be determined to maximize the performance of the whole system by accurately exploiting its resources, so very good software is required to take advantage of heterogeneity; and (c) processor speeds are improving much faster than memory speeds, so data access time dominates the execution time of many programs. In a multiprocessor environment this gap widens: as the core count of a chip multiprocessor increases, on-chip cache and off-chip memory bandwidth become scarcer for each core. In this paper, we propose a method of creating heterogeneity at run time by partitioning cache and memory bandwidth. This lets us use pre-existing standard cores in the multiprocessor design and a basic scheduler that does not consider heterogeneity, since heterogeneity is created at run time by partitioning cache and bandwidth. We also describe a method of creating system heterogeneity by coordinated partitioning of the shared last-level cache and off-chip memory bandwidth. We propose an efficient, low-overhead approach to partitioning the cache set-wise, which separates the addressing part from the data part and uses a graceful space acquirement policy. This approach re-partitions the cache quickly, with minimum overhead and at a smaller granularity. We also extend the CPI-based bandwidth partitioning model to handle the read/write access behavior of applications. Finally, we analyze and experimentally evaluate six different cache partitioning schemes and conclude that partitioning based on available bandwidth partitioning and L2 access frequency outperforms the others.

  • Research Article
  • Cited by 5
  • 10.1007/s11227-019-02891-w
Time-sensitivity-aware shared cache architecture for multi-core embedded systems
  • May 18, 2019
  • The Journal of Supercomputing
  • Myoungjun Lee + 1 more

In embedded systems such as automotive systems, multi-core processors are expected to improve performance and reduce manufacturing cost by integrating multiple functions on a single chip. However, inter-core interference in shared last-level cache (LLC) results in increased and unpredictable execution times for time-sensitive tasks (TSTs), which have (soft) timing constraints, thereby increasing the deadline miss rates of such systems. In this paper, we propose a time-sensitivity-aware dead block-based shared LLC architecture to mitigate these problems. First, a time-sensitivity indication bit is added to each cache block, which allows the proposed LLC architecture to be aware of instructions/data belonging to TSTs. Second, portions of the LLC space are allocated to general tasks without interfering with TSTs by developing a time-sensitivity-aware dead block-based cache partitioning technique. Third, to reduce the deadline miss rate of TSTs further, we propose a task matching in shared caches and a cache partitioning scheme that considers the memory access characteristics and the time-sensitivity of tasks (TATS). The TATS is combined with our proposed dead block-based scheme. Our evaluation shows that the proposed schemes reduce deadline miss rates of TSTs compared to conventional shared caches. On a dual-core system, compared to a baseline, equal partitioning, and state-of-the-art quality-of-service-aware cache partitioning, our proposed dead block-based cache partitioning provides 9.3%, 30.5%, and 2.6% lower average deadline miss rates, respectively. On a quad-core system, compared to the baseline, equal partitioning, and state-of-the-art quality-of-service-aware cache partitioning, the combination of our proposed schemes provides 21.2%, 17.7%, and 4.1% lower average deadline miss rates, respectively.

  • Conference Article
  • Cited by 2
  • 10.1109/iccd56317.2022.00066
A Lightweight and Adaptive Cache Partitioning Scheme for Content Delivery Networks
  • Oct 1, 2022
  • Peng Wang + 5 more

Allocating exclusive resources for different applications in content delivery networks (CDNs) allows for a higher overall hit ratio. The cache partitioning schemes on Last-Level Cache (LLC) are promising solutions that dynamically split cache sizes into partitions corresponding to threads by the miss ratio curve (MRC). Nonetheless, due to the sheer number of applications and various item sizes in CDNs, partitioning via MRC will cause high computational overheads and performance fluctuations. As a result, in this paper, we propose a lightweight and adaptive cache partitioning scheme (LAP) for CDNs. LAP establishes a shadow cache for each partition, where the size of the partition and its shadow cache is equal to the size of the integral cache. The average number of hits on the granularity unit in the shadow caches, where the size of the granularity equals the size of the probable largest item, is used to sort N partitions in decreasing order. When resizing partitions, LAP transfers a capacity of the size of granularity from the (N - k + 1)-th partition into the k-th partition, for $k \leq \frac{N}{2}$. Meanwhile, we provide a threshold that neglects partition resizing and improves partitioning efficiency. This lightweight scheme can enhance resource utilization by progressively adapting to workload variations. We have deployed LAP in PicCloud of Company-T and LAP can improve the OHR by 9.34% and reduce the average user access latency by 12.5 ms. Then, we verify LAP in the public trace from Akamai and the real trace from PicCloud. Experimental results demonstrate that LAP outperforms other cache partitioning schemes and tackles the performance cliff problem with little overhead.
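
The resizing rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `lap_resize` and its parameters are hypothetical names, and the abstract does not say how the threshold is applied, so the sketch assumes a pair is skipped when the difference in shadow-cache hits does not exceed the threshold. The guard against shrinking a partition below zero is also an assumption.

```python
def lap_resize(sizes, shadow_hits, granularity, threshold=0.0):
    """One resizing round over N partitions (illustrative sketch).

    sizes:       dict partition -> current capacity
    shadow_hits: dict partition -> average hits per granularity unit
                 observed in that partition's shadow cache
    """
    # Sort partitions by shadow-cache hits in decreasing order.
    order = sorted(sizes, key=lambda p: shadow_hits[p], reverse=True)
    n = len(order)
    new_sizes = dict(sizes)
    for k in range(1, n // 2 + 1):
        gainer = order[k - 1]  # the k-th partition
        donor = order[n - k]   # the (N - k + 1)-th partition
        # Assumed threshold rule: skip pairs whose benefit is too small.
        if shadow_hits[gainer] - shadow_hits[donor] <= threshold:
            continue
        # Assumed guard: never shrink a partition below zero.
        if new_sizes[donor] < granularity:
            continue
        new_sizes[donor] -= granularity
        new_sizes[gainer] += granularity
    return new_sizes
```

Each round only sorts N numbers and moves at most N/2 granularity units, which is consistent with the abstract's claim of avoiding the cost of building miss ratio curves.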

  • Conference Article
  • Cited by 375
  • 10.1109/hpca.2008.4658653
Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems
  • Feb 1, 2008
  • Proceedings - International Symposium on High-Performance Computer Architecture
  • Jiang Lin + 5 more

Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS). Our software approach makes it possible to run the SPEC CPU2006 benchmark suite to completion. Besides confirming important conclusions from previous work, we are able to gain several insights from whole-program executions, which are infeasible from simulation. For example, giving up some cache space in one program to help another one may improve the performance of both programs for certain workloads due to reduced contention for memory bandwidth. Our evaluation of previously proposed fairness metrics is also significantly different from a simulation-based study. The contributions of this study are threefold. (1) To the best of our knowledge, this is a highly comprehensive execution- and measurement-based study on multicore cache partitioning. This paper not only confirms important conclusions from simulation-based studies, but also provides new insights into dynamic behaviors and interaction effects. (2) Our approach provides a unique and efficient option for evaluating multicore cache partitioning. The implemented software layer can be used as a tool in multicore performance evaluation and hardware design. (3) The proposed schemes can be further refined for OS kernels to improve performance.
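
The abstract does not spell out how memory address mapping yields cache partitions. One widely used OS-level technique for this is page coloring, where the physical page number determines which group of cache sets a page can occupy. The sketch below assumes a physically indexed cache and 4 KiB pages; `num_page_colors` and `page_color` are illustrative names, not taken from the paper.

```python
def num_page_colors(cache_bytes: int, ways: int, page_bytes: int = 4096) -> int:
    """Number of distinct page colors for a physically indexed cache."""
    return cache_bytes // (ways * page_bytes)


def page_color(phys_addr: int, cache_bytes: int, ways: int,
               page_bytes: int = 4096) -> int:
    """Color of the physical page containing phys_addr.

    Pages with different colors map to disjoint groups of cache sets,
    so an OS can partition the cache by restricting which colors each
    process may be allocated.
    """
    page_number = phys_addr // page_bytes
    return page_number % num_page_colors(cache_bytes, ways, page_bytes)
```

For example, a 4 MiB, 16-way cache with 4 KiB pages has 64 colors under this model, so giving one process 16 colors reserves roughly a quarter of the cache for it.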

  • Book Chapter
  • Cited by 27
  • 10.1007/978-3-642-19448-1_8
Power-Aware Dynamic Cache Partitioning for CMPs
  • Jan 1, 2011
  • Isao Kotera + 4 more

Cache partitioning and power-gating schemes are major research topics to achieve a high-performance and low-power shared cache for next generation chip multiprocessors (CMPs). We propose a power-aware cache partitioning mechanism, which is a scheme to realize both low power and high performance using power-gating and cache partitioning at the same time. The proposed cache mechanism is composed of a way-allocation function and a power control function; each function works based on the cache locality assessment. The performance evaluation results show that the proposed cache mechanism with a performance-oriented parameter setting can reduce energy consumption by 20% while maintaining performance, and the mechanism with an energy-oriented parameter setting can reduce energy consumption by 54% with a performance degradation of 13%. The hardware implementation results indicate that the delay and area overheads to control the proposed mechanism are negligible, and therefore hardly affect either the entire chip design or performance.

  • Conference Article
  • Cited by 6
  • 10.1109/iccd53106.2021.00068
Premier: A Concurrency-Aware Pseudo-Partitioning Framework for Shared Last-Level Cache
  • Oct 1, 2021
  • Xiaoyang Lu + 2 more

As the number of on-chip cores and application demands increase, efficient management of shared cache resources becomes imperative. Cache partitioning techniques have been studied for decades to reduce interference between applications in a shared cache and provide performance and fairness guarantees. However, there are few studies on how concurrent memory accesses affect the effectiveness of partitioning. When concurrent memory requests exist, cache misses alone do not reflect concurrency overlapping well. In this work, we first introduce pure misses per kilo instructions (PMPKI), a metric that quantifies the cache efficiency considering concurrent access activities. Then we propose Premier, a dynamically adaptive concurrency-aware cache pseudo-partitioning framework. Premier provides insertion and promotion policies based on PMPKI curves to achieve the benefits of cache partitioning. Finally, our evaluation of various workloads shows that Premier outperforms state-of-the-art cache partitioning schemes in terms of performance and fairness. In an 8-core system, Premier achieves 15.45% higher system performance and 10.91% better fairness than the UCP scheme.

  • Conference Article
  • Cited by 28
  • 10.1145/1654059.1654066
A case for integrated processor-cache partitioning in chip multiprocessors
  • Nov 14, 2009
  • Shekhar Srikantaiah + 4 more

Existing cache partitioning schemes are designed in a manner oblivious to the implicit processor partitioning enforced by the operating system. This paper examines an operating system directed integrated processor-cache partitioning scheme that partitions both the available processors and the shared cache in a chip multiprocessor among different multi-threaded applications. Extensive simulations using a set of multiprogrammed workloads show that our integrated processor-cache partitioning scheme facilitates achieving better performance isolation as compared to state of the art hardware/software based solutions. Specifically, our integrated processor-cache partitioning approach performs, on an average, 20.83% and 14.14% better than equal partitioning and the implicit partitioning enforced by the underlying operating system, respectively, on the fair speedup metric on an 8 core system. We also compare our approach to processor partitioning alone and a state-of-the-art cache partitioning scheme and our scheme fares 8.21% and 9.19% better than these schemes on a 16 core system.

  • Conference Article
  • Cited by 12
  • 10.1145/2024724.2024936
A helper thread based dynamic cache partitioning scheme for multithreaded applications
  • Jun 5, 2011
  • Mahmut Kandemir + 2 more

Focusing on the problem of how to partition the cache space given to a multithreaded application across its threads, we show that different threads of a multithreaded application can have different cache space requirements, propose a fully automated, dynamic, intra-application cache partitioning scheme targeting emerging multicores with multilayer cache hierarchies, present a comprehensive experimental analysis of the proposed scheme, and show average improvements of 17.1% and 18.6% in SPECOMP and PARSEC suites.

  • Conference Article
  • Cited by 23
  • 10.1109/ipdps.2010.5470416
Intra-application cache partitioning
  • Apr 1, 2010
  • Sai Prashanth Muralidhara + 2 more

Efficient management of shared on-chip resources such as the shared level 2 (L2) cache has become an important problem with the emergence of chip multiprocessors (CMPs). Partitioning the shared cache in CMPs among concurrently executing applications can provide important benefits such as throughput improvement, fairness guarantees, and quality of service (QoS) enhancements. In this paper, we pose an interesting related question: can partitioning the shared cache space among concurrently executing threads of the same application enhance the application's performance? We address this problem by identifying and speeding up the slowest thread, also termed the critical path thread, during each execution interval, since the overall performance of a multithreaded application is determined by the critical path thread. To do so, we propose a dynamic, runtime-system-based cache partitioning scheme that partitions the shared cache space dynamically among the individual threads of a given application. In a nutshell, we wish to take some cache space away from the faster threads and give it to the critical path thread at each execution interval. We show that speeding up the critical path thread this way results in an overall performance enhancement of the application execution in the long term. Our experimental evaluation indicates that the proposed dynamic cache partitioning scheme yields benefits of up to 15% over a shared cache with no partitions, up to 23% over a statically partitioned cache (private cache), and up to 20% over a throughput-oriented scheme.
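
The per-interval idea described above (take cache space from faster threads, give it to the critical path thread) can be sketched as below. This is an illustration under stated assumptions rather than the paper's algorithm: `repartition`, the use of cache ways as the allocation unit, the one-way step, and the single-donor policy are all hypothetical choices.

```python
def repartition(ways, progress, step=1, min_ways=1):
    """One interval of critical-path-driven repartitioning (sketch).

    ways:     dict thread -> cache ways currently assigned
    progress: dict thread -> progress in the last interval
              (higher means faster; the slowest thread is critical)
    """
    critical = min(progress, key=progress.get)
    # Consider donors from fastest to slowest, excluding the critical thread.
    donors = sorted((t for t in ways if t != critical),
                    key=progress.get, reverse=True)
    new_ways = dict(ways)
    for donor in donors:
        # Assumed guard: every thread keeps at least min_ways.
        if new_ways[donor] - step >= min_ways:
            new_ways[donor] -= step
            new_ways[critical] += step
            break
    return new_ways
```

Calling this once per execution interval moves capacity gradually, so a thread that stops being critical will itself become a donor in later intervals.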

  • Conference Article
  • Cited by 170
  • 10.1145/2830772.2830803
The application slowdown model
  • Dec 5, 2015
  • Lavanya Subramanian + 4 more

In a multi-core system, interference at shared resources (such as caches and main memory) slows down applications running on different cores. Accurately estimating the slowdown of each application has several benefits: e.g., it can enable shared resource allocation in a manner that avoids unfair application slowdowns or provides slowdown guarantees. Unfortunately, prior works on estimating slowdowns either lead to inaccurate estimates, do not take into account shared caches, or rely on a priori application knowledge. This severely limits their applicability. In this work, we propose the Application Slowdown Model (ASM), a new technique that accurately estimates application slowdowns due to interference at both the shared cache and main memory, in the absence of a priori application knowledge. ASM is based on the observation that the performance of each application is strongly correlated with the rate at which the application accesses the shared cache. Thus, ASM reduces the problem of estimating slowdown to that of estimating the shared cache access rate of the application had it been run alone on the system. To estimate this for each application, ASM periodically 1) minimizes interference for the application at the main memory, and 2) quantifies the interference the application receives at the shared cache, in an aggregate manner for a large set of requests. Our evaluations across 100 workloads show that ASM has an average slowdown estimation error of only 9.9%, a 2.97× improvement over the best previous mechanism. We present several use cases of ASM that leverage its slowdown estimates to improve fairness, performance and provide slowdown guarantees. We provide detailed evaluations of three such use cases: slowdown-aware cache partitioning, slowdown-aware memory bandwidth partitioning and an example scheme to provide soft slowdown guarantees. Our evaluations show that these new schemes perform significantly better than state-of-the-art cache partitioning and memory scheduling schemes.
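
Because ASM ties performance to the shared cache access rate, a slowdown estimate reduces to a ratio of two rates: the estimated rate had the application run alone, over the rate measured while sharing. The sketch below shows that ratio and how a slowdown-aware policy might pick the application to favor. The function names are hypothetical, and the hard part of ASM (estimating the alone rate at runtime) is taken here as an input.

```python
def estimate_slowdown(car_alone_estimate: float, car_shared: float) -> float:
    """Slowdown as the ratio of alone to shared cache access rates."""
    if car_alone_estimate <= 0 or car_shared <= 0:
        raise ValueError("access rates must be positive")
    return car_alone_estimate / car_shared


def most_slowed_down(car_alone: dict, car_shared: dict) -> str:
    """Return the application with the highest estimated slowdown.

    A slowdown-aware partitioning policy could use this as the
    candidate to receive more shared resources (assumed policy).
    """
    slowdowns = {app: estimate_slowdown(car_alone[app], car_shared[app])
                 for app in car_shared}
    return max(slowdowns, key=slowdowns.get)
```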

  • Research Article
  • 10.1088/1757-899x/767/1/012059
Shared Cache Partitioning Based on Performance Gain Estimations
  • Feb 1, 2020
  • IOP Conference Series: Materials Science and Engineering
  • N Mahrom + 1 more

In multiprocessor systems, dynamic cache distribution has been used to increase system performance by effectively partitioning the cache resources. However, the different performance metrics used at runtime to dynamically decide the partition sizes can have different impacts on performance, as well as varying impacts on the hardware cost of the system. In this paper, we propose an Adaptive CPI-based Cache Partitioning (ACCP) scheme to provide better utilisation of the shared cache resources among the competing applications in the system. ACCP uses performance gain estimations of the cache, without incurring significant hardware overhead. It aims to allow all applications in the system to run at approximately the same speed by accelerating the slowest application without significantly decelerating the others. We evaluated ACCP on a quad-core system, on which it achieved on average a 23% reduction in miss rate compared to an unpartitioned shared cache. ACCP also yields a similar IPC throughput improvement to the well-known UCP scheme, and better performance compared to the CPI-based scheme by Muralidhara et al. Overall, the throughput of the system is improved at minimal complexity without yielding significant additional hardware cost. Hence, ACCP shows better overall performance in managing the hardware overhead compared to the UCP scheme.

  • Research Article
  • 10.4028/www.scientific.net/kem.439-440.1587
Dynamic Partition of Shared Cache for Multi-Thread Application in Multi-Core System
  • Jun 1, 2010
  • Key Engineering Materials
  • Shuo Li + 1 more

In a chip multiprocessor with a shared cache structure, competing accesses from different applications degrade system performance and result in unpredictable execution time. Cache partitioning techniques can exclusively partition the shared cache among multiple competing applications. In this paper, the authors design the framework of Process Priority-based Multithread Cache Partitioning (PP-MCP), a dynamic shared cache partitioning mechanism to improve the performance of multi-threaded multi-programmed workloads. The framework includes a miss rate monitor, called the Application-oriented Miss Rate Monitor (AMRM), which dynamically collects miss rate information of multiple multi-threaded applications on different cache partitions, and a process priority-based weighted cache partitioning algorithm, which extends traditional miss-rate-oriented cache partitioning algorithms. The algorithm allocates cache in order of process priority, ensuring that the highest-priority process gets enough cache space, and applications with more threads tend to get more shared cache in order to improve the overall system performance. Experiments show that PP-MCP has better IPC throughput and weighted speedup. Specifically, for multi-threaded multi-programmed scientific computing workloads, PP-MCP-1 improves throughput by up to 20% and on average 10% over PP-MCP-0.
