
The Significance of CMP Cache Sharing on Contemporary Multithreaded Applications

Abstract

Cache sharing on modern Chip Multiprocessors (CMPs) reduces communication latency among corunning threads, but also causes interthread cache contention. Most previous studies on the influence of cache sharing have concentrated on the design or management of shared cache. The observed influence is often constrained by the reliance on simulators, the use of out-of-date benchmarks, or the limited coverage of deciding factors. This paper describes a systematic measurement of the influence with most of the potentially important factors covered. The measurement shows some surprising results. Contrary to the commonly perceived importance of cache sharing, neither positive nor negative effects from cache sharing are significant for most of the program executions in the PARSEC benchmark suite, regardless of the types of parallelism, input data sets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the software design (and compilation) of multithreaded applications and CMP architectures. By performing source code transformations on the programs in a cache-sharing-aware manner, we observe up to 53 percent performance increase when the threads are placed on cores appropriately. This confirms the software-hardware mismatch as a main reason for the observed insignificance of the influence from cache sharing, and indicates the important role of cache-sharing-aware transformations (a topic only sporadically studied so far) in exerting the power of shared cache.
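To make "assignments of threads to cores" concrete, here is a minimal sketch of one possible cache-sharing-aware placement. It is not the paper's method: the function name `place_threads`, the greedy strategy, and the input shapes (groups of threads assumed to share data, and groups of cores assumed to share a cache) are all hypothetical.

```python
def place_threads(sharing_groups, cache_domains):
    """Greedily co-locate threads that share data on cores that share a cache.

    sharing_groups: list of lists of thread ids (hypothetical input; threads in
                    one group are assumed to access common data).
    cache_domains:  list of lists of core ids (cores in one list are assumed to
                    share a cache).
    Returns a dict mapping thread id -> core id.
    """
    free = [list(cores) for cores in cache_domains]  # copy so inputs stay intact
    placement = {}
    leftovers = []

    # Place larger groups first so they are more likely to find a whole domain.
    for group in sorted(sharing_groups, key=len, reverse=True):
        target = next((d for d in free if len(d) >= len(group)), None)
        if target is None:
            leftovers.extend(group)  # no single cache domain can hold this group
            continue
        for thread in group:
            placement[thread] = target.pop(0)

    # Threads that could not be co-located take whatever cores remain.
    spare = [core for domain in free for core in domain]
    if len(leftovers) > len(spare):
        raise ValueError("more threads than available cores")
    for thread, core in zip(leftovers, spare):
        placement[thread] = core
    return placement


if __name__ == "__main__":
    # Two dual-core cache domains; threads 0 and 2 share data, as do 1 and 3.
    print(place_threads([[0, 2], [1, 3]], [[0, 1], [2, 3]]))
```

On Linux, a mapping like this could then be applied per thread or process with an affinity call such as `os.sched_setaffinity`; whether the placement helps depends, as the abstract argues, on whether the program was written to exploit the shared cache.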

Similar Papers
  • Research Article
  • Cite Count Icon 15
  • 10.1145/1837853.1693482
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
  • Jan 9, 2010
  • ACM SIGPLAN Notices
  • Eddy Z Zhang + 2 more

Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.

  • Conference Article
  • Cite Count Icon 124
  • 10.1145/1693453.1693482
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
  • Jan 9, 2010
  • Eddy Z Zhang + 2 more

Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. The measurement shows some surprising results. Contrary to commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.

  • Book Chapter
  • Cite Count Icon 4
  • 10.1007/978-3-642-19595-2_7
Array Regrouping on CMP with Non-uniform Cache Sharing
  • Jan 1, 2011
  • Yunlian Jiang + 4 more

Array regrouping enhances program spatial locality by interleaving elements of multiple arrays that tend to be accessed closely. Its effectiveness has been systematically studied for sequential programs running on unicore processors, but not for multithreading programs on modern Chip Multiprocessor (CMP) machines. On one hand, the processor-level parallelism on CMP intensifies memory bandwidth pressure, suggesting the potential benefits of array regrouping for CMP computing. On the other hand, CMP architectures exhibit extra complexities (especially the hierarchical, heterogeneous cache sharing among hyperthreads, cores, and processors) that impose new challenges to array regrouping. In this work, we initiate an exploration of the new opportunities and challenges. We propose cache-sharing-aware reference affinity analysis for identifying data affinity in multithreading applications. The analysis consists of affinity-guided thread scheduling and hierarchical reference-vector merging, handles cache sharing among both hyperthreads and cores, and offers hints for array regrouping and the avoidance of false sharing. Preliminary experiments demonstrate the potential of the techniques in improving locality of multithreading applications on CMP with various pitfalls avoided. Keywords: Code Unit, Cache Line, Frequency Vector, Cache Sharing, Thread Schedule.
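The interleaving idea behind array regrouping can be sketched as follows. This is a minimal illustration, not the chapter's analysis: the helper names `regroup` and `lines_touched` and the cache-line size of four elements are assumptions, and Python lists do not control physical memory layout, so only the index arithmetic is meaningful here.

```python
def regroup(a, b):
    """Interleave two equal-length arrays as [a0, b0, a1, b1, ...]."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    merged = [None] * (2 * len(a))
    merged[0::2] = a  # a[i] lands at offset 2*i
    merged[1::2] = b  # b[i] lands at offset 2*i + 1
    return merged


def lines_touched(offsets, elems_per_line):
    """Count the distinct cache lines covered by a set of element offsets."""
    return len({offset // elems_per_line for offset in offsets})


if __name__ == "__main__":
    n, elems_per_line, i = 8, 4, 5  # illustrative sizes (assumed)

    # Separate layout: a occupies offsets 0..n-1, b occupies n..2n-1.
    separate = lines_touched([i, n + i], elems_per_line)

    # Regrouped layout: a[i] and b[i] sit next to each other.
    regrouped = lines_touched([2 * i, 2 * i + 1], elems_per_line)

    print(regroup([1, 2, 3], [10, 20, 30]))
    print("lines touched, separate vs regrouped:", separate, regrouped)
```

In this toy case, reading `a[5]` and `b[5]` together touches two lines in the separate layout and one in the regrouped layout; a full sequential sweep over both arrays would touch the same number of lines either way.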

  • Conference Article
  • Cite Count Icon 565
  • 10.5555/1025127.1026001
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
  • Sep 29, 2004
  • Seongbeom Kim + 2 more

This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with the execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Secondly, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
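The definition of execution-time fairness above can be turned into a small calculation. The sketch below is one possible reading, under stated assumptions: each thread's change is taken as the ratio of its co-scheduled time to its stand-alone time, and non-uniformity is summarized as the sum of pairwise absolute differences of those ratios. The function names and the choice of aggregate are illustrative and are not claimed to be one of the paper's five metrics.

```python
from itertools import combinations


def slowdowns(shared_times, alone_times):
    """Per-thread slowdown: time when co-scheduled / time when running alone."""
    if len(shared_times) != len(alone_times):
        raise ValueError("need one shared and one stand-alone time per thread")
    if any(t <= 0 for t in alone_times):
        raise ValueError("stand-alone times must be positive")
    return [shared / alone for shared, alone in zip(shared_times, alone_times)]


def unfairness(shared_times, alone_times):
    """Sum of pairwise slowdown differences (assumed aggregate); 0 = uniform."""
    ratios = slowdowns(shared_times, alone_times)
    return sum(abs(x - y) for x, y in combinations(ratios, 2))


if __name__ == "__main__":
    # Both threads slow down by 2x: the change is uniform.
    print(unfairness([2.0, 4.0], [1.0, 2.0]))
    # One thread slows down 2x, the other 3x: the change is non-uniform.
    print(unfairness([2.0, 3.0], [1.0, 1.0]))
```

A lower value means the co-scheduled threads were slowed down more evenly; the measure says nothing about how large the slowdown is, which is why the abstract treats fairness and throughput as separate questions.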

  • Conference Article
  • Cite Count Icon 3
  • 10.1109/siphotonics.2015.10
High-Speed Optical Cache Memory as Single-Level Shared Cache in Chip-Multiprocessor Architectures
  • Jan 1, 2015
  • Pavlos Maniotis + 3 more

We present an optical bus-based Chip Multiprocessor architecture where the processing cores share an optical single-level cache unit. Physically, the optical cache is implemented externally in a separate chip located next to the CPU die. The cache interconnection system is realized through WDM optical interfaces that connect the shared cache module with the processing cores and the Main Memory via spatial-multiplexed optical waveguides; hence, the CPU-DRAM communication completely takes place in the optical domain. To evaluate the shared optical cache approach, we carry out system-level simulations of 6 realistic processor parallel workloads via the Gem5 platform. The optical cache architecture is compared against the conventional electronic Chip Multiprocessor topology that uses dedicated on-chip L1 electronic caches and a shared L2 cache. The results show significant reduction in the L1 miss rate of up to 96% for certain cases; on average, a performance speed-up of up to 20.53% or a reduction of up to 65.8% in cache capacity requirements is attained. Combined with high-bandwidth CPU-DRAM bus solutions based on optical interconnects, the proposed design is a quite promising system architecture that bridges the gap between high-speed optically connected CPU-DRAM schemes and high-speed optical memory technologies. © 2015 IEEE.

  • Research Article
  • Cite Count Icon 3
  • 10.1109/tpds.2007.70723
Editorial: Special Section on CMP Architectures
  • Aug 1, 2007
  • IEEE Transactions on Parallel and Distributed Systems
  • Ravi Iyer + 1 more

Chip multiprocessor (CMP) architectures are formed when multiple compute cores are integrated onto the same chip, forming a single, powerful, computational entity. Nearly every major high-performance processor manufacturer has at least two cores (dual-core) on the die, and their roadmaps are increasingly multicore, signaling that the era of big, monolithic uniprocessors has ended. This results from the fact that ever-larger uniprocessors do not scale well in power/performance, area/performance, or design complexity/performance. Continued performance scaling of these processors will thus be focused primarily on increasing multithreaded throughput. The rapid adoption of small-scale CMP platforms and the quest for high performance continue to accelerate the rate at which processor manufacturers are considering adding more cores on the die. Over the last decade, there has been significant progress in research and development in both academia and industry on CMP architecture and design for client and server platforms. And, while we have successfully entered the era of CMP, there is a significant set of challenges and opportunities that are yet to be investigated deeply. Some of the broad research areas being investigated include CMP architecture alternatives (for core, cache, interconnect, and memory), CMP design and technologies (process implications, new technologies like 3D-stacking, voltage/clock domain management, etc.), CMP performance evaluation (new simulation and modeling techniques, emerging applications and execution environments like virtualization), and novel CMP architectures and use cases (asymmetric or heterogeneous architectures, accelerators, etc.). There are many questions that are still to be answered for CMP architectures. Below, we list a few of the most compelling ones.

  • Conference Article
  • Cite Count Icon 2
  • 10.1109/icect.2009.74
Load Balance Scheduling Algorithm for CMP Architecture
  • Feb 1, 2009
  • Qingsong Shi + 3 more

The Chip MultiProcessor (CMP) has become the mainstream in microprocessor design. Shared on-chip L2 caches are widely used in processors with homogeneous CMP architecture. In this paper, we propose a scheduling algorithm for such processors that takes the shared L2 caches into account. First, the processor cores on the chip are divided into different core groups, and the scheduling domain is constructed according to these core groups. Next, the load vectors for load balancing are defined. Then a scheduling algorithm is designed and implemented for load balancing on the CMP architecture. We have compared our algorithm with the CMP scheduling algorithm of Linux. The experimental results show that, when multiple threads are executing, our algorithm achieves load balancing between processors, reduces the total execution time by 3%, and reduces the L2 cache miss rate by 0.2%.

  • Conference Article
  • Cite Count Icon 10
  • 10.1145/1509084.1509088
A shared cache for a chip multi vector processor
  • Oct 26, 2008
  • Akihiro Musa + 6 more

This paper discusses the design of a chip multi vector processor (CMVP), especially examining the effects of an on-chip cache when the off-chip memory bandwidth is limited. As chip multiprocessors (CMPs) have become the mainstream in commodity scalar processors, the CMP architecture will be adopted in the design of vector processors in the near future to harness the large number of transistors on a chip. To sustain high performance when executing scientific and engineering applications, a vector processor (core) generally requires a ratio of memory bandwidth to arithmetic performance of at least 4 bytes/flop (B/FLOP). However, vector supercomputers have been encountering the memory wall problem due to the limited pin bandwidth. Therefore, we propose an on-chip shared cache to maintain the effective memory bandwidth for a CMVP. We evaluate the performance of the CMVP based on the NEC SX vector architecture using real scientific applications. In particular, we examine the caching effect on the sustained performance when the B/FLOP rate is decreased. The experimental results indicate that an 8 MB on-chip shared cache can improve the performance of a four-core CMVP by 15% to 40%, compared with that without the cache. This is because the shared cache can increase the cache hit rates of multiple threads. Here, the shared cache employs miss status handling registers, which have the potential for accelerating difference schemes in scientific and engineering applications. Moreover, we show that 2 B/FLOP is enough for the CMVP to achieve high scalability when the on-chip cache is employed.

  • Conference Article
  • Cite Count Icon 13
  • 10.1109/iccd.2008.4751875
Efficiency of thread-level speculation in SMT and CMP architectures - performance, power and thermal perspective
  • Oct 1, 2008
  • Venkatesan Packirisamy + 5 more

The computer industry has adopted multi-threaded and multi-core architectures as the clock rate increase stalled in the early 2000s. However, because of the lack of compilers and other related software technologies, most general-purpose applications today still cannot take advantage of such architectures to improve their performance. Thread-level speculation (TLS) has been proposed as a way of using these multi-threaded architectures to parallelize general-purpose applications. Both simultaneous multithreading (SMT) and chip multiprocessors (CMP) have been extended to implement TLS. While the characteristics of SMT and CMP have been widely studied under multi-programmed and parallel workloads, their behavior under TLS workloads is not well understood. Because of the speculative nature of its threads, which can potentially be rolled back, and the variable degree of parallelism available in applications, the TLS workload exhibits unique characteristics that make it different from other workloads. In this paper, we present a detailed study of the performance, power consumption and thermal effect of these multithreaded architectures against that of a superscalar architecture with equal chip area. A wide spectrum of design choices and tradeoffs is also studied using commonly used simulation techniques. We show that the SMT-based TLS architecture performs about 21% better than the best CMP-based configuration while it suffers about 16% power overhead. In terms of Energy-Delay-Squared product (ED²), SMT-based TLS performs about 26% better than the best CMP-based TLS configuration and 11% better than the superscalar architecture. But the SMT-based TLS configuration causes more thermal stress than the CMP-based TLS architectures.

  • Conference Article
  • Cite Count Icon 28
  • 10.1109/pact.2009.14
SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors
  • Sep 1, 2009
  • Lei Jin + 1 more

This paper proposes a new software-oriented approach for managing the distributed shared L2 caches of a chip multiprocessor (CMP) for latency-oriented multithreaded applications. The conventional shared cache scheme loses performance due to the blind distribution of data predominantly accessed by a single thread. SOS, our software-oriented distributed shared cache management approach, infers a program's data affinity hints through a novel machine learning based analysis of its L2 cache access behavior. The OS utilizes the hints to guide proper data placement in the L2 cache with page coloring. The derived hints are independent of the program input and can be used for multiple runs. By off-loading the cache management task onto software, SOS deviates substantially from previously proposed hardware based strategies and opens up a new opportunity for the CMP cache optimization. Our experimental results demonstrate that SOS is very effective in reducing the number of remote cache accesses. By using the hints for guiding page coloring alone, SOS achieves an average speedup of 10% and up to 23% over the shared cache scheme. When hints are used to direct data replication, SOS secures an additional performance gain of 9%, performing 19% better than the shared cache scheme on average.

  • Conference Article
  • Cite Count Icon 64
  • 10.5555/2523721.2523752
Jigsaw: scalable software-defined caches
  • Oct 7, 2013
  • Nathan Beckmann + 1 more

Shared last-level caches, widely used in chip-multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency. We present Jigsaw, a technique that jointly addresses the scalability and interference problems of shared caches. Hardware lets software define shares, collections of cache bank partitions that act as virtual caches, and map data to shares. Shares give software full control over both data placement and capacity allocation. Jigsaw implements efficient hardware support for share management, monitoring, and adaptation. We propose novel resource-management algorithms and use them to develop a system-level runtime that leverages Jigsaw to both maximize cache utilization and place data close to where it is used. We evaluate Jigsaw using extensive simulations of 16- and 64-core tiled CMPs. Jigsaw improves performance by up to 2.2x (18% avg) over a conventional shared cache, and significantly outperforms state-of-the-art NUCA and partitioning techniques.

  • Research Article
  • Cite Count Icon 6
  • 10.1007/s11227-016-1665-3
Reliability aware throughput management of chip multi-processor architecture via thread migration
  • Feb 18, 2016
  • The Journal of Supercomputing
  • Fatemeh Pouyan + 3 more

Integrating a large number of transistors in a single chip leads to significant improvement in processor performance. More performance is achieved by putting multiple CPU cores on a single chip, which is called a chip multiprocessor (CMP) architecture. On the other hand, miniaturization and the integration of a large number of transistors in new silicon such as CMPs increase susceptibility to soft errors and degrade reliability. Previous research has exploited traditional redundancy techniques such as dual- and triple-core redundancy to tolerate faults in CMP architectures, but these methods impose significant performance and energy overheads. In this paper, we present a performance-efficient soft error protection scheme for CMP architectures which is based on simultaneous multithreading. Fortunately, some soft errors are masked at the architectural level and do not cause visible output errors. This soft error masking effect can be used to reduce much of the overhead of reliability enhancement techniques against soft errors. Recently, the architectural vulnerability factor (AVF) has been widely used for estimating the portion of soft errors that are masked. In this article, we propose a reliability-aware CMP architecture which uses online AVF estimation to specify the level of protection. To meet system reliability demands, the estimated AVF is used to exploit partial redundancy against soft errors, which leads to significant performance improvement. Also, we introduce a dynamic scheduling method for mapping threads onto cores to enhance the total throughput of the CMP architecture. Our dynamic scheduling applies thread migration among cores by simultaneously considering the total vulnerability and throughput of cores. Thread migration between cores balances load and improves performance. Our experimental results on SPEC CPU2006 show up to 38% improvement in core throughput in different phases of thread migration compared to static mapping of threads on the cores.

  • Conference Article
  • Cite Count Icon 4
  • 10.1109/iwia.2004.10020
Memory Management for Data Localization on OSCAR Chip Multiprocessor
  • Jan 12, 2004
  • H Nakano + 3 more

Chip multiprocessor (CMP) architecture has been attracting much attention as a next-generation microprocessor architecture, and many kinds of CMP are being widely researched. However, CMP architectures face several difficulties in the effective use of memory, especially cache or local memory near a processor core. The authors have proposed the OSCAR CMP architecture, which works cooperatively with a multigrain parallelizing compiler that offers much higher parallelism than instruction-level or loop-level parallelism, as well as high productivity for application programs. To support compiler optimization for effective use of cache or local memory, OSCAR CMP has local data memory (LDM) for processor-private data and distributed shared memory (DSM) for synchronization and fine-grain data transfers among processors, in addition to centralized shared memory (CSM) to support dynamic task scheduling. This paper proposes a static coarse-grain task scheduling scheme for data localization using live variable analysis. Furthermore, a remote memory data transfer scheduling scheme using information from live variable analysis is also described. The proposed scheme is implemented on the OSCAR FORTRAN multigrain parallelizing compiler and is evaluated on OSCAR CMP using Tomcatv and Swim from the SPEC CFP 95 benchmark.

  • Book Chapter
  • Cite Count Icon 8
  • 10.1007/11796435_37
Preventing Denial-of-Service Attacks in Shared CMP Caches
  • Jan 1, 2006
  • Georgios Keramidas + 4 more

Denial-of-Service (DoS) attacks try to exhaust some shared resources (e.g. process tables, functional units) of a service-centric provider. As Chip Multi-Processors (CMPs) are becoming mainstream architecture for server class processors, the need to manage on-chip resources in a way that can provide QoS guarantees becomes a necessity. Shared resources in CMPs typically include L2 cache memory. In this paper, we explore the problem of managing the on-chip shared caches in a CMP workstation where malicious threads or just cache “hungry” threads try to hog the cache giving rise to DoS opportunities. An important characteristic of our method is that there is no need to distinguish between malicious and “healthy” threads. The proposed methodology is based on a statistical model of a shared cache that can be fed with run-time information and accurately describe the behavior of the shared threads. Using this information, we are able to understand which thread (malicious or not) can be “compressed” into less space with negligible damage and to drive accordingly the underlying replacement policy of the cache. Our results show that the proposed attack-resistant replacement algorithm can be used to enforce high-level policies such as policies that try to maximize the “usefulness” of the cache real estate or assign custom space-allocation policies based on external QoS needs.

  • Research Article
  • Cite Count Icon 298
  • 10.1145/1080695.1070001
Optimizing Replication, Communication, and Capacity Allocation in CMPs
  • May 1, 2005
  • ACM SIGARCH Computer Architecture News
  • Zeshan Chishti + 2 more

Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
