Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with the execution-time fairness. Execution-time fairness is defined as how uniform the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Secondly, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4/spl times/, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
- Research Article
3
- 10.1109/tpds.2007.70723
- Aug 1, 2007
- IEEE Transactions on Parallel and Distributed Systems
CHIP multiprocessor (CMP) architectures are formed when multiple compute cores are integrated onto the same chip, forming a single, powerful, computational entity. Nearly every major high-performance processor manufacturer has at least two cores (dual-core) on the die, and their roadmaps are increasingly multicore, signaling that the era of big, monolithic uniprocessors has ended. This results from the fact that ever-larger uniprocessors do not scale well in power/performance, area/performance, or design complexity/performance. Continued performance scaling of these processors will thus be focused primarily on increasing multithreaded throughput. The rapid adoption of small-scale CMP platforms and the quest for high performance continues to accelerate the rate at which processor manufacturers are considering adding more cores on the die. Over the last decade, there has been significant progress in research and development in both academia and industry on CMP architecture and design for client and server platforms. And, while we have successfully entered the era of CMP, there are a significant set of challenges and opportunities that are yet to be investigated deeply. Some of the broad research areas being investigated include CMP architecture alternatives (for core, cache, interconnect, and memory), CMP design and technologies (process implications, new technologies like 3D-stacking, voltage/clock domain management, etc.), CMP performance evaluation (new simulation and modeling techniques, emerging applications and execution environments like virtualization), and novel CMP architectures and use cases (asymmetric or heterogeneous architectures, accelerators, etc.). There are many questions that are still to be answered for CMP architectures. Below, we list a few of the most compelling ones.
- Research Article
6
- 10.1007/s11227-016-1665-3
- Feb 18, 2016
- The Journal of Supercomputing
Integrating the large number of transistor in a single chip leads to significant improvement on the performance of processors. More performance is achieved by putting multiple CPU cores on a single chip which is named as chip multiprocessor (CMP) architecture. On the other hand, miniaturization and integration of the large number of transistors in new silicons such as CMPs increase susceptibility to soft errors and degrade the reliability. Previous researches have exploited traditional redundancy techniques such as dual and triple cores redundancy to tolerate fault in CMP architecture while these methods impose significant performance and energy overheads. In this paper, we present a performance efficient soft error protection scheme for CMP architecture which is based on simultaneous multithreading. Fortunately, some of soft errors are masked at architectural level and don't cause visible output error. Soft error masking effect can be used to decrease a lot of overheads in reliability enhancement techniques against soft errors. Recently, architectural vulnerability factor (AVF) is widely used for estimating the portion of soft errors which are masked. In this article, we propose a reliability aware CMP architecture which use online AVF estimation to specify level of protection. To meet system reliability demands, the estimated AVF is used to exploit partial redundancy against soft errors which leads to significant performance improvement. Also, we introduce a dynamic scheduling method for mapping threads on the cores to enhance total throughput of CMP architecture. Our dynamic scheduling applies thread migration among cores by simultaneous considering to the total vulnerability and throughput of cores. Thread migration between cores balances loads between cores and improves performance. Our experimental results on SPEC CPU2006 show up to 38 % improvement in core throughput in different phases of thread migration compared to static mapping of threads on the cores.
- Conference Article
14
- 10.1145/2483028.2483052
- May 2, 2013
In recent years, NVM (non-volatile memory) technologies, such as STT-RAM (spin transfer torque RAM) and PRAM (phase change RAM), have drawn a lot of attention due to their low leakage and high density. However, both NVMs suffer from high write latency and limited endurance problems. To overcome these problems, the SRAM/NVM hybrid cache architecture has been proposed, and the write pressure on NVM can be mitigated with appropriate write management policy. Moreover, many wear leveling techniques have been proposed to extend the lifetime of NVM in the hybrid cache. In this paper, we proposed a hybrid cache design that includes SRAM cache, STT-RAM cache, and STT-RAM/SRAM hybrid cache banks for CMP (chip multi-processors) architecture. We also propose a partition-level wear leveling scheme and access-aware policies to mitigate unbalanced wear-out of STT-RAM lines within a partition and among different cache partitions. Experimental results show that, our proposed scheme and policies can achieve an average of 89 times improvement in cache lifetime and are able to save 58% power consumption compared to SRAM cache.
- Book Chapter
3
- 10.1007/11859802_20
- Jan 1, 2006
The Data-Driven Multithreading Chip Multiprocessor (DDM-CMP) architecture has been shown to overcome the power and memory wall limitations by combining two key technologies: the use of the Data-Driven Multithreading (DDM) model of execution, and the Chip-Multiprocessor architecture. DDM is able to hide memory and synchronization latencies providing significant performance gains whereas the use of of the CMP architecture offers high-degree of parallelism at low complexity design and is therefore power efficient. This paper presents the hardware budget analysis and the runtime support system for the DDM-CMP architecture. The hardware analysis shows that the DDM benefits may be achieved with only a 17% hardware cost increase compared to a traditional chip-multiprocessor implementation. The support for the runtime system was designed in such a way that allows the DDM applications to execute on the DDM-CMP chip using a regular, non-modified, Operating System and CPU cores.
- Conference Article
4
- 10.1109/iwia.2004.10020
- Jan 12, 2004
Chip multiprocessor (CMP) architecture has attracting much attention as a next-generation microprocessor architecture and many kinds of CMP are widely being researched. However, CMP architectures several difficulties for effective use of memory, especially cache or local memory near a processor core. The authors have proposed OSCAR CMP architecture, which cooperatively works with multigrain parallelizing compiler which gives us much higher parallelism than instruction level parallelism or loop level parallelism and high productivity of application programs. To support the compiler optimization for effective use of cache or local memory, OSCAR CMP has local data memory (LDM) for processor private data and distributed shared memory (DSM) for synchronization and fine grain data transfers among processors, in addition to centralized shared memory (CSM) to support dynamic task scheduling. This paper proposes a static coarse grain task scheduling scheme for data localization using live variable analysis. Furthermore, remote memory data transfer scheduling scheme using information of live variable analysis is also described. The proposed scheme is implemented on OSCAR FORTRAN multigrain parallelizing compiler and is evaluated on OSCAR CMP using Tomcatv and Swim in SPEC CFP 95 benchmark
- Conference Article
14
- 10.5555/1356802.1356817
- Jan 21, 2008
We address performance maximization of independent task sets under energy constraint on chip multi-processor (CMP) architectures that support multiple voltage/frequency operating states for each core. We prove that the problem is strongly NP-hard. We propose polynomial time 2-approximation algorithms for homogeneous and heterogeneous CMPs. To the best of our knowledge, our techniques offer the tightest bounds for energy constrained design on CMP architectures. Experimental results demonstrate that our techniques are effective and efficient under various workloads on several CMP architectures.
- Conference Article
13
- 10.1109/aspdac.2008.4484026
- Jan 1, 2008
We address performance maximization of independent task sets under energy constraint on chip multi-processor (CMP) architectures that support multiple voltage/frequency operating states for each core. We prove that the problem is strongly NP-hard. We propose polynomial time 2-approximation algorithms for homogeneous and heterogeneous CMPs. To the best of our knowledge, our techniques offer the tightest bounds for energy constrained design on CMP architectures. Experimental results demonstrate that our techniques are effective and efficient under various workloads on several CMP architectures.
- Conference Article
2
- 10.1109/icect.2009.74
- Feb 1, 2009
Chip MultiProcessor (CMP) has been the main stream in microprocessor design. Shared on-chip L2 caches are widely used in processors with homogeneous CMP architecture. In the paper, we propose a scheduling algorithm for such processors and the shared L2 caches are taken into account in this algorithm. First, the processor cores on chip will be divided into different core groups. The scheduling domain is also constructed according to these core groups. And then the load vectors for load balance are defined. Then a scheduling algorithm is designed and implemented for load balance on CMP architecture. We have compared our algorithm with the CMP scheduling algorithm of Linux. The experimental results show that, when there are multi-threads in execution, the load balancing between processors is achieved by our algorithm, the total execution time is reduced by 3%, and the miss rate of L2 cache is reduced by 0.2% as well.
- Research Article
9
- 10.1109/tpds.2011.130
- Feb 1, 2012
- IEEE Transactions on Parallel and Distributed Systems
Cache sharing on modern Chip Multiprocessors (CMPs) reduces communication latency among corunning threads, and also causes interthread cache contention. Most previous studies on the influence of cache sharing have concentrated on the design or management of shared cache. The observed influence is often constrained by the reliance on simulators, the use of out-of-date benchmarks, or the limited coverage of deciding factors. This paper describes a systematic measurement of the influence with most of the potentially important factors covered. The measurement shows some surprising results. Contrary to commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions in the PARSEC benchmark suite, regardless of the types of parallelism, input data sets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch between the software design (and compilation) of multithreaded applications and CMP architectures. By performing source code transformations on the programs in a cache-sharing-aware manner, we observe up to 53 percent performance increase when the threads are placed on cores appropriately, confirming the software-hardware mismatch as a main reason for the observed insignificance of the influence from cache sharing, and indicating the important role of cache-sharing-aware transformations-a topic only sporadically studied so far-for exerting the power of shared cache.
- Conference Article
- 10.1109/iccrd.2011.5763842
- Mar 1, 2011
On a CMP (Chip Multi-Processor) architecture, cache sharing impacts threads non-uniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as throughput decreasing, cache thrashing. This paper proposes an architectural support predicting method (ASPM) to predict inter-thread cache contention, and schedules threads based on the results of the predicting on the CMP architecture. The architectural support consists of novel tagging which is remembering each cache line's interval time of reaccessing. The ASPM is very little loss of accuracy basis at low overhead (<;1%). We use this method for CMP scheduling, and find the performance is improving 9% on average.
- Conference Article
3
- 10.1109/siphotonics.2015.10
- Jan 1, 2015
We present an optical bus-based Chip Multiprocessor architecture where the processing cores share an optical single-level cache unit. Physically, the optical cache is implemented externally in a separate chip located next to the CPU die. The cache interconnection system is realized through WDM optical interfaces that connect the shared cache module with the processing cores and the Main Memory via spatial-multiplexed optical waveguides; hence, the CPU-DRAM communication completely takes place in the optical domain. To evaluate the shared optical cache approach, we carry out system-level simulations of 6 realistic processor parallel workloads via the Gem5 platform. The optical cache architecture is compared against the conventional electronic Chip Multiprocessor topology that uses dedicated on-chip L1 electronic caches and a shared L2 cache. The results show significant reduction in the L1 miss rate of up to 96% for certain cases; on average, a performance speed-up of up to 20.53% or a reduction of up to 65.8% in cache capacity requirements is attained. Combined with high-bandwidth CPU-DRAM bus solutions based on optical interconnects, the proposed design is a quite promising system architecture that bridges the gap between high-speed optically connected CPU-DRAM schemes and high-speed optical memory technologies. © 2015 IEEE.
- Conference Article
10
- 10.1145/1509084.1509088
- Oct 26, 2008
This paper discusses the design of a chip multi vector processor (CMVP), especially examining the effects of an on-chip cache when the off-chip memory bandwidth is limited. As chip multiprocessors (CMPs) have become the mainstream in commodity scalar processors, the CMP architecture will be adopted to design of vector processors in the near future for harnessing a large number of transistors on a chip. To keep a higher sustained performance in execution of scientific and engineering applications, a vector processor (core) generally requires the ratio of the memory bandwidth to the arithmetic performance of at least 4 bytes/flop (B/FLOP). However, vector supercomputers have been encountering the memory wall problem due to the limited pin bandwidth. Therefore, we propose an on-chip shared cache to maintain the effective memory bandwidth for a CMVP. We evaluate the performance of the CMVP based on the NEC SX vector architecture using real scientific applications. Especially, we examine the caching effect on the sustained performance when the B/FLOP rate is decreased. The experimental results indicate that an 8 MB on-chip shared cache can improve the performance of a four-core CMVP by 15% to 40%, compared with that without the cache. This is because the shared cache can increase cache hit rates of multi-threads. Here, the shared cache employs a miss status handling registers, which has the potential for accelerating difference schemes in scientific and engineering applications. Moreover, we show that the 2 B/FLOP is enough for the CMVP to achieve a high scalability when the on-chip cache is employed.
- Research Article
4
- 10.1109/tc.2021.3099688
- Jul 1, 2022
- IEEE Transactions on Computers
Real-time inference of deep learning models on embedded and energy-efficient devices becomes increasingly desirable with the rapid growth of artificial intelligence on edge. Specifically, to achieve superb energy-efficiency and scalability, efficient parallelization of single-pass deep neural network (DNN) inference on chip multiprocessor (CMP) architectures is urgently required by many time-sensitive applications. However, as the number of processing cores scales up and the performance of cores has grown much fast, the on-chip inter-core data movement is prone to be a performance bottleneck for computation. To remedy this problem and further improve the performance of network inference, in this work, we introduce a communication-aware DNN parallelization technique called CAP, by exploiting the elasticity and noise-tolerance of deep learning algorithms on CMP. Moreover, in the hope that the conducted studies can provide new design values for real-time neural network inference on embedded chips, we also have evaluated the proposed approach on both multi-core Neural Network Accelerators (NNA) chips and general-purpose chip-multiprocessors. Our experimental results show that the proposed CAP can achieve 1.12×-1.65× system speedups and 1.14×-2.70× energy efficiency for different neural networks while maintaining the inference accuracy, compared to baseline approaches.
- Conference Article
1
- 10.1109/iccsn.2011.6013973
- May 1, 2011
On a Chip Multi-Processor (CMP) architecture, cache sharing impacts threads non-uniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as throughput decreasing, cache thrashing. This paper proposes a new predicting inter-thread cache contention model, FOM (Frequency of Miss), and schedules threads based on the results of FOM on the CMP architecture. The input to our model is the L2 cache misses number of each thread. The output of the model is the extra L2 cache misses for each thread due to cache sharing. We use the output of the model to guide scheduling. We use the multi2sim simulator to compare the throughput of FOA (Frequency of Access) with our FOM, and find our method improving performance up to 13%.
- Conference Article
21
- 10.1109/hoti.2006.20
- Aug 23, 2006
Chip multiprocessor (CMP) architectures are fast becoming the dominant design for general purpose processors. Whereas current generation server and desktop processors use homogenous CMP architectures, network processors (NPs) have used heterogeneous CMP architectures for years. At the same time, the failure of network stacks in traditional processors to scale with increased network bandwidths has spawned numerous proposals for new approaches to accelerate network processing. This paper looks at moving network stack processing from the main CPU to a series of smaller, closely coupled, and more efficient processors in a heterogeneous CMP by implementing such an architecture on an Intel IXP network processor. Our experiments show that the close coupling and flexible nature of the IXP?s microengines allow them to greatly accelerate network processing for a small cost in area.