Abstract

The future chip-multiprocessors (CMPs) with a large number of cores faces difficult issues in efficient utilizing on-chip storage space. Tradeoffs between data accessibility and effective on-chip capacity have been studied extensively. It requires costly simulations to understand a wide-spectrum of design spaces. In this paper, we first develop an abstract model for understanding the performance impact with respect to the degree of data replication. To overcome the lack of real-time interactions among multiple cores in the abstract model, we propose an efficient single-pass stack simulation method to study the performance of a variety of cache organizations on CMPs. The proposed global stack logically incorporates a shared stack and per-core private stacks to collect shared/private reuse (stack) distances for every memory reference in a single simulation pass. With the collected reuse distances, performance in terms of hits/misses and average memory access times can be calculated for multiple cache organizations. The basic stack simulation results can further derive other CMP cache organizations with various degrees of data replication. We verify both the modeling and the stack results against individual execution-driven simulations that consider realistic cache parameters and delays using a set of commercial multithreaded workloads. We also compare the simulation time saving with the stack simulation. The results show that stack simulation can accurately model the performance of various studied cache organizations with 2-9% error margins using only about 8% of the simulation time. The results also show that the effectiveness of various techniques for optimizing the CMP on-chip storage is closely related to the working sets of the workloads as well as the total cache sizes

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call