Abstract

In current Chip Multi-Processor (CMP) systems, data sharing in the cache hierarchy is a critical issue that costs many clock cycles to maintain data coherence. As the number of integrated cores increases, the single shared cache must serve too many processing threads to handle shared data efficiently. In this work, an enhanced router network is integrated within the private cache level to rapidly interconnect the shared data accesses of different threads. By experimental pattern analysis, all shared data in the private cache level can be classified into seven access types. Both shared accesses and thread-crossed accesses can then be rapidly detected and handled in the proposed router network. As a result, the access latency of the private cache is decreased, and the conventional coherence-traffic problem is alleviated. The proposed path processes an access in three steps. First, the target access is detected by exploration in the router network. Second, the proposed replacement logic handles the access to maintain data coherence. Finally, the access is forwarded by the proposed data deliverer. Thus, harmful data-sharing accesses are resolved within the first chip layer of the 3D-IC structure. The proposed system is implemented in a cycle-precise simulation platform, and experimental results show that our model can improve the Instructions Per Cycle (IPC) of on-chip execution by up to 31.85 percent while saving about 17.61 percent of energy consumption compared to the base system.
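
The three-step path can be illustrated with a minimal sketch. The C++ below models how an access flows through detection, coherence-maintaining replacement, and delivery. All names here (AccessKind, RouterNetwork, detect, resolve, deliver) and the directory-style owner lookup are illustrative assumptions rather than the paper's actual hardware design, and the seven enum labels are only placeholders for the paper's seven experimentally derived access types.

    // Hypothetical sketch of the three-step path: detect, resolve, deliver.
    // Names and data structures are assumptions, not the paper's design.
    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Placeholder labels for the paper's seven access types.
    enum class AccessKind { Private, SharedRead, SharedWrite,
                            ThreadCrossedRead, ThreadCrossedWrite,
                            Migratory, ProducerConsumer };

    struct Access {
        uint64_t addr;
        int      thread;
        bool     isWrite;
    };

    class RouterNetwork {
    public:
        // Step 1: detect shared / thread-crossed accesses by checking which
        // threads have recently touched the same line (hypothetical directory).
        std::optional<AccessKind> detect(const Access& a) {
            auto& owners = lineOwners_[a.addr >> 6];   // 64-byte line granularity
            bool crossed = !owners.empty() && owners.count(a.thread) == 0;
            owners.insert(a.thread);
            if (!crossed) return std::nullopt;         // purely private: fast path
            return a.isWrite ? AccessKind::ThreadCrossedWrite
                             : AccessKind::ThreadCrossedRead;
        }

        // Step 2: replacement logic keeps the line coherent, e.g. by dropping
        // stale private copies before the data is forwarded (sketch only).
        void resolve(const Access& a, AccessKind kind) {
            if (kind == AccessKind::ThreadCrossedWrite)
                lineOwners_[a.addr >> 6] = {a.thread};
        }

        // Step 3: the data deliverer forwards the line to the requesting
        // thread without a round trip to the shared cache.
        void deliver(const Access& a) {
            std::cout << "deliver line 0x" << std::hex << (a.addr >> 6)
                      << " to thread " << std::dec << a.thread << '\n';
        }

    private:
        std::unordered_map<uint64_t, std::unordered_set<int>> lineOwners_;
    };

    int main() {
        RouterNetwork net;
        std::vector<Access> trace = {{0x1000, 0, true},   // thread 0 writes
                                     {0x1000, 1, false}}; // thread 1 reads: crossed
        for (const auto& a : trace)
            if (auto kind = net.detect(a)) {
                net.resolve(a, *kind);
                net.deliver(a);
            }
    }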

Highlights

  • In modern Chip Multi-Processor (CMP) systems, cache hierarchies have grown too large to support highly parallel computing as the number of cores stacked in one chip increases rapidly

  • For the different benchmarks executed in the base system, throughput ranges from 43.5 Gbps to 120.7 Gbps, with an average of about 77.1 Gbps across all criteria

  • Once the proposed router network is applied to the simulation platform, throughput improves greatly in all benchmarks: the average is about 100.5 Gbps with 16 history buffer entries allocated, and with 32 entries it peaks at 117.9 Gbps, a 52.92 percent improvement over the base system (a quick arithmetic check follows this list)
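
The 52.92 percent figure in the last highlight can be verified from the reported averages, assuming it compares the 32-entry peak against the base-system average:

    (117.9 - 77.1) / 77.1 ≈ 0.5292, i.e. about 52.92 percent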

Introduction

In modern Chip Multi-Processor (CMP) systems, cache hierarchies have grown too large to support highly parallel computing as the number of cores stacked in one chip increases rapidly. Since those cores and caches crowd into a single-chip area, the data flows of many concurrent threads become too complicated to achieve the desired throughput [1]. Recently proposed stacked on-chip architectures can work efficiently by allocating cores and cache hierarchies to different layers and linking on-chip components in adjacent layers through Through Silicon Vias (TSVs) [3], thereby decreasing the physical routing distance and latency.
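
The routing-distance benefit of stacking can be illustrated with a generic back-of-envelope sketch (not from the paper): for 64 nodes, a flat 8x8 mesh has a higher average Manhattan hop distance than the same nodes folded into a 4x4x4 stack whose layers are linked by TSVs. The grid sizes and the uniform-traffic assumption below are illustrative only.

    // Hypothetical comparison of average hop distance: 2D mesh vs 3D stack.
    #include <cstdio>
    #include <cstdlib>

    // Average Manhattan hop distance between two uniformly random nodes
    // in an X x Y x Z mesh (Z = 1 gives a planar mesh).
    double avgHops(int X, int Y, int Z) {
        double sum = 0;
        long pairs = 0;
        int n = X * Y * Z;
        for (int a = 0; a < n; ++a)
            for (int b = 0; b < n; ++b) {
                int ax = a % X, ay = (a / X) % Y, az = a / (X * Y);
                int bx = b % X, by = (b / X) % Y, bz = b / (X * Y);
                sum += std::abs(ax - bx) + std::abs(ay - by) + std::abs(az - bz);
                ++pairs;
            }
        return sum / pairs;
    }

    int main() {
        // 64 nodes: flat 8x8 mesh vs 4x4x4 stack linked by TSVs.
        std::printf("2D 8x8   : %.2f hops\n", avgHops(8, 8, 1));
        std::printf("3D 4x4x4 : %.2f hops\n", avgHops(4, 4, 4));
    }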
