Abstract

As the number of streaming multiprocessors (SMs) in GPUs grows to deliver higher performance, the reply network carries increasingly heavy traffic, congesting Network-on-Chip (NoC) routers and memory controller (MC) buffers. Because cooperative thread arrays (CTAs) are scheduled locally within clusters, there is a high probability of finding a copy of the requested data in the L1 cache of another SM in the same cluster. Exploiting this locality requires that each SM be able to access the local L1 caches of its neighboring SMs. GPUs suffer considerable NoC congestion due to their distinctive many-to-few-to-many traffic pattern. By reducing the number of requests sent over the NoC, our proposed Intra-Cluster Locality-Aware (ICLA) unit reshapes this congested reply traffic into a many-to-many pattern, and the replied data travels over the less-utilized core-to-core links, mitigating NoC traffic. The proposed architecture has been evaluated using 15 workloads from the CUDA SDK, Rodinia, and ISPASS2009 benchmark suites, with the ICLA unit modeled and simulated in GPGPU-Sim. The results show a 23.79% average reduction in network latency (up to 49.82%), a 15.49% average reduction in L2 cache accesses (up to 36.82%), and an 18.18% average improvement in instructions per cycle (IPC, up to 58.1%).
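To make the mechanism concrete, the following is a minimal C++ sketch, not the authors' implementation, of the lookup decision an ICLA-style unit could make on an L1 miss: probe the L1 caches of the other SMs in the same cluster before issuing a request over the NoC toward L2 and the memory controller. All names here (SM, icla_lookup, CLUSTER_SIZE, Service) are hypothetical illustrations; GPGPU-Sim's actual interfaces differ.

// Hypothetical sketch of an ICLA-style L1-miss lookup; not the paper's code.
#include <array>
#include <cstdint>
#include <iostream>
#include <unordered_set>

constexpr int CLUSTER_SIZE = 4;  // assumed number of SMs per cluster

// Each SM's L1 is modeled simply as a set of cached block addresses.
struct SM {
    std::unordered_set<uint64_t> l1_blocks;
    bool l1_holds(uint64_t block) const { return l1_blocks.count(block) != 0; }
};

enum class Service { LocalL1, NeighborL1, NoC_L2 };

// ICLA-style lookup: try the local L1, then the other SMs in the same
// cluster (served over core-to-core links), and only then fall back to
// the NoC request network toward L2 / the memory controller.
Service icla_lookup(const std::array<SM, CLUSTER_SIZE>& cluster,
                    int requester, uint64_t block) {
    if (cluster[requester].l1_holds(block)) return Service::LocalL1;
    for (int sm = 0; sm < CLUSTER_SIZE; ++sm) {
        if (sm != requester && cluster[sm].l1_holds(block))
            return Service::NeighborL1;  // avoids a NoC round trip
    }
    return Service::NoC_L2;  // miss in the whole cluster
}

int main() {
    std::array<SM, CLUSTER_SIZE> cluster;
    cluster[2].l1_blocks.insert(0x1000);  // neighbor SM2 caches the block

    // SM0 misses locally but finds the block in SM2's L1.
    Service s = icla_lookup(cluster, /*requester=*/0, 0x1000);
    std::cout << (s == Service::NeighborL1 ? "served by neighbor L1\n"
                                           : "sent over NoC\n");
}

Under this model, every request resolved as NeighborL1 is one fewer request crossing the NoC to the few memory controllers, which is the source of the latency and L2-access reductions reported above.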
