Abstract

This paper presents a novel energy-efficient cache design for massively parallel, throughput-oriented architectures such as GPUs. Unlike the L1 data cache on modern GPUs, the L2 cache shared by all streaming multiprocessors is not the primary performance bottleneck, but it does consume a large amount of chip energy. We observe that the L2 cache is significantly under-utilized, spending 95.6% of the time storing useless data. If this wasted time can be identified and reduced, L2’s energy efficiency can be drastically improved. Fortunately, we discover that the SIMT programming model of GPUs provides a unique feature among threads: instruction-level data locality similarity, which can be used to accurately predict data re-reference counts at the L2 cache block level. We propose a simple design that leverages this Locality Similarity to build an energy-efficient GPU L2 Cache, named LoSCache. Specifically, LoSCache uses the data locality information from a small group of CTAs to dynamically predict the L2-level data re-reference counts of the remaining CTAs. Specific L2 cache lines can then be powered off once they are predicted to be dead after a certain number of accesses. Experimental results on a wide range of applications demonstrate that the proposed design reduces L2 cache energy by 64% on average with only a 0.5% performance loss.
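The prediction idea summarized in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class names, the use of the instruction address (`pc`) as the lookup key, the choice of the maximum sampled count as the prediction, and the convention that a count includes the initial fill are all hypothetical simplifications introduced here.

```python
from collections import defaultdict


class LocalitySimilarityPredictor:
    """Toy model (hypothetical): learn per-instruction L2 access counts
    from a small group of sampled CTAs and reuse them for the rest."""

    def __init__(self):
        # pc -> access counts observed for blocks touched by that instruction
        self._samples = defaultdict(list)

    def record_sample(self, pc, access_count):
        """Record how often a block fetched by instruction `pc` was accessed
        in a sampled CTA (assumption: the count includes the initial fill)."""
        self._samples[pc].append(access_count)

    def predict(self, pc):
        """Return the predicted access count, or None if `pc` was never sampled."""
        counts = self._samples.get(pc)
        if not counts:
            return None
        # Assumption: take the largest sampled count to avoid powering a
        # line off too early; the paper may use a different policy.
        return max(counts)


class CacheLine:
    """Toy L2 line that powers itself off after its predicted last access."""

    def __init__(self, predicted_accesses):
        self.remaining = predicted_accesses  # None means "no prediction"
        self.powered = True

    def access(self):
        if self.remaining is None:
            return  # no prediction available: keep the line powered
        self.remaining -= 1
        if self.remaining <= 0:
            self.powered = False  # predicted dead: gate the line


# Usage: sampled CTAs saw 2 and 3 accesses for the load at address 0x40.
predictor = LocalitySimilarityPredictor()
predictor.record_sample(0x40, 2)
predictor.record_sample(0x40, 3)

line = CacheLine(predictor.predict(0x40))
for _ in range(3):
    line.access()
# The line is now predicted dead and is modeled as powered off.
```

Taking the maximum of the sampled counts is a conservative choice in this sketch: it trades some energy savings for a lower risk of powering off a line that is still needed, which is the kind of misprediction that would cost performance.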
