Abstract

An efficient use of the memory system on multi-cores is critical to improving data locality and achieving better program performance. But the hierarchical memory system with caches often works in a “black-box” manner, which automatically moves data across memory layers, and makes code optimization a daunting task. In this article, we dissect the memory system of the Phytium 2000+ many-core with micro-benchmarks. We measure the latency and bandwidth of moving cachelines across memory levels on a single core or two distinct cores. We design a set of micro-benchmarks by using the pointer-chasing method to measure latency, and using the chunk-accessing method to measure bandwidth. During measurement, we have to place the cacheline on the specified memory layer and set its initial consistency state. The experimental results on Phytium 2000+ provide a quantified form of its actual memory performance, and reveal undocumented performance data and micro-architectural details. To conclude, our work will provide quantitative guidelines for optimizing the Phytium 2000+ memory accesses.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.