Abstract

Recent work on software pipelining in the presence of uncertain memory latencies has shown that using compiler-generated cache-reuse analysis to determine appropriate load latencies can improve performance significantly [14, 19, 9]. Even with reuse information, references with a stride-one access pattern in the cache (called self-spatial loads) have been treated as all cache hits or all cache misses, rather than as a single cache miss followed by a few cache hits to the rest of the cache line. In this paper, we show how hardware support for loading two consecutive cache lines with one instruction (called a prefetching load), when directed by the compiler, can significantly improve software pipelining for scientific program loops. On a set of 79 Fortran loops, using prefetching loads gave an average performance improvement of 7% over assuming that all self-spatial loads are cache misses (assuming all hits often gives worse performance than assuming all misses [14]). In addition, prefetching loads reduced floating-point register pressure by 31% and integer register pressure by 20%. As a result, we were able to software pipeline 31% more loops within modern register constraints (32 integer/32 floating-point registers) with prefetching loads. These results show that specialized prefetching load instructions have considerable potential to improve software pipelining for array-based scientific codes.
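To make the self-spatial pattern concrete, the following is a minimal C sketch; the array name, the 8-byte element size, and the 32-byte cache line are illustrative assumptions, not parameters taken from the paper.

    /* Hypothetical stride-one (self-spatial) loop.  Assuming 8-byte
     * doubles and a 32-byte cache line, a[i..i+3] share one line: the
     * first reference misses and the next three hit.  A compiler-
     * directed prefetching load issued for a[i] would also fetch the
     * next line, hiding the miss that would otherwise occur at a[i+4]. */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];    /* self-spatial load */

Treating such a reference as always missing overstates its latency and inflates register pressure in the software-pipelined schedule, while treating it as always hitting understates the latency of the leading miss; the prefetching load lets the schedule assume hit latency after the first access.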
