Abstract

Hardware prefetching has proven very effective for hiding memory latency and can speed up some applications by more than 40%. However, this speedup comes at the cost of often prefetching a significant volume of useless data which wastes shared last level cache space and off-chip bandwidth. This directly impacts the performance of co-scheduled applications which compete for shared resources in multicores. This paper explores how a resource-efficient prefetching scheme can benefit performance by conserving shared resources in multicores. We present a framework that uses fast cache modeling to accurately identify memory instructions that benefit most from prefetching. The framework inserts software prefetches in the application only when they benefit performance, and employs cache bypassing whenever possible. These properties help reduce off-chip bandwidth consumption and last-level cache pollution. While single-thread performance remains comparable to hardware prefetching, the full advantage of the scheme is realized when several cores are used and demand for shared resources grows.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call