Abstract

Shared-memory architectures have become predominant in modern multi-core microprocessors across all market segments, from embedded to high-performance computing. Correctness of these architectures is ensured by means of coherence protocols and consistency models. The performance and scalability of shared-memory systems are usually limited by the number and size of the messages used to keep the memory subsystem coherent. Moreover, we believe that blindly maintaining coherence for all memory accesses can be counterproductive, since it incurs unnecessary overhead for data that would remain coherent without it. With this in mind, in this paper we propose the use of dedicated caches for private (plus shared read-only) and shared data. The private cache (L1P) is independent for each core, while the shared cache (L1S) is logically shared but physically distributed among all cores. This separation should allow us to simplify the coherence protocol, reduce on-chip area requirements, and reduce invalidation time with minimal impact on performance. The dedicated cache design requires a classification mechanism to detect private and shared data. In our evaluation we use a classification mechanism that operates at the operating system (OS) level, i.e., at page granularity. Results show two drawbacks to this approach: first, the selected classification mechanism produces too many false positives, becoming an important limiting factor; second, a traditional interconnection network is not optimal for accessing the L1S, and a custom network design is needed. These drawbacks lead to significant performance degradation due to the additional latency incurred when accessing shared data.
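The OS-level classification mentioned above can be sketched as a first-touch scheme at page granularity: a page is considered private to the first core that accesses it and is promoted to shared once a second core touches it. The following is a minimal illustrative sketch under that assumption (class and method names are hypothetical, not taken from the paper); it also makes the false-positive problem visible, since a single line accessed by two cores reclassifies the whole page.

```python
class PageClassifier:
    """First-touch, page-granularity private/shared classification sketch."""

    def __init__(self):
        self.owner = {}     # page -> first core that accessed it
        self.shared = set() # pages promoted to shared

    def access(self, page, core):
        """Record an access and return the page's classification."""
        if page in self.shared:
            return "shared"
        if page not in self.owner:
            self.owner[page] = core  # first touch: page is private to this core
            return "private"
        if self.owner[page] == core:
            return "private"
        # A second core touched the page: promote the WHOLE page to shared.
        # This is where false positives arise, since every line in the page
        # is now treated as shared even if only one line is actually shared.
        self.shared.add(page)
        return "shared"
```

For example, a page first accessed by core 0 is classified private, but a single access by core 1 reclassifies it (and all its lines) as shared for the rest of execution.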
