Global address space, non-uniform bandwidth: a memory system performance characterization of parallel systems

T Stricker, T Gross

doi:10.1109/hpca.1997.569658

Abstract

Many parallel systems offer a simple view of memory: all storage cells are addressed uniformly. Despite this uniform view, the machines differ significantly in their memory system performance (and may offer slightly different consistency models). Cached and local memory accesses are much faster than remote reads of data generated by another processor or remote writes of data intentionally pushed to memories close to another processor. The bandwidth to and from cache and local memory can be an order of magnitude (or more) higher than the bandwidth to and from remote memory. The situation is further complicated by the heavy influence of the access pattern (i.e., the spatial locality of reference) on both local and remote memory system bandwidth. On these modern machines, a compiler for a parallel system has several options for carrying out a data transfer and must pick the most efficient one. Choosing the best option requires a cost-benefit model, obtained through an empirical evaluation of memory system performance. We evaluate three DEC Alpha-based parallel systems to demonstrate the practicality of this approach: the DEC 8400, the Cray T3D, and the Cray T3E. The common DEC Alpha processor architecture facilitates a direct comparison of memory system performance. The three systems differ in their clock speed, their scalability, and the amount of coherency they provide.
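To make the idea of a cost-benefit model concrete, the following is a minimal sketch of how measured bandwidths could drive the choice between transfer options. It is an illustration only, not the model from the paper: the option names, the startup overheads, and all bandwidth figures are hypothetical placeholders, and the linear cost formula (fixed startup plus bytes divided by sustained bandwidth) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TransferOption:
    """One way to move data; every number stored here is a placeholder."""
    name: str
    startup_us: float      # assumed fixed overhead per transfer, in microseconds
    bandwidth_mb_s: dict   # access pattern -> assumed sustained MB/s


def transfer_time_us(option, nbytes, pattern):
    # With 1 MB = 10**6 bytes, bytes / (MB/s) yields microseconds.
    return option.startup_us + nbytes / option.bandwidth_mb_s[pattern]


def best_option(options, nbytes, pattern):
    # Pick the option with the lowest estimated transfer time.
    return min(options, key=lambda o: transfer_time_us(o, nbytes, pattern))


# Hypothetical table; real values would come from empirical measurement
# of the target machine for each access pattern.
OPTIONS = [
    TransferOption("remote_read_pull", 1.0,
                   {"contiguous": 100.0, "strided": 25.0}),
    TransferOption("remote_write_push", 2.0,
                   {"contiguous": 200.0, "strided": 20.0}),
]

if __name__ == "__main__":
    for pattern in ("contiguous", "strided"):
        choice = best_option(OPTIONS, 8000, pattern)
        print(pattern, "->", choice.name)
```

With these placeholder numbers the preferred option changes with both the access pattern and the transfer size, which is the kind of decision the abstract attributes to the compiler.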
