Abstract

While the common trend in building large-scale multiprocessors is to use commodity compute nodes that are increasingly powerful and have deep memory hierarchies, the Tera MTA takes a different design point: a relatively flat memory system, no processor caches, and hardware support for lightweight multithreading, which is used to mask memory latency. In this paper we explore the implementation of Titanium, a language with coarse-grained SPMD parallelism, on the MTA. The major concerns in obtaining high performance on the MTA are a sufficient degree of parallelism, good load balance, and low synchronization overhead. We show that, with the addition of loop-level parallelism, Titanium applications have sufficient parallelism for the MTA and that, as expected, application writers do not need to orchestrate data layout. We evaluate multiple implementations of the Titanium synchronization constructs, which include barriers and monitors. We then explore several scheduling strategies and find that the distinction between SPMD and loop-level parallelism proves to be surprisingly useful. The two-level parallelism structure can be used to throttle thread migration, which lowers thread creation overhead and synchronization. We use a combination of micro-benchmarks and applications to demonstrate these results.
