Large-scale Multiprocessors Research Articles

To design effective large-scale multiprocessors, designers need to understand the characteristics of the applications that will use the machines. Application characteristics of particular interest include the amount of communication relative to computation, the structure of the communication, and the local cache and memory requirements, as well as how these characteristics scale with larger problems and machines. One important class of applications is based on hierarchical N-body methods, which are used to solve a wide range of scientific and engineering problems efficiently. Important characteristics of these methods include the nonuniform and dynamically changing nature of the domains to which they are applied, and their use of long-range, irregular communication. This article examines the key architectural implications of representative applications that use the two dominant hierarchical N-body methods: the Barnes-Hut Method and the Fast Multipole Method. We first show that exploiting temporal locality on accesses to communicated data is critical to obtaining good performance on these applications and then argue that coherent caches on shared-address-space machines exploit this locality both automatically and very effectively. Next, we examine the implications of scaling the applications to run on larger machines. We use scaling methods that reflect the concerns of the application scientist and find that this leads to different conclusions about how communication traffic and local cache and memory usage scale than scaling based only on data set size. In particular, we show that under the most realistic form of scaling, both the communication-to-computation ratio and the working-set size (and hence the ideal cache size per processor) grow slowly as larger problems are run on larger machines. Finally, we examine the effects of using the two dominant abstractions for interprocessor communication: a shared address space and explicit message passing between private address spaces. We show that the lack of an efficiently supported shared address space will substantially increase the programming complexity and performance overheads for these applications.


Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread first waits by polling a synchronization variable. If the cost of polling reaches a limit $L_{poll}$ and further waiting is necessary, the thread is blocked, incurring an additional fixed cost, $B$. The choice of $L_{poll}$ is a critical determinant of the performance of two-phase algorithms. We focus on methods for statically determining $L_{poll}$ because the run-time overhead of dynamically determining $L_{poll}$ can be comparable to the cost of blocking in large-scale multiprocessor systems with lightweight threads. Our experiments show that always-block ($L_{poll} = 0$) is a good waiting algorithm with performance that is usually close to the best of the algorithms compared. We show that even better performance can be achieved with a static choice of $L_{poll}$ based on knowledge of likely wait-time distributions. Motivated by the observation that different synchronization types exhibit different wait-time distributions, we prove that a static choice of $L_{poll}$ can yield close to optimal on-line performance against an adversary that is restricted to choosing wait times from a fixed family of probability distributions. This result allows us to make an optimal static choice of $L_{poll}$ based on synchronization type. For exponentially distributed wait times, we prove that setting $L_{poll} = \ln(e-1)\,B$ results in a waiting cost that is no more than $e/(e-1)$ times the cost of an optimal off-line algorithm. For uniformly distributed wait times, we prove that setting $L_{poll} = \tfrac{1}{2}(\sqrt{5}-1)\,B$ results in a waiting cost that is no more than $(\sqrt{5}+1)/2$ (the golden ratio) times the cost of an optimal off-line algorithm. Experimental measurements of several parallel applications on the Alewife multiprocessor simulator corroborate our theoretical findings.
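The cost model in this abstract can be sketched numerically. The Python below is an illustration only, not code from the paper. It assumes polling cost is measured in the same units as wait time, so a thread that polls for the whole wait pays exactly the wait time. The function names are hypothetical, and the closed-form expectations are derived here from that assumed cost model. The sketch checks the exponential-wait bound by scanning a grid of rates with the polling limit set to ln(e-1) times the blocking cost.

```python
import math


def two_phase_cost(wait_time, l_poll, block_cost):
    """Cost of one wait: poll up to l_poll, then block at a fixed extra cost."""
    if wait_time <= l_poll:
        return wait_time              # the wait ended while still polling
    return l_poll + block_cost        # polled up to the limit, then blocked


def offline_cost(wait_time, block_cost):
    """Cost for an off-line algorithm that knows the wait time in advance."""
    return min(wait_time, block_cost)


def expected_two_phase_cost_exp(rate, l_poll, block_cost):
    """Expected two-phase cost when wait times are exponential with this rate.

    Closed form of E[two_phase_cost(T)] for T ~ Exp(rate), derived here
    from the assumed cost model above.
    """
    tail = math.exp(-rate * l_poll)   # probability the wait outlasts polling
    return (1.0 - tail) / rate + block_cost * tail


def expected_offline_cost_exp(rate, block_cost):
    """Expected off-line cost E[min(T, block_cost)] for T ~ Exp(rate)."""
    return (1.0 - math.exp(-rate * block_cost)) / rate


def competitive_ratio_exp(rate, l_poll, block_cost):
    """Ratio of expected two-phase cost to expected off-line cost."""
    return (expected_two_phase_cost_exp(rate, l_poll, block_cost)
            / expected_offline_cost_exp(rate, block_cost))


B = 1.0                               # blocking cost (arbitrary unit)
L_POLL = math.log(math.e - 1.0) * B   # static polling limit from the abstract
BOUND = math.e / (math.e - 1.0)       # claimed worst-case ratio

# Scan rates 0.01 .. 10.00 and record the worst ratio seen on the grid.
worst = max(competitive_ratio_exp(k / 100.0, L_POLL, B) for k in range(1, 1001))
print(f"L_poll = {L_POLL:.4f} * B, worst ratio on grid = {worst:.4f}, bound = {BOUND:.4f}")
```

Under these assumptions the ratio peaks where the mean wait time equals the blocking cost, and on this grid it does not exceed e/(e-1). Setting `l_poll` to 0 in `two_phase_cost` gives the always-block policy mentioned in the abstract.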


Articles published on Large-scale Multiprocessors

MAD kernels: an experimental testbed to study multiprocessor memory system behavior

Scalable concurrent counting

High performance synchronization algorithms for multiprogrammed multiprocessors

Implications of hierarchical N-body methods for multiprocessor architectures

Interleaving

Application scheduling and processor allocation in multiprogrammed parallel processing systems

Spin-lock synchronization on the Butterfly and KSR1

Waiting algorithms for synchronization in large-scale multiprocessors

Sparcle: an evolutionary processor design for large-scale multiprocessors

Working sets, cache sizes, and node granularity issues for large-scale multiprocessors

Real time control of a multiple arm system by large scale multiprocessor: T. Hoshino, K. Furuta, pp 37–42

Hiding shared memory reference latency on the Galactica Net distributed shared memory architecture

Design of an adaptive cache coherence protocol for large scale multiprocessors

Active messages

A cost-effective combining structure for large-scale shared-memory multiprocessors

Page placement policies for NUMA multiprocessors

Correction to 'Directory-based cache coherence in large-scale multiprocessor'

Directory-based cache coherence in large-scale multiprocessors

APRIL
