PRAM-on-chip

Xingzhi Wen, Uzi Vishkin

doi:10.1145/1248377.1248427

Abstract

Introduction

The eXplicit Multi-Threading (XMT) on-chip general-purpose computer architecture is aimed at the classic goal of reducing single-task completion time. It is a parallel algorithmic architecture in the sense that: (i) it seeks to provide good performance for parallel programs derived from Parallel Random Access Machine/Model (PRAM) algorithms, and (ii) it comes with a methodology for advancing from PRAM algorithms to XMT programs, along with a performance metric and its empirical validation [1]. Ease of parallel programming is now widely recognized as the main stumbling block for extending commodity computer performance growth (e.g., using multicores). XMT provides a unique answer to this challenge.

This brief announcement (BA) reports the first commitment of XMT to silicon. A 64-processor, 75MHz computer based on field-programmable gate array (FPGA) technology was built at the University of Maryland (UMD). XMT was introduced in SPAA’98. An architecture simulator and speed-up results on several kernels were reported in SPAA’01. The new computer is a significant milestone for the broad PRAM-On-Chip project at UMD. In fact, contributions in the current BA include several stages completed since SPAA’01: completion of the design using a hardware description language (HDL), synthesis into a gate-level “netlist”, and validation of the design in real hardware. This overall progress, its context, and the uses of the much faster hardware over a simulator are the focus of this BA.

The PRAM virtual model of computation assumes that any number of concurrent accesses to a shared memory take the same time as a single access. In the Arbitrary Concurrent-Read Concurrent-Write (CRCW) PRAM, concurrent accesses to the same memory location, for reads or for writes, are allowed. Reads are resolved before writes, and among concurrent writes to a location an arbitrary one, unknown in advance, succeeds.
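
The Arbitrary CRCW rule can be made concrete with a small sequential simulation of one PRAM step. The following is only a minimal sketch in C, not code from the XMT project: the `write_req` type and the `crcw_step` function are hypothetical names introduced here for illustration. The sketch assumes each virtual processor performs exactly one read and one write per step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one write request issued by a virtual processor. */
typedef struct {
    size_t addr;  /* target cell in shared memory */
    int value;    /* value the processor tries to store */
} write_req;

/*
 * Simulate one Arbitrary CRCW PRAM step sequentially.
 * Processor p reads mem[read_addr[p]] into read_out[p] and issues writes[p].
 * Phase 1 serves every read from the pre-step memory contents.
 * Phase 2 applies the writes; when several target the same cell, the one
 * applied last wins. Index order is just one legal "arbitrary" choice, so
 * callers must not rely on which writer succeeds.
 */
void crcw_step(int *mem, size_t mem_size, size_t nproc,
               const size_t *read_addr, int *read_out,
               const write_req *writes)
{
    for (size_t p = 0; p < nproc; p++) {
        assert(read_addr[p] < mem_size);
        read_out[p] = mem[read_addr[p]];
    }
    for (size_t p = 0; p < nproc; p++) {
        assert(writes[p].addr < mem_size);
        mem[writes[p].addr] = writes[p].value;
    }
}
```

For instance, if three processors all read a cell holding 7 and then try to write 10, 20 and 30 to it, every read returns 7 and the cell ends up holding one of the three written values. A correct program may only depend on that weaker guarantee. Note that this simulation takes time proportional to the number of processors, whereas the PRAM model charges the whole step as a single time unit.
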
Design of an efficient parallel algorithm for the Arbitrary CRCW PRAM model would seek to optimize the total number of operations the algorithm performs (“work”) and its parallel time (“depth”), assuming unlimited hardware. Given such an algorithm, an XMT program is written in XMTC, which is a modest single-program multiple-data (SPMD) multi-threaded extension of C that includes three commands: Spawn, Join and PS (for Prefix-Sum, a Fetch-and-Increment-like command). The program seeks to optimize: (i) the length of the (longest) sequence of round trips to memory (LSRTM), (ii) queuing delay to the
