Abstract

The objective of the PULSAR project was to design a programming model suitable for large-scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing, and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.

Highlights

  • Motivation: High-end supercomputers are on a steady path of growth in size and complexity

  • The Parallel Unified Linear algebra with Systolic ARrays (PULSAR) programming model relies on five abstractions to define the computation: Virtual Systolic Array (VSA), Virtual Data Processor (VDP), channel, packet, tuple; and on two abstractions to map the computation to the actual hardware: thread, device

  • The VSA is the main object in PULSAR, containing all the top-level information about the system, including: the total number of nodes and the rank of the local node, the number of CPU threads launched per node, and the number of GPU devices used per node
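The systolic model behind these abstractions can be illustrated with Cannon's matrix multiplication, which the paper itself uses as its example workload. The sketch below is a conceptual, single-process simulation, not PULSAR's actual API: each grid cell stands in for a VDP holding one element, and each cyclic shift models a packet sent to a neighbor over a point-to-point channel. All function and variable names here are illustrative assumptions.

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices with Cannon's algorithm on an n x n
    grid of virtual processors (one matrix element per processor).

    Conceptual stand-ins for PULSAR abstractions (not the real API):
    each grid position plays the role of a VDP, and each cyclic shift
    models packets flowing over point-to-point channels between VDPs.
    """
    n = len(A)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j,
    # so that matching operands meet at each virtual processor.
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Each virtual processor multiplies its local operands and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Systolic step: A blocks move one position left, B blocks move one
        # position up, wrapping around torus-style.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

In PULSAR proper, each shift would be a packet pushed into a channel and the loop body would be a VDP's function fired by the runtime, with the VSA mapping the grid of VDPs onto threads and GPU devices.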


Summary

Motivation

High-end supercomputers are on a steady path of growth in size and complexity. One can get a fairly reasonable picture of the road that lies ahead by examining the platforms that will be brought online under the DOE's CORAL initiative. Summit and Sierra will follow the hybrid computing model, coupling powerful latency-optimized processors with highly parallel throughput-optimized accelerators. They will rely on IBM POWER CPUs, NVIDIA Volta GPUs, and the NVIDIA NVLink interconnect to connect the hybrid devices within each node, and a Mellanox dual-rail EDR InfiniBand interconnect to connect the nodes. All platforms will benefit from recent advances in 3D-stacked memory technology. Overall, both types of systems promise major performance improvements: CPU memory bandwidth is expected to be between 200 GB/s and 300 GB/s using HMC; GPU memory bandwidth is expected to approach 1 TB/s using HBM; GPU memory capacity is expected to reach 60 GB (NVIDIA Volta); and NVLink is expected to deliver no less than 80 GB/s, and possibly as much as 200 GB/s, of CPU-to-GPU bandwidth.

Background
Related Work
Programming Model
Packet
Channel
Construction and Operation
VSA Construction and Launching
VDP Creation and Insertion
Channel Creation and Insertion
Mapping of VDPs to Threads and Devices
VDP Operation
Channel Deactivation and Reactivation
Handling of Tuples
Runtime Implementation
Threads and Devices
Software Engineering
Cannon’s Matrix Multiplication
Performance Experiments
Conclusion