Abstract

The objective of the PULSAR project was to design a programming model suitable for large-scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR programming model is quite simple, with point-to-point channels as the main communication abstraction. The runtime implementation is very lightweight and fully distributed, and provides multithreading, message-passing, and multi-GPU offload capabilities. Performance evaluation shows good scalability up to one thousand nodes with one thousand GPU accelerators.

Highlights

  • Motivation: High-end supercomputers are on a steady path of growth in size and complexity

  • The Parallel Unified Linear algebra with Systolic ARrays (PULSAR) programming model relies on five abstractions to define the computation: Virtual Systolic Array (VSA), Virtual Data Processor (VDP), channel, packet, tuple; and on two abstractions to map the computation to the actual hardware: thread, device

  • The VSA is the main object in PULSAR, containing all the top-level information about the system, including: the total number of nodes and the rank of the local node, the number of CPU threads launched per node, and the number of GPU devices used per node
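The systolic model behind these abstractions can be illustrated with Cannon's matrix multiplication, which the paper itself uses as its example workload. The sketch below is a conceptual, single-process simulation, not PULSAR's actual API: each grid cell stands in for a VDP holding one element, and each cyclic shift models a packet sent to a neighbor over a point-to-point channel. All function and variable names here are illustrative assumptions.

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices with Cannon's algorithm on an n x n
    grid of virtual processors (one matrix element per processor).

    Conceptual stand-ins for PULSAR abstractions (not the real API):
    each grid position plays the role of a VDP, and each cyclic shift
    models packets flowing over point-to-point channels between VDPs.
    """
    n = len(A)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j,
    # so that matching operands meet at each virtual processor.
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Each virtual processor multiplies its local operands and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Systolic step: A blocks move one position left, B blocks move one
        # position up, wrapping around torus-style.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

In PULSAR proper, each shift would be a packet pushed into a channel and the loop body would be a VDP's function fired by the runtime, with the VSA mapping the grid of VDPs onto threads and GPU devices.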


Summary

Motivation

High-end supercomputers are on a steady path of growth in size and complexity. One can get a fairly reasonable picture of the road that lies ahead by examining the platforms that will be brought online under the DOE's CORAL initiative. Summit and Sierra will follow the hybrid computing model, coupling powerful latency-optimized processors with highly parallel throughput-optimized accelerators. They will rely on IBM POWER CPUs, NVIDIA Volta GPUs, and the NVIDIA NVLink interconnect to connect the hybrid devices within each node, and a Mellanox dual-rail EDR InfiniBand interconnect to connect the nodes. All platforms will benefit from recent advances in 3D-stacked memory technology. Overall, both types of systems promise major performance improvements: CPU memory bandwidth is expected to be between 200 GB/s and 300 GB/s using HMC; GPU memory bandwidth is expected to approach 1 TB/s using HBM; GPU memory capacity is expected to reach 60 GB (NVIDIA Volta); and NVLink is expected to deliver no less than 80 GB/s, and possibly as much as 200 GB/s, of CPU-to-GPU bandwidth.

Background
Related Work
Programming Model
Packet
Channel
Construction and Operation
VSA Construction and Launching
VDP Creation and Insertion
Channel Creation and Insertion
Mapping of VDPs to Threads and Devices
VDP Operation
Channel Deactivation and Reactivation
Handling of Tuples
Runtime Implementation
Threads and Devices
Software Engineering
Cannon’s Matrix Multiplication
Performance Experiments
Conclusion