Abstract

As the push towards exascale hardware has increased the diversity of system architectures, performance portability has become a critical aspect for scientific software. We describe the Kokkos Performance Portable Programming Model that allows developers to write single source applications for diverse high-performance computing architectures. Kokkos provides key abstractions for both the compute and memory hierarchy of modern hardware. We describe the novel abstractions that have been added to Kokkos version 3 such as hierarchical parallelism, containers, task graphs, and arbitrary-sized atomic operations to prepare for exascale era architectures. We demonstrate the performance of these new features with reproducible benchmarks on CPUs and GPUs.

Highlights

  • Over the last decade, the High Performance Computing (HPC) hardware landscape has diversified significantly

  • The GPU side was long dominated by NVIDIA alone, but the first generation of upcoming exascale platforms will deploy AMD and Intel GPUs instead. As a result, it is becoming more difficult to write code that can leverage all of the HPC systems users have access to

  • With the lifetime of the most important HPC applications measured in decades, and far exceeding the lifetime of any given machine, the demand for performance


Summary

INTRODUCTION

Over the last decade, the High Performance Computing (HPC) hardware landscape has diversified significantly. Since the publication of [9], the need to support more complex applications has resulted in significant extensions of the programming model, which are the focus of the current paper. These additions, developed as part of the Kokkos version 3 release cycle, focus on exposing more parallelism, asynchronicity, and advanced hardware capabilities, all of which are needed to fully leverage upcoming exascale-era architectures. The paper's contributions include a demonstration of the flexibility and performance of the programming model through carefully chosen benchmarks on CPU and GPU architectures, and unique functionality such as arbitrary-sized atomic operations provided in a portable manner. We also provide a small insight into the practical performance portability achieved by users of Kokkos, based on a number of studies these users have published.

BENCHMARK REPRODUCIBILITY INFORMATION
FUNDAMENTAL CAPABILITIES
Memory Spaces
Memory Layouts
Memory Traits
ADVANCED REDUCTIONS
GENERIC ATOMICS
CONTAINERS
ScatterView
MDRANGEPOLICY
HIERARCHICAL PARALLELISM
Team Synchronization Semantics
Vector Level Parallelism
Team Scratch Memory
EXECUTION SPACE INSTANCES
KOKKOS GRAPHS
BACKENDS
Findings
PERFORMANCE PORTABILITY IN PRACTICE
