Abstract

Heterogeneous systems are the core architecture of most computing systems, from high-performance computing nodes to embedded devices, due to their excellent performance and energy efficiency. Efficiently programming these systems has become a major challenge due to the complexity of their architectures and the efforts required to provide them with co-execution capabilities that can fully exploit the applications. There are many proposals to simplify the programming and management of acceleration devices and multi-core CPUs. However, in many cases, portability and ease of use compromise the efficiency of different devices—even more so when co-executing. Intel oneAPI, a new and powerful standards-based unified programming model, built on top of SYCL, addresses these issues. In this paper, oneAPI is provided with co-execution strategies to run the same kernel between different devices, enabling the exploitation of static and dynamic policies. This work evaluates the performance and energy efficiency for a well-known set of regular and irregular HPC benchmarks, using two heterogeneous systems composed of an integrated GPU and CPU. Static and dynamic load balancers are integrated and evaluated, highlighting single and co-execution strategies and the most significant key points of this promising technology. Experimental results show that co-execution is worthwhile when using dynamic algorithms and improves the efficiency even further when using unified shared memory.

Highlights

  • In recent years, with the quest to constantly improve the performance and energy efficiency of computing systems, together with the diversity of architectures and computing devices, it has become possible to exploit an interesting variety of problems due to heterogeneous systems

  • We propose to provide oneAPI with mechanisms that allow the implementation of co-execution without additional effort for the programmer

  • The proposed Coexecutor Runtime is built on top of oneAPI as a runtime library to allow the parallel exploitation of the CPU along with multiple hardware accelerators that facilitate the implementation of workload balancing algorithms

Read more

Summary

Introduction

With the quest to constantly improve the performance and energy efficiency of computing systems, together with the diversity of architectures and computing devices, it has become possible to exploit an interesting variety of problems due to heterogeneous systems. The oneAPI’s cross-architecture language Data Parallel C++ (DPC++) [25], based on SYCL standard for heterogeneous programming in C++, provides a single, unified open development model for productive heterogeneous programming and cross-vendor support. It allows code reuse across hardware targets while permitting custom tuning for a specific accelerator. This article addresses a new challenge in improving the usability and exploitation of heterogeneous systems, providing oneAPI with the capacity for co-execution This is defined as the collaboration of all the devices in the system (including the CPU) to execute a single massively data-parallel kernel [14,26,27,28].

Background
Platform Model
Execution Model
Memory Model
Kernel Programming Model
Motivation
OneAPI Coexecutor Runtime
Static Co-Execution
Dynamic Co-Execution
Load Balancing Algorithms
API Design
Validation
Performance
Scalability
Energy
NBoby Benchmark
Related Work
Findings
Conclusions
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call