Abstract

In this paper, we present a hybrid execution backend for the skeleton programming framework SkePU. The backend is capable of automatically dividing the workload and simultaneously executing the computation on a multi-core CPU and any number of accelerators, such as GPUs. We show how to efficiently partition the workload of skeletons such as Map, MapReduce, and Scan to allow hybrid execution on heterogeneous computer systems. We also show a unified way of predicting how the workload should be partitioned, based on performance modeling. With experiments on typical skeleton instances, we show the speedup for all skeletons when using the new hybrid backend. We also evaluate the performance on real-world applications. Finally, we show that the new implementation gives higher and more reliable performance than an older hybrid execution implementation based on dynamic scheduling.

Highlights

  • The ever-growing demand for higher performance in computing puts requirements on modern programming tools

  • To show the advantages of the static workload partitioning in the new hybrid execution backend, the experimental StarPU integration from SkePU 1 was ported to SkePU 2

  • We have presented a new hybrid execution backend for the skeleton programming framework SkePU, while preserving the existing API


Summary

Introduction

The ever-growing demand for higher performance in computing puts requirements on modern programming tools. We introduce workload partitioning implementations for all data-parallel skeletons in SkePU 2, capable of dividing the work between an arbitrary number of CPU cores and accelerators. The hybrid skeleton implementations use OpenMP: the first thread manages the accelerators, while the remaining threads work on the CPU partition. To reduce code duplication within SkePU, the accelerator partition is computed by the already existing CUDA or OpenCL backend implementations. For reductions, each CPU thread and the accelerator backend reduce their block of the input data to produce a temporary array of partial results, which is then reduced to the final value. A more sophisticated partitioning for matrices would be hard to realize, especially for the MapOverlap skeleton, due to its many corner cases and complex data access patterns.

StarPU backend implementation
Single skeleton evaluation
Generic application evaluation
Comparison to dynamic hybrid scheduling using StarPU
Related work
Findings
Conclusions and future work
