Abstract

The latest versions of OpenMP have introduced constructs for exploiting heterogeneous compute units alongside the main multicore CPU. The offloaded program portions (kernels) may generate parallelism within the target device by employing standard OpenMP constructs. However, co-processors, especially embedded ones, often have too few resources to support OpenMP efficiently. Designing an OpenMP infrastructure for such devices is a challenge, and a common design decision is to support OpenMP only partially. In this work, we present a novel solution to this problem. We propose a compiler-assisted, adaptive runtime system organization, which generates application-specific support by implementing only the OpenMP functionality that each application requires. Full OpenMP support is available if needed. However, in the usual scenario where kernels do not require complex OpenMP functionality, our method can lead to dramatically reduced executable sizes, which usually bring additional performance benefits. The mechanism is based on a preparatory compile-time kernel analysis, which generates metrics regarding the OpenMP functionality present in each kernel. These metrics are then fed to a mapper module which, given a set of rules, decides on the optimal runtime configuration. Our proposal is demonstrated by a complete implementation on the popular Parallella-16 board, exhibiting consistently large size savings and significant performance gains.
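The metrics-to-configuration step can be pictured with a minimal sketch in C. Everything in it is an illustrative assumption rather than the actual implementation: the metric fields, the three runtime flavors, the rules, and the names `kernel_metrics`, `runtime_flavor`, and `map_kernel` are hypothetical and are not taken from the paper.

```c
#include <stdbool.h>

/* Hypothetical per-kernel metrics that a compile-time analysis pass
 * might record; the field set is an assumption for illustration. */
typedef struct {
    int  parallel_regions;      /* number of parallel constructs        */
    int  max_nesting_level;     /* deepest level of nested parallelism  */
    int  barriers;              /* explicit barrier constructs          */
    int  critical_sections;     /* critical constructs                  */
    int  tasks;                 /* task constructs                      */
    bool uses_dynamic_schedule; /* any loop with a dynamic schedule     */
} kernel_metrics;

/* Hypothetical runtime configurations, ordered from smallest to largest. */
typedef enum {
    RT_MINIMAL, /* fork/join and static worksharing only  */
    RT_SYNC,    /* adds barrier and lock support          */
    RT_FULL     /* complete OpenMP runtime                */
} runtime_flavor;

/* Rule-based mapper: picks the smallest flavor that still covers
 * every feature the kernel was found to use. */
runtime_flavor map_kernel(const kernel_metrics *m)
{
    /* Rule 1: tasking, nested parallelism, or dynamic scheduling
     * requires the full runtime. */
    if (m->tasks > 0 || m->max_nesting_level > 1 || m->uses_dynamic_schedule)
        return RT_FULL;

    /* Rule 2: explicit synchronization requires the mid-sized flavor. */
    if (m->barriers > 0 || m->critical_sections > 0)
        return RT_SYNC;

    /* Rule 3: otherwise the minimal runtime suffices. */
    return RT_MINIMAL;
}
```

Under these assumed rules, a kernel with a single flat parallel region and no synchronization would map to `RT_MINIMAL`, while any kernel that creates tasks would fall back to `RT_FULL`, mirroring the statement that full OpenMP support remains available when needed.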

