Abstract

The current trend toward heterogeneous architectures motivates us to reconsider current software and hardware paradigms. Research in this area focuses on new parallel programming models, compiler design, and runtime resource management techniques that exploit the features of many-core processor architectures. Graphics Processing Units (GPUs) have become the platform of choice for accelerating a wide range of data-parallel and task-parallel applications. The rapid adoption of GPU computing has been greatly aided by the introduction of high-level programming environments such as CUDA C and OpenCL. However, each vendor implements these programming models differently, and we must analyze the internals of these implementations to better understand the performance results. One of the main differences across implementations is how the compiler and the hardware handle program control flow. Some implementations support unstructured control flow based on branches and labels; others rely solely on structured if-then and while constructs. In this paper we describe a tool that can be used to analyze the differences between these two approaches. We created a dynamic compiler called Caracal that translates applications with unstructured control flow so they can run on hardware that requires structured programs. To accomplish this, Caracal builds a control tree of the program and creates single-entry, single-exit regions called hammock graphs. We used this tool to analyze the performance differences between NVIDIA's implementation of CUDA C and AMD's implementation of OpenCL. We found that the requirement for structured control flow can increase the number of allocated registers by as many as 20 and impact performance by as much as 2x.
