Abstract

This paper presents an extension of OpenACC directives and a source-to-source translator that together enable directive-based temporal blocking for out-of-core stencil computation on a graphics processing unit (GPU). Here, out-of-core stencil computation refers to stencil computation on data too large to be stored entirely in GPU memory. Given OpenACC-like code, the proposed translator generates OpenACC code that decomposes the large data into smaller chunks and processes them in a pipelined manner, hiding the overhead of transferring chunks between GPU memory and CPU memory. The generated code is further optimized with a temporal blocking technique to minimize the amount of CPU-GPU data transfer. In experiments, we apply the proposed translator to three stencil computation codes. The out-of-core performance on a Tesla K40 GPU reaches 73.4 GFLOPS, which is only 13% lower than the in-core performance. We therefore consider our directive-based approach useful for facilitating out-of-core stencil computation on a GPU.
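The chunking and temporal blocking idea can be illustrated with a minimal sketch. It is not the translator's generated OpenACC code: it is a plain Python model, assuming a 1D three-point averaging stencil with fixed end points, in which list slices stand in for CPU-GPU copies and pipelining is omitted. The names `step`, `in_core`, `out_of_core`, `chunk`, and `k` are hypothetical and chosen only for illustration. Each chunk is loaded with a halo of `k` cells per side so that `k` time steps can be applied before the chunk is written back.

```python
def step(u):
    """One sweep of a 3-point averaging stencil; the two end points stay fixed."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def in_core(u, steps):
    """Reference: the whole array is resident and advanced one step at a time."""
    for _ in range(steps):
        u = step(u)
    return u


def out_of_core(u, steps, chunk, k):
    """Advance u by `steps` sweeps, one chunk at a time, k sweeps per chunk load.

    Returns the result and the number of simulated chunk transfers.
    """
    n = len(u)
    transfers = 0
    done = 0
    while done < steps:
        kk = min(k, steps - done)          # sweeps in this temporal block
        new = u[:]
        for start in range(0, n, chunk):
            end = min(start + chunk, n)
            lo = max(start - kk, 0)        # halo of kk cells, clamped to the domain
            hi = min(end + kk, n)
            local = u[lo:hi]               # stands in for a host-to-device copy
            transfers += 1
            for _ in range(kk):
                local = step(local)        # halo cells go stale; the interior stays valid
            new[start:end] = local[start - lo:end - lo]  # write back the interior only
        u = new
        done += kk
    return u, transfers


if __name__ == "__main__":
    u0 = [float((i * 7) % 11) for i in range(64)]
    ref = in_core(u0, 8)
    for k in (1, 4):
        out, transfers = out_of_core(u0, 8, 16, k)
        err = max(abs(a - b) for a, b in zip(ref, out))
        print(f"k={k}: transfers={transfers}, max error={err}")
```

In this model, raising `k` from 1 to 4 cuts the number of chunk loads by a factor of four for the same number of time steps, at the cost of redundant computation in the halo cells. This is the trade-off that temporal blocking exploits.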
