Abstract

When optimizing performance on a GPU, control flow divergence of threads in one warp can make up the possible performance bottlenecks. In our hand-coded GPU stencil computation optimization, with a view to remove this control flow divergence brought by conventional mapping method between global memory and shared memory, we devise a new mapping mechanism by modeling the coalesced memory accesses of GPU threads and the aligned ghost zone overheads to remove conditional statements of the boundary XY-tile stencil computation points for improved performance. In addition, we utilize only one XY-tile loaded into registers in every stencil computation iteration, common sub-expression elimination and software prefetching to reduce overheads. Finally, detailed performance evaluation demonstrates that global memory access traffic is close to the idealized lower bound value through our optimized policies, that is to say, in every computed point of one XY-tile the memory access traffic is roughly 6 and 4 % more than 8 bytes per XY-tile point of the idealized lower bound memory access traffic in which ghost zone overheads are not taken into consideration on Tesla C2050 and Kepler K20X respectively.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.