Abstract

Stencil codes are widely used in scientific computing. Current research on optimizing stencil applications focuses on data-level and thread-level parallelism. Vector/SIMD instructions, the usual means of achieving data-level parallelism, can effectively speed up computations that consist of many repetitive operations, but the gain is often limited by memory bandwidth or by data and control dependencies. The Scalable Vector Extension (SVE), the new generation of ARM's vector ISA and the successor to the older Neon SIMD technology, is Vector-Length Agnostic (VLA): code does not depend on the vector register length, which makes vectorization more flexible. In this paper we design ARM SVE implementations of 2d5p, 2d9p, 3d7p, and 3d27p stencil codes, which are among the most common stencil types, and optimize them with classical strategies such as loop unrolling and data reuse. Our experiments on ARM processors with vector lengths from 128 to 2048 bits show performance improvements of up to 2.88x over directly vectorized code, 8.91x over Neon, and 16.31x over scalar code. In addition, we provide a set of templates that can be flexibly reconfigured when the stencil code changes and that directly generate efficient ARM SVE instructions. These templates should make it considerably easier to optimize other stencil codes.
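
For readers unfamiliar with the notation, "2d5p" denotes a two-dimensional stencil that updates each point from five points: the point itself and its four axis neighbors. The following is a minimal scalar sketch in C of one such sweep, not the paper's SVE implementation. The function name `stencil_2d5p`, the weights `c` and `w`, and the choice to copy boundary cells unchanged are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): one Jacobi-style sweep of a
 * 2D 5-point (2d5p) stencil over a row-major grid of ny rows by nx columns.
 * c weights the center point and w weights each of the four neighbors;
 * both weights are illustrative assumptions. Boundary cells are copied. */
void stencil_2d5p(const double *restrict in, double *restrict out,
                  size_t ny, size_t nx, double c, double w)
{
    assert(ny >= 3 && nx >= 3);

    /* Copy the boundary cells unchanged (an assumed boundary condition). */
    for (size_t i = 0; i < ny; ++i) {
        for (size_t j = 0; j < nx; ++j) {
            if (i == 0 || i == ny - 1 || j == 0 || j == nx - 1) {
                out[i * nx + j] = in[i * nx + j];
            }
        }
    }

    /* Update interior points from the center and its four neighbors. */
    for (size_t i = 1; i + 1 < ny; ++i) {
        const double *up   = in + (i - 1) * nx;
        const double *mid  = in + i * nx;
        const double *down = in + (i + 1) * nx;
        double *dst = out + i * nx;

        /* Unit-stride inner loop with no loop-carried dependence:
         * this is the loop a SIMD implementation would vectorize. */
        for (size_t j = 1; j + 1 < nx; ++j) {
            dst[j] = c * mid[j]
                   + w * (up[j] + down[j] + mid[j - 1] + mid[j + 1]);
        }
    }
}
```

Each output row reads three input rows, and neighboring iterations of the inner loop load overlapping elements of `mid`. That overlap is the kind of redundancy that data-reuse optimizations aim to exploit.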
