Abstract

This article demonstrates the utility and implementation of software prefetching in an unstructured finite volume computational fluid dynamics code of representative size and complexity to an industrial application and across a number of modern processors. We present the benefits of auto-tuning for finding the optimal prefetch distance values across different computational kernels and architectures and demonstrate the importance of choosing the right prefetch destination across the available cache levels for best performance. We discuss the impact of the data layout on the number of prefetch instructions required in kernels with indirect addressing patterns and show how to best implement them in an existing large-scale computational fluid dynamics application. Through this, we show significant full application speed-ups on a range of processors and realistic test cases in both single core/tile and full socket configurations, such as 1.14× on the Intel Xeon Sandy Bridge, 1.09× on the Intel Xeon Broadwell, 1.29× on the Intel Xeon Skylake, 1.99× on the in-order Intel Xeon Phi Knights Corner coprocessor, and 1.51× on the out-of-order Intel Xeon Phi Knights Landing many-core processor.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call