Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures

Ioan Hadade, Feng Wang, Mauro Carnevale, Luca Di Mare

doi:10.1016/j.cpc.2018.07.001

Open Access

https://doi.org/10.1016/j.cpc.2018.07.001

Journal: Computer Physics Communications	Publication Date: Jul 18, 2018
Citations: 27	License type: cc-by

Affiliation: University of Oxford

Abstract

This paper presents a number of optimisations for improving the performance of unstructured computational fluid dynamics codes on multicore and manycore architectures such as the Intel Sandy Bridge, Broadwell and Skylake CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors. We discuss and demonstrate their implementation in two distinct classes of computational kernels: face-based loops, represented by the computation of fluxes, and cell-based loops, representing updates to state vectors. We show the importance of making efficient use of the underlying vector units in both classes of computational kernels, with special emphasis on the changes required for vectorising face-based loops and their intrinsic indirect and irregular access patterns. We demonstrate the advantage of different data layouts for cell-centred as well as face data structures, and of architecture-specific optimisations for improving the performance of gather and scatter operations, which are prevalent in unstructured mesh applications. The implementation of a software prefetching strategy based on auto-tuning is also shown, along with an empirical evaluation of the importance of multithreading for in-order architectures such as Knights Corner. We explore the various memory modes available on the Intel Xeon Phi Knights Landing architecture and present an approach whereby both the traditional DRAM and the MCDRAM interfaces are exploited for maximum performance. We obtain significant full application speed-ups of between 2.8X and 3X across the multicore CPUs in two-socket node configurations, 8.6X on the Intel Xeon Phi Knights Corner coprocessor and 5.6X on the Intel Xeon Phi Knights Landing processor in an unstructured finite volume CFD code representative in size and complexity of an industrial application.
Program summary
Program Title: some_opt_for_unstructured_cfd
Program Files doi: http://dx.doi.org/10.17632/zyh2zkf3jw.1
Licensing provisions: GNU General Public License 3 (GPL)
Programming language: C/C++
Nature of problem: The solution of fluid flow problems in the vicinity of complex geometries mandates the utilisation of unstructured grids. However, the flexibility of unstructured mesh methods in dealing with complicated geometries comes at the cost of increased difficulty in extracting high performance out of modern processors. We provide implementations for a number of optimisations useful for improving the performance of unstructured CFD codes on modern multicore and manycore architectures.
Solution method: grid renumbering via Reverse Cuthill–McKee, code transformations necessary for enabling vectorisation, face colouring/reordering for removing dependencies at the face end-points when accumulating residuals, data layout transformations for reducing cache misses, hand-tuned gather and scatter primitives for in-register transpositions, software prefetching via auto-tuning and multithreading for exploiting SMT features of modern processors.
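The face colouring step listed under the solution method can be illustrated with a minimal C++ sketch. It is not the code distributed with the paper: the `Face` struct, the function names `colour_faces` and `accumulate_residuals`, the greedy colouring rule and the sign convention (flux leaves the left cell and enters the right cell) are all assumptions made for illustration. The idea shown is only that faces of one colour never share a cell, so their residual updates do not conflict.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical face record: indices of the two cells on either side of a face.
struct Face {
    int left;
    int right;
};

// Greedy colouring (an assumed scheme): each face takes the smallest colour
// not already used by another face touching either of its two cells.
std::vector<int> colour_faces(const std::vector<Face>& faces, int num_cells) {
    // used[cell][k] != 0 means colour k is already taken at that cell.
    std::vector<std::vector<char>> used(num_cells);
    std::vector<int> colour(faces.size(), -1);
    for (std::size_t f = 0; f < faces.size(); ++f) {
        const Face& face = faces[f];
        assert(face.left >= 0 && face.left < num_cells);
        assert(face.right >= 0 && face.right < num_cells);
        assert(face.left != face.right);
        std::vector<char>& ul = used[face.left];
        std::vector<char>& ur = used[face.right];
        std::size_t c = 0;
        while ((c < ul.size() && ul[c]) || (c < ur.size() && ur[c])) {
            ++c;
        }
        if (ul.size() <= c) ul.resize(c + 1, 0);
        if (ur.size() <= c) ur.resize(c + 1, 0);
        ul[c] = 1;
        ur[c] = 1;
        colour[f] = static_cast<int>(c);
    }
    return colour;
}

// Accumulate per-face fluxes into cell residuals, one colour at a time.
// Assumed sign convention: flux leaves the left cell and enters the right cell.
void accumulate_residuals(const std::vector<Face>& faces,
                          const std::vector<int>& colour,
                          const std::vector<double>& flux,
                          std::vector<double>& residual) {
    assert(colour.size() == faces.size() && flux.size() == faces.size());
    // Bucket face indices by colour so that each colour forms one list.
    std::vector<std::vector<std::size_t>> bucket;
    for (std::size_t f = 0; f < faces.size(); ++f) {
        const std::size_t c = static_cast<std::size_t>(colour[f]);
        if (bucket.size() <= c) bucket.resize(c + 1);
        bucket[c].push_back(f);
    }
    for (const std::vector<std::size_t>& group : bucket) {
        // No two faces in `group` share a cell, so these writes are independent
        // and the loop is a candidate for vectorisation or threading.
        for (std::size_t f : group) {
            residual[faces[f].left] -= flux[f];
            residual[faces[f].right] += flux[f];
        }
    }
}
```

For a one-dimensional chain of four cells with faces (0,1), (1,2) and (2,3), this greedy rule places the first and third faces in one colour and the middle face in another, since only the middle face shares a cell with each of its neighbours.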

Highlights

The work of Anderson et al. [2] was subsequently extended in the context of the FUN3D code by a number of studies: Gropp et al. [6] introduced performance models to guide the optimisation process by classifying the operational characteristics of the computational kernels and their interaction with the underlying hardware; Mudigere et al. [7] demonstrated shared memory optimisations on modern parallel architectures, including vectorisation and threading through a hybrid MPI/OpenMP implementation; Al Farhan et al. [8] presented optimisations specific to the Intel Xeon Phi Knights Corner processor; and Duffy et al. [9] ported FUN3D for execution on graphics processing units, obtaining a factor of two speed-up as a result.
As such, performing grid renumbering in unstructured grid applications is mandatory for improving the performance of face-based or edge-based kernels, especially when they are already tuned for exploiting the arithmetic units.
We have presented a number of optimisations useful for improving the performance of unstructured finite volume CFD codes on a range of multicore and manycore architectures that form the backbone of current and likely future HPC systems.
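The grid renumbering mentioned in the highlights can be sketched in C++ as a textbook Reverse Cuthill–McKee ordering on a cell adjacency graph. This is an illustrative sketch rather than the authors' implementation: the `Graph` alias and the function names are hypothetical, and the start cell is chosen by lowest degree instead of a pseudo-peripheral search. The `bandwidth` helper measures the largest index distance between neighbouring cells, which serves as a rough proxy for how far apart in memory the two ends of a face lie.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <queue>
#include <vector>

// Hypothetical representation: adj[i] lists the cells sharing a face with cell i.
using Graph = std::vector<std::vector<int>>;

// Returns `order`, where order[k] is the old index of the cell that becomes
// new cell k. Start cells are picked by lowest degree (a simple heuristic).
std::vector<int> reverse_cuthill_mckee(const Graph& adj) {
    const int n = static_cast<int>(adj.size());
    auto by_degree = [&adj](int a, int b) {
        return adj[a].size() < adj[b].size();
    };
    std::vector<int> starts(n);
    for (int i = 0; i < n; ++i) starts[i] = i;
    std::stable_sort(starts.begin(), starts.end(), by_degree);

    std::vector<char> visited(n, 0);
    std::vector<int> order;
    order.reserve(n);
    for (int s : starts) {            // loop covers disconnected components
        if (visited[s]) continue;
        visited[s] = 1;
        std::queue<int> q;
        q.push(s);
        while (!q.empty()) {
            const int v = q.front();
            q.pop();
            order.push_back(v);
            // Enqueue unvisited neighbours in order of increasing degree.
            std::vector<int> nbrs;
            for (int w : adj[v]) {
                if (!visited[w]) {
                    visited[w] = 1;
                    nbrs.push_back(w);
                }
            }
            std::stable_sort(nbrs.begin(), nbrs.end(), by_degree);
            for (int w : nbrs) q.push(w);
        }
    }
    std::reverse(order.begin(), order.end());  // the "reverse" in RCM
    return order;
}

// Largest distance in the new numbering between any two neighbouring cells.
int bandwidth(const Graph& adj, const std::vector<int>& order) {
    std::vector<int> new_index(order.size());
    for (std::size_t k = 0; k < order.size(); ++k) {
        new_index[order[k]] = static_cast<int>(k);
    }
    int bw = 0;
    for (std::size_t i = 0; i < adj.size(); ++i) {
        for (int j : adj[i]) {
            bw = std::max(bw, std::abs(new_index[i] - new_index[j]));
        }
    }
    return bw;
}
```

As a small check, a chain of five cells numbered out of sequence (0-3-1-4-2) has bandwidth 3 in its original numbering and bandwidth 1 after this reordering, meaning every face then connects cells with adjacent indices.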

Summary

Introduction

Anderson et al. [2] presented the optimisation of FUN3D [3], a tetrahedral vertex-centred unstructured mesh code developed at the NASA Langley Research Center for the solution of the compressible and incompressible Euler and Navier–Stokes equations, for which they received the 1999 Gordon Bell Prize [4]. Their optimisations were based on the concept of memory-centric computations, whereby the aim was to minimise the number of memory references as much as possible, in recognition that flops are cheap relative to memory load and store operations.

Methods

Results

Conclusion