Knights Corner Research Articles

This paper presents a number of optimisations for improving the performance of unstructured computational fluid dynamics codes on multicore and manycore architectures such as the Intel Sandy Bridge, Broadwell and Skylake CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors. We discuss and demonstrate their implementation in two distinct classes of computational kernels: face-based loops, represented by the computation of fluxes, and cell-based loops, represented by updates to state vectors. We show the importance of making efficient use of the underlying vector units in both classes of kernels, with special emphasis on the changes required for vectorising face-based loops and their intrinsically indirect and irregular access patterns. We demonstrate the advantage of different data layouts for cell-centred and face data structures, and of architecture-specific optimisations for improving the performance of the gather and scatter operations that are prevalent in unstructured mesh applications. The implementation of a software prefetching strategy based on auto-tuning is also shown, along with an empirical evaluation of the importance of multithreading for in-order architectures such as Knights Corner. We explore the various memory modes available on the Intel Xeon Phi Knights Landing architecture and present an approach whereby both the traditional DRAM and the MCDRAM interfaces are exploited for maximum performance. We obtain full application speed-ups of between 2.8X and 3X across the multicore CPUs in two-socket node configurations, 8.6X on the Intel Xeon Phi Knights Corner coprocessor and 5.6X on the Intel Xeon Phi Knights Landing processor in an unstructured finite volume CFD code representative in size and complexity of an industrial application.
Program summary
Program Title: some_opt_for_unstructured_cfd
Program Files doi: http://dx.doi.org/10.17632/zyh2zkf3jw.1
Licensing provisions: GNU General Public License 3 (GPL)
Programming language: C/C++
Nature of problem: The solution of fluid flow problems in the vicinity of complex geometries mandates the utilisation of unstructured grids. However, this flexibility of unstructured mesh methods in dealing with complicated geometries comes at a cost of increased difficulty in extracting high performance out of modern processors. We provide implementations for a number of optimisations useful for improving the performance of unstructured CFD codes on modern multicore and manycore architectures.
Solution method: grid renumbering via Reverse Cuthill–McKee, code transformations necessary for enabling vectorisation, face colouring/reordering for removing dependencies at the face end-points when accumulating residuals, data layout transformations for reducing cache misses, hand-tuned gather and scatter primitives for in-register transpositions, software prefetching via auto-tuning and multithreading for exploiting SMT features of modern processors.
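To illustrate the face colouring step named in the solution method, the following is a minimal sketch of a greedy colouring in C++. It is not taken from the published code: the `Face` type, the `colour_faces` function and the two-cells-per-face mesh representation are hypothetical, and the scheme used by the authors may differ. The idea is that two faces sharing a cell never receive the same colour, so all faces of one colour can accumulate residuals into their end-point cells without write conflicts.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mesh representation: each face joins two cells {left, right}.
using Face = std::array<int, 2>;

// Greedy face colouring: give each face the smallest colour that is not
// already used by another face touching either of its two end-point cells.
std::vector<int> colour_faces(const std::vector<Face>& faces, int num_cells) {
    // used[c][k] is true when some face adjacent to cell c has colour k.
    std::vector<std::vector<bool>> used(num_cells);
    std::vector<int> colour(faces.size(), -1);

    for (std::size_t f = 0; f < faces.size(); ++f) {
        for (int cell : faces[f]) {
            assert(cell >= 0 && cell < num_cells);
        }
        const std::vector<bool>& left = used[faces[f][0]];
        const std::vector<bool>& right = used[faces[f][1]];

        // Find the first colour free at both end-points.
        int c = 0;
        while ((c < static_cast<int>(left.size()) && left[c]) ||
               (c < static_cast<int>(right.size()) && right[c])) {
            ++c;
        }
        colour[f] = c;

        // Mark the colour as taken at both end-point cells.
        for (int cell : faces[f]) {
            if (static_cast<int>(used[cell].size()) <= c) {
                used[cell].resize(c + 1, false);
            }
            used[cell][c] = true;
        }
    }
    return colour;
}
```

In a flux loop, faces would then be processed one colour at a time (typically after reordering them so that each colour is contiguous in memory); within a colour, the scatter into cell residuals has no dependencies and can be vectorised or threaded.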

Abstract. The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), a multi-scale chemical transport model used for air quality forecasting and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features, such as being a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 v4 (Broadwell) processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to a hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512-bit-wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve performance and parallel scalability. These optimisations greatly improved the GNAQPMS performance. Compared with the baseline version of GNAQPMS, the optimised version was 3.51× faster on KNL and 2.77× faster on the CPU. Moreover, the optimised version ran at 26% lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5% more efficient in power consumption than the CPU platform. The optimisations also improved parallel scalability: scaled to 40 CPU nodes and 30 KNL nodes, the CPU cluster and the KNL cluster reached parallel efficiencies of 70.4% and 42.2%, respectively.

Articles published on Knights Corner

Software Prefetching for Unstructured Mesh Applications

Exploiting Parallelism and Vectorisation in Breadth-First Search for the Intel Xeon Phi

Optimization strategies for geophysics models on manycore systems

Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures

A Framework for the Automatic Vectorization of Parallel Sort on x86-Based Processors

Accelerated simulation of microwave breakdown in gases on Xeon Phi based cluster-application to self-organized plasma pattern formation

DD-αAMG on QPACE 3

Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance

Code modernization strategies to 3-D Stencil-based applications on Intel Xeon Phi: KNC and KNL

GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

Coarray-based load balancing on heterogeneous and many-core architectures

A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

Accelerating gravitational microlensing simulations using the Xeon Phi coprocessor

Scalable training of 3D convolutional networks on multi- and many-cores

Automated Compiler Optimization of Multiple Vector Loads/Stores

Optimizing the Monte Carlo Neutron Cross-Section Construction Code XSBench for MIC and GPU Platforms

Task-based Cholesky decomposition on Xeon Phi architectures using OpenMP

Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

An improved parallelism scheme for deterministic discrete ordinates transport
