Abstract

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features, such as SIMD units for data-parallel execution and hardware threads for core parallelism. Exploiting multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel Sandy Bridge and Haswell multicore CPUs and on the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied to two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss in detail the code transformations required for efficient SIMD computation in both kernels across the selected devices, including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques, together with optimisations that alleviate the NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assess their efficiency on each distinct architecture. We report significant speedups for single-thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

Highlights

  • Modern research and engineering rely heavily on numerical simulations

  • A detailed description of the exploitation of all levels of parallelism available in modern multicore and manycore processors through efficient code SIMDisation and thread parallelism

  • Memory optimisations described in this work include software prefetching, data layout transformations through hybrid data structures such as Array of Structures Structure of Arrays (AoSSoA), and multi-level cache blocking for the numerical fluxes


Introduction

Modern research and engineering rely heavily on numerical simulations. In research, improvements in the speed and accuracy of scientific computations can lead to new discoveries or facilitate the exploitation of recent breakthroughs [1]. Historically, performance gains in scientific and engineering applications have been obtained through advances in hardware engineering which required little or no change to the programming paradigms. Examples of such innovations are out-of-order execution, branch prediction, instruction pipelining, deeper memory hierarchies and the increase in clock frequency [2], all of which guaranteed improvements in serial performance with every new CPU generation and limited code intervention. Those days are gone, partly due to the recognition that clock frequency cannot be scaled indefinitely because of power consumption, and partly because circuitry density on the chip is approaching the limit of existing technologies, which is problematic as innovations in sequential execution require a high fraction of die real estate [3].

