Abstract

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features, such as SIMD units for data-parallel execution and hardware threads for core parallelism. Exploiting multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel Sandy Bridge and Haswell multicore CPUs and on the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied to two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss in detail the code transformations required for efficient SIMD computation in both kernels across the selected devices, including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques, together with optimisations that alleviate the NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assess their efficiency on each distinct architecture. We report significant speedups for single-thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

Highlights

  • Modern research and engineering rely heavily on numerical simulations

  • A detailed description of the exploitation of all levels of parallelism available in modern multicore and manycore processors through efficient code SIMDisation and thread parallelism

  • Memory optimisations described in this work include software prefetching, data layout transformations through hybrid data structures such as Array of Structures Structure of Arrays (AoSSoA), and multi-level cache blocking for the numerical fluxes


Introduction

Modern research and engineering rely heavily on numerical simulations. In research, improvements in the speed and accuracy of scientific computations can lead to new discoveries or facilitate the exploitation of recent breakthroughs [1]. Historically, performance gains in scientific and engineering applications have been obtained through advances in hardware engineering which required little or no change to the programming paradigms. Examples of such innovations are out-of-order execution, branch prediction, instruction pipelining, deeper memory hierarchies and the increase in clock frequency [2], all of which guaranteed improvements in serial performance with every new CPU generation and limited code intervention. Those days are gone, partly due to the recognition that clock frequency cannot be scaled indefinitely because of power consumption, and partly because circuitry density on the chip is approaching the limit of existing technologies, which is problematic as innovations in sequential execution require a high fraction of die real estate [3].

