Intel Sandy Bridge Research Articles

This paper presents a number of optimisations for improving the performance of unstructured computational fluid dynamics codes on multicore and manycore architectures such as the Intel Sandy Bridge, Broadwell and Skylake CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors. We discuss and demonstrate their implementation in two distinct classes of computational kernels: face-based loops, represented by the computation of fluxes, and cell-based loops, representing updates to state vectors. We show the importance of making efficient use of the underlying vector units in both classes of computational kernels, with special emphasis on the changes required for vectorising face-based loops and their intrinsic indirect and irregular access patterns. We demonstrate the advantage of different data layouts for cell-centred as well as face data structures, and architecture-specific optimisations for improving the performance of gather and scatter operations, which are prevalent in unstructured mesh applications. The implementation of a software prefetching strategy based on auto-tuning is also shown, along with an empirical evaluation of the importance of multithreading for in-order architectures such as Knights Corner. We explore the various memory modes available on the Intel Xeon Phi Knights Landing architecture and present an approach whereby both the traditional DRAM and the MCDRAM interfaces are exploited for maximum performance. We obtain significant full application speed-ups of between 2.8X and 3X across the multicore CPUs in two-socket node configurations, 8.6X on the Intel Xeon Phi Knights Corner coprocessor and 5.6X on the Intel Xeon Phi Knights Landing processor in an unstructured finite volume CFD code representative in size and complexity of an industrial application.
Program summary
Program Title: some_opt_for_unstructured_cfd
Program Files doi: http://dx.doi.org/10.17632/zyh2zkf3jw.1
Licensing provisions: GNU General Public License 3 (GPL)
Programming language: C/C++
Nature of problem: The solution of fluid flow problems in the vicinity of complex geometries mandates the utilisation of unstructured grids. However, this flexibility of unstructured mesh methods in dealing with complicated geometries comes at a cost of increased difficulty in extracting high performance out of modern processors. We provide implementations for a number of optimisations useful for improving the performance of unstructured CFD codes on modern multicore and manycore architectures.
Solution method: grid renumbering via Reverse Cuthill–McKee, code transformations necessary for enabling vectorisation, face colouring/reordering for removing dependencies at the face end-points when accumulating residuals, data layout transformations for reducing cache misses, hand-tuned gather and scatter primitives for in-register transpositions, software prefetching via auto-tuning and multithreading for exploiting SMT features of modern processors.
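
The face colouring step named in the solution method can be illustrated with a short sketch. This is not the authors' implementation: it is a minimal greedy colouring in C++, assuming a hypothetical `Face` record holding the indices of the two cells a face connects, and an assumed sign convention in which the flux is subtracted from the left cell and added to the right cell. Faces that receive the same colour share no cell, so the residual updates within one colour do not depend on each other.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical face record (an assumption for illustration):
// each interior face connects a left cell and a right cell.
struct Face {
    int left;
    int right;
};

// Greedy colouring: give each face the smallest colour not already used
// by another face touching either of its two cells.
std::vector<int> colour_faces(const std::vector<Face>& faces, int num_cells) {
    // used[c][k] is non-zero if some face adjacent to cell c has colour k.
    std::vector<std::vector<char>> used(num_cells);
    std::vector<int> colour(faces.size(), -1);
    for (std::size_t f = 0; f < faces.size(); ++f) {
        std::vector<char>& ul = used[faces[f].left];
        std::vector<char>& ur = used[faces[f].right];
        int c = 0;
        for (;; ++c) {
            const bool taken_l = c < static_cast<int>(ul.size()) && ul[c];
            const bool taken_r = c < static_cast<int>(ur.size()) && ur[c];
            if (!taken_l && !taken_r) break;
        }
        if (static_cast<int>(ul.size()) <= c) ul.resize(c + 1, 0);
        if (static_cast<int>(ur.size()) <= c) ur.resize(c + 1, 0);
        ul[c] = 1;
        ur[c] = 1;
        colour[f] = c;
    }
    return colour;
}

// Accumulate face fluxes into cell residuals one colour at a time.
// Within a colour no two faces write to the same cell, so the inner loop
// carries no dependency through `residual`.
void accumulate_residuals(const std::vector<Face>& faces,
                          const std::vector<int>& colour,
                          const std::vector<double>& flux,
                          std::vector<double>& residual) {
    int num_colours = 0;
    for (int c : colour) num_colours = std::max(num_colours, c + 1);

    // Group face indices by colour.
    std::vector<std::vector<std::size_t>> groups(num_colours);
    for (std::size_t f = 0; f < faces.size(); ++f) {
        groups[colour[f]].push_back(f);
    }

    for (const auto& group : groups) {
        for (std::size_t f : group) {
            residual[faces[f].left] -= flux[f];   // assumed sign convention
            residual[faces[f].right] += flux[f];
        }
    }
}
```

For a one-dimensional chain of four cells with faces (0,1), (1,2) and (2,3), the sketch assigns colours 0, 1, 0: the first and last faces touch disjoint cells and can be processed together. A production code would more likely reorder the face arrays so that each colour is contiguous in memory, rather than keep per-colour index lists as done here.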

The macro–micro-coupling tool (MaMiCo) was developed to ease the development of and modularize molecular-continuum simulations, retaining sequential and parallel performance. We demonstrate the functionality and performance of MaMiCo by coupling the spatially adaptive Lattice Boltzmann framework waLBerla with four molecular dynamics (MD) codes: the light-weight Lennard-Jones-based implementation SimpleMD, the node-level optimized software ls1 mardyn, and the community codes ESPResSo and LAMMPS. We detail interface implementations to connect each solver with MaMiCo. The coupling for each waLBerla-MD setup is validated in three-dimensional channel flow simulations which are solved by means of a state-based coupling method. We provide sequential and strong scaling measurements for the four molecular-continuum simulations. The overhead of MaMiCo is found to come at 10%–20% of the total (MD) runtime. The measurements further show that scalability of the hybrid simulations is reached on up to 500 Intel Sandy Bridge, and more than 1000 AMD Bulldozer compute cores.
Program summary
Program title: MaMiCo
Catalogue identifier: AEYW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEYW_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: BSD License
No. of lines in distributed program, including test data, etc.: 67905
No. of bytes in distributed program, including test data, etc.: 1757334
Distribution format: tar.gz
Programming language: C, C++II.
Computer: Standard PCs, compute clusters.
Operating system: Unix/Linux.
RAM: Test cases consume ca. 30–50 MB
Classification: 7.7.
External routines: Scons (http://www.scons.org), ESPResSo, LAMMPS, ls1 mardyn, waLBerla
Nature of problem: Coupled molecular-continuum simulation for multi-resolution fluid dynamics: parts of the domain are resolved by molecular dynamics whereas large parts are covered by a CFD solver, e.g. a lattice Boltzmann automaton
Solution method: We couple existing MD and CFD solvers via MaMiCo (macro–micro coupling tool). Data exchange and coupling algorithmics are abstracted and incorporated in MaMiCo. Once an algorithm is set up in MaMiCo, it can be used and extended, even if other solvers are used (as soon as the respective interfaces are implemented).
Restrictions: Currently, only single-centered Lennard-Jones systems are supported.
Running time: Runtime depends on the underlying coupled problem and may range from minutes to days. The provided test cases for all different solver couplings (incl. one complete coupling cycle of avg. domain size) take ca. 10 h on a regular Desktop.

Articles published on Intel Sandy Bridge

Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures

Accelerating binary biclustering on platforms with CUDA-enabled GPUs

Performance Optimization and Comparison of the Alternating Direction Implicit CFD Solver on Multi‐core and Many‐Core Architectures

NanoStreams: A Microserver Architecture for Real-Time Analytics on Fast Data Streams

Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts

ALEA

A Software Cache Partitioning System for Hash-Based Caches

Cache Line Aware Algorithm Design for Cache-Coherent Architectures

Optimization of atmospheric transport models on HPC platforms

Algorithm 967

Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

Evaluating Kernels on Xeon Phi to accelerate Gysela application

MaMiCo: Software design for parallel molecular-continuum flow simulations

An analytical methodology to derive power models based on hardware and software metrics

Performance Assessment of InfiniBand HPC Cloud Instances on Intel Haswell and Intel Sandy Bridge Architectures

LogCA: A Performance Model for Hardware Accelerators

Data mining on vast data sets as a cluster system benchmark

Chip‐level and multi‐node analysis of energy‐optimized lattice Boltzmann CFD simulations

IConn

RETRACTED: Batched matrix computations on hardware accelerators based on GPUs

