Knights Corner Research Articles

Shared memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art many integrated core (MIC) Intel processor Xeon Phi “Knights Corner,” with a focus on strong thread scaling. While the linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory, the flux kernel, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, is compute-intensive and is known to exploit contemporary multi-core hardware effectively. We extend the study of the performance of the flux kernel to the Xeon Phi in three thread affinity modes, namely scatter, compact, and balanced, in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline “out-of-the-box” optimized compilation, code restructuring optimizations provide about 3.8x speedup using the offload mode and about 5x speedup using the native mode. Even with these gains for the flux kernel, with respect to execution time the MIC merely achieves parity with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5 2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architecture evolves.
We explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer to demonstrate that the optimizations employed on the Phi hybridize to this context, where each of thousands of nodes comprises two sockets of Intel Xeon Haswell CPUs, for 32 cores per node.
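The three thread affinity modes named in the abstract above differ in how consecutive thread IDs are mapped to cores. The following is a simplified sketch of that mapping, not the OpenMP runtime's actual placement logic and not code from the article; the function name `place_threads` and its parameters are hypothetical, and the model ignores details such as granularity settings and reserved cores.

```python
def place_threads(n_threads, n_cores, mode, hw_threads_per_core=4):
    """Return a list where entry i is the core assigned to thread i.

    Simplified, illustrative model of the three affinity modes;
    the function name and parameters are assumptions for this sketch.
    """
    if n_threads > n_cores * hw_threads_per_core:
        raise ValueError("more threads than hardware thread contexts")

    if mode == "compact":
        # Fill every hardware thread context of a core before moving on.
        return [i // hw_threads_per_core for i in range(n_threads)]

    if mode == "scatter":
        # Deal threads round-robin across cores; neighbouring thread IDs
        # end up on different cores.
        return [i % n_cores for i in range(n_threads)]

    if mode == "balanced":
        # Spread threads evenly across cores, but keep neighbouring
        # thread IDs together on the same core.
        base, extra = divmod(n_threads, n_cores)
        placement = []
        for core in range(n_cores):
            count = base + (1 if core < extra else 0)
            placement.extend([core] * count)
        return placement

    raise ValueError(f"unknown affinity mode: {mode}")


if __name__ == "__main__":
    for mode in ("compact", "scatter", "balanced"):
        print(mode, place_threads(6, 3, mode))
```

In this model, six threads on three cores leave one core idle under compact, while scatter and balanced both use all three cores and differ only in whether neighbouring thread IDs share a core.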

In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and as such a well established algorithm. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards (Fermi, Kepler, Maxwell), on the Intel Xeon CPU and Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of L1/Texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems. We also present results in terms of bandwidth (GB/s), compute (GFLOP/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real-time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.5× to 1.92× greater than our CPU implementation, but it is insufficient to compete with the performance of GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
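To make the data reuse mentioned in the abstract above concrete, here is a minimal reference sketch of a polyphase filter: a per-channel weighted sum over several blocks of input samples, followed by a Fourier transform of each resulting row. It is a plain-Python illustration under stated assumptions, not the article's GPU, CPU or Xeon Phi code; the names `pfb_fir`, `dft` and `polyphase` are hypothetical, and a naive DFT stands in for the FFT a real implementation would use.

```python
import cmath


def pfb_fir(samples, coeffs, channels, taps):
    """FIR stage: each output spectrum is a weighted sum of `taps`
    consecutive blocks of `channels` samples (hypothetical helper)."""
    n_spectra = len(samples) // channels - taps + 1
    out = []
    for s in range(n_spectra):
        row = []
        for c in range(channels):
            acc = 0.0
            for t in range(taps):
                # Spectrum s reads blocks s .. s+taps-1, so neighbouring
                # spectra share taps-1 blocks: this is the data reuse.
                acc += samples[(s + t) * channels + c] * coeffs[t * channels + c]
            row.append(acc)
        out.append(row)
    return out


def dft(row):
    """Naive O(n^2) DFT, used here only to keep the sketch dependency-free."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def polyphase(samples, coeffs, channels, taps):
    """Full filter: FIR stage followed by a transform of every row."""
    return [dft(row) for row in pfb_fir(samples, coeffs, channels, taps)]


if __name__ == "__main__":
    channels, taps = 4, 2
    samples = list(range(12))          # three blocks of four samples
    coeffs = [1.0] * (channels * taps)  # trivial window, for illustration
    print(pfb_fir(samples, coeffs, channels, taps))
```

With `taps` blocks contributing to every spectrum, each input block is read by up to `taps` consecutive output spectra, which is why keeping blocks in cache or shared memory pays off.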

Articles published on Knights Corner

Performance analysis of the Kahan‐enhanced scalar product on current multi‐core and many‐core processors

Unstructured computational aerodynamics on many integrated core architecture

Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

A polyphase filter for many-core architectures

Knights Landing: Second-Generation Intel Xeon Phi Product

Lessons Learned from Optimizing Science Kernels for Intel's "Knights Corner" Architecture

Realistic Performance Characterization of CFD Applications on Intel Many Integrated Core Architecture

Microarchitectural performance comparison of Intel Knights Corner and Intel Sandy Bridge with CFD applications

Accelerating IDCT Algorithm on Xeon Phi Coprocessor
