Intel Knights Landing Processor Research Articles

We investigate several state-of-the-practice shared-memory optimization techniques applied to key routines of an unstructured computational aerodynamics application with irregular memory accesses. Using the Intel Knights Landing processor as a representative of the processors in contemporary leading supercomputers, we identify and address performance challenges without compromising the floating-point numerics of the original code. We employ low- and high-level architecture-specific code optimizations involving thread- and data-level parallelism. Our approach is based upon a multi-level hierarchical distribution of work and data across both the threads and the SIMD units within every hardware core. On a 64-core Knights Landing chip, we achieve nearly 2.9x speedup of the dominant routines relative to the baseline. These routines exhibit almost linear strong scalability up to 64 threads, and some further improvement with hyperthreading beyond that. At substantially lower power, we achieve up to 1.7x speedup relative to 72 threads of a 36-core Haswell CPU and roughly equivalent performance to 112 threads of a 56-core Skylake scalable processor. These optimizations are expected to be of value for many other unstructured-mesh PDE-based scientific applications as multi- and many-core architectures evolve.
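The multi-level distribution of work mentioned in this abstract can be pictured with a small sketch. This is not the authors' code: the function name `partition`, the contiguous-chunk policy, and the default batch width of 8 (the number of double-precision values in one 512-bit AVX-512 register) are assumptions made for illustration only.

```python
def partition(n_items, n_threads, simd_width=8):
    """Two-level split (illustrative): items -> per-thread chunks -> SIMD-sized batches.

    Returns one list per thread; each list holds (start, stop) index ranges
    covering at most `simd_width` consecutive items.
    """
    plan = []
    base, extra = divmod(n_items, n_threads)
    start = 0
    for t in range(n_threads):
        # Level 1: contiguous chunk per thread; the first `extra` threads get one more item.
        size = base + (1 if t < extra else 0)
        stop = start + size
        # Level 2: cut the thread's chunk into batches that fit one vector register.
        batches = [(b, min(b + simd_width, stop)) for b in range(start, stop, simd_width)]
        plan.append(batches)
        start = stop
    return plan


plan = partition(n_items=100, n_threads=4)
```

With 100 items and 4 threads, each thread receives 25 items, so its last batch is a partial one. A real unstructured-mesh code would also have to handle irregular memory accesses, which this sketch ignores.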


Traditional scientific and emerging data analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive, and hence future supercomputers will feature different memory technologies side by side. However, programming hybrid-memory systems and identifying the best object-to-memory mapping is a complex task. We envision that programmers will probably resort to default configurations that require only minimal intervention in the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups. We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid-memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
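To make the idea of an object-to-memory mapping concrete, here is a minimal sketch of one possible placement heuristic. It is not the RTHMS algorithm, whose rules the abstract does not spell out: the function `place_objects`, the greedy accesses-per-byte ranking, and the example objects and sizes are all hypothetical.

```python
def place_objects(objects, hbm_capacity):
    """Greedy placement sketch (hypothetical, not RTHMS).

    objects: list of (name, size, accesses) tuples with size > 0.
    Returns a dict mapping each object name to "HBM" or "DRAM".
    """
    placement = {}
    free = hbm_capacity
    # Rank by access density so small, frequently used objects claim HBM first.
    ranked = sorted(objects, key=lambda o: o[2] / o[1], reverse=True)
    for name, size, _accesses in ranked:
        if size <= free:
            placement[name] = "HBM"
            free -= size
        else:
            # Anything that no longer fits falls back to the larger DRAM.
            placement[name] = "DRAM"
    return placement


# Made-up objects: (name, size in arbitrary units, access count).
objects = [("mesh", 6, 600), ("halo", 2, 400), ("log", 8, 8)]
placement = place_objects(objects, hbm_capacity=8)
```

In this example the two densely accessed objects fill the fast memory and the rarely touched one stays in DRAM. A tool like the one described would presumably draw on richer information than a single access count.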


Articles published on Intel Knights Landing Processor

Optimizing Coherence Traffic in Manycore Processors Using Closed-Form Caching/Home Agent Mappings

Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

An implementation of matrix–matrix multiplication on the Intel KNL processor with AVX-512

RTHMS: a tool for data placement on hybrid memory system

