High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Mikhail Smelyanskiy, Jee Choi, Pradeep Dubey, Bálint Joó, Michael A Clark, Karthikeyan Vaidyanathan, Jatin Chhugani

doi:10.1145/2063384.2063477

Abstract

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, performance drops to 50 Gflops. Our performance is 2–3X higher than that of a well-known implementation from the Chroma software suite running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce LQCD external memory bandwidth requirements, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32³ × 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver built on the Wilson-Dslash operator achieves 2.95 Tflops.
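The cluster-level figures quoted in the abstract can be broken down per node with simple arithmetic. The sketch below is only a back-of-envelope illustration, not taken from the paper: it assumes the lattice is read as 32 × 32 × 32 × 256 sites, treats "over 4 Tflops" as a lower bound of exactly 4.0, and takes 1 Tflops as 1000 Gflops. All variable and function names are hypothetical.

```python
# Back-of-envelope breakdown of the strong-scaling numbers in the abstract.
# Assumptions (not from the paper): lattice read as 32^3 x 256 sites,
# "over 4 Tflops" used as a 4.0 lower bound, 1 Tflops = 1000 Gflops.

NODES = 128
CORES = 1536
LATTICE_DIMS = (32, 32, 32, 256)  # assumed reading of the lattice size
DSLASH_TFLOPS = 4.0               # lower bound quoted for Wilson-Dslash
CG_TFLOPS = 2.95                  # quoted for the Conjugate Gradients run


def lattice_sites(dims):
    """Return the total number of lattice sites (product of the extents)."""
    total = 1
    for extent in dims:
        total *= extent
    return total


sites = lattice_sites(LATTICE_DIMS)
cores_per_node = CORES // NODES
sites_per_node = sites // NODES
dslash_gflops_per_node = DSLASH_TFLOPS * 1000 / NODES
cg_gflops_per_node = CG_TFLOPS * 1000 / NODES
cg_fraction_of_dslash = CG_TFLOPS / DSLASH_TFLOPS

if __name__ == "__main__":
    print(f"total sites:            {sites}")
    print(f"cores per node:         {cores_per_node}")
    print(f"sites per node:         {sites_per_node}")
    print(f"Dslash Gflops per node: {dslash_gflops_per_node:.2f} (lower bound)")
    print(f"CG Gflops per node:     {cg_gflops_per_node:.2f}")
    print(f"CG / Dslash ratio:      {cg_fraction_of_dslash:.4f}")
```

Under these assumptions each node holds 65,536 lattice sites at 128 nodes, which gives a sense of how small the per-node working set becomes under strong scaling.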
