Abstract

Lattice Quantum Chromodynamics (QCD) models subatomic interactions on a four-dimensional discretized space–time continuum. The Lattice QCD computation is one of the grand challenges in physics, especially when modeling a lattice with small spacing. In this work, we study the implementation on the Cell Broadband Engine of the main Lattice QCD kernel routine, which dominates the execution time. We tackle two problems: efficient SIMD execution, and the limited bandwidth for data transfers to and from off-chip memory. For efficient SIMD execution, we present a runtime data fusion technique that groups, at runtime, data that are processed similarly. We also introduce the analysis needed to reduce the pressure on the scarce memory bandwidth that limits the performance of this computation. We studied two implementations of the main kernel routine that exhibit different memory access patterns and thus allow different sets of optimizations. We show the attributes that make one implementation more favorable in terms of performance. For a lattice size that is significantly larger than the local store, our implementation achieves 31.2 GFlops for single-precision computations and 16.6 GFlops for double-precision computations on the PowerXCell 8i, an order of magnitude better than the performance achieved on most general-purpose processors.

Highlights

  • Simulating Lattice Quantum Chromodynamics (QCD) aims at understanding the strong interactions that bind sub-nuclear matter together to form stable nuclear matter [19]

  • We introduce an implementation of the main kernel routine for simulating Lattice QCD

  • We investigated the tradeoffs affecting the efficiency of these implementations on the Cell Broadband Engine (BE), both for code SIMDization and for managing direct memory transfers


Summary

Introduction

Simulating Lattice Quantum Chromodynamics (Lattice QCD) aims at understanding the strong interactions that bind sub-nuclear matter (quarks and gluons) together to form stable nuclear matter (hadrons) [19]. An efficient implementation of the main kernel routine, which computes the actions of the Wilson–Dirac operator, is of critical importance for Lattice QCD simulation [4,6,19]. We introduce an implementation of this main kernel routine. In this implementation, we try to answer two main questions: first, how to SIMDize the computation efficiently; second, how to distribute the lattice data and manage memory efficiently.
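To make the SIMDization question more concrete, the general idea of grouping data that are processed similarly at runtime can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation described in the paper: the function name `fuse_by_kind`, the `(kind, payload)` work-item layout, and the use of an integer `kind` tag to mark which processing path an item takes are all hypothetical. The batch width of 4 assumes four single-precision values per 128-bit SIMD register.

```python
from collections import defaultdict

# Assumption: a 128-bit SIMD register holds four single-precision values.
SIMD_WIDTH = 4


def fuse_by_kind(work_items, width=SIMD_WIDTH):
    """Group (kind, payload) items into batches of at most `width` items.

    Every batch contains items of a single kind, so one instruction
    sequence could, in principle, process the whole batch in SIMD lanes.
    `kind` is a hypothetical tag for "how this item is processed".
    """
    # Bucket payloads by their processing kind, preserving arrival order.
    buckets = defaultdict(list)
    for kind, payload in work_items:
        buckets[kind].append(payload)

    # Cut each bucket into SIMD-width batches.
    batches = []
    for kind in sorted(buckets):
        payloads = buckets[kind]
        for start in range(0, len(payloads), width):
            batches.append((kind, payloads[start:start + width]))
    return batches


# Ten interleaved items cycling through three hypothetical kinds.
items = [(i % 3, float(i)) for i in range(10)]
batches = fuse_by_kind(items)
```

In this sketch, items that arrive interleaved are regrouped so that each batch is homogeneous; a batch shorter than the SIMD width would need padding or scalar handling in a real kernel.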

Cell Broadband Engine and its software development environment
Lattice QCD main kernel routine
Computation models for the Wilson–Dirac kernel routine
SIMDizing the main kernel computations on the Cell Broadband Engine
Runtime data fusion
Lattice QCD memory management
Contiguity analysis of the data space
Performance with DMA
Performance scaling of the introduced implementation
SPEs utilization
Scaling of the proposed scheme on a large scale system
Findings
Conclusions