A modern Fortran implementation of three Dirac operators (Wilson, Brillouin, Susskind) in lattice QCD is presented, based on OpenMP shared-memory parallelization and SIMD pragmas. The main idea is to apply a Dirac operator to Nv vectors simultaneously, to ease the memory bandwidth bottleneck. All index computations are left to the compiler and maximum weight is given to portability and flexibility. The lattice volume, NxNyNzNt, the number of colors, Nc, and the number of right-hand sides, Nv, are parameters defined at compile time. Several memory layout options are compared. The code performs well on modern many-core architectures (480 Gflop/s, 880 Gflop/s, and 780 Gflop/s with Nv=12 for the three operators in single precision on a 72-core KNL processor, a 2×24-core Skylake node yields similar results). Explicit run-time tests with CG/BiCGstab inverters confirm that the memory layout is relevant for the KNL, but less so for the Skylake architecture. The ancillary code distribution contains all routines, including the single, double, and mixed precision Krylov space solvers, to render it self-contained and ready-to-use. Program summaryProgram title: testknl_main.f90CPC Library link to program files:https://doi.org/10.17632/7km9wcspn3.1Licensing provisions: Creative Commons by 4.0Programming language: fortran 2008Supplementary material: Detailed description of each module of the code distribution.Nature of problem: Solving the Dirac equation Du=b for three choices of the Dirac operator D relevant to Lattice QCD. The choices are known as Wilson fermions, Brillouin fermions and Susskind or staggered fermions, respectively.Solution method: Standard solution through sparse iterative solvers CG and BiCGstab, albeit with several right-hand vectors packed into a right-hand matrix. The number of colors, Nc, the number of right-hand sides, Nv, and the box lengths, Nx, Ny, Nz and Nt, can be chosen freely.Additional comments including restrictions and unusual features: The code allows the user to compare several memory layout options for the block of right-hand sides, that is several orderings of the slots pertinent to Nc, Nv and possibly the Dirac spinor, in the array representation of u and b. No machine dependent optimizations are used, only standard OpenMP shared-memory parallelization and SIMD pragmas are used.
Read full abstract