We present a scheme for the parallelization of the quantum Monte Carlo method on graphics processing units, focusing on variational Monte Carlo simulation of bosonic systems. We use asynchronous execution schemes with shared memory persistence and obtain excellent utilization of the accelerator. The CUDA code is provided along with a package that simulates liquid helium-4. The program was benchmarked on several Nvidia GPU models, including the Fermi-generation GTX560 and M2090 and the Kepler-architecture K20. Kepler-specific optimizations, including placement of data structures in the register space of the GPU, were developed and are discussed.

Program Summary

Program title: QL
Catalogue identifier: AEUP_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEUP_v1_0.html
Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 40170
No. of bytes in distributed program, including test data, etc.: 1223080
Distribution format: tar.gz
Programming language: CUDA-C, C, Fortran.
Computer: Any computer with a CUDA-enabled GPU.
Operating system: Linux.
RAM: Typical execution uses as much RAM as is available on the GPU, usually between 1 GB and 12 GB. The minimal requirement is 1 MB.
Classification: 4.12, 7.7.

Nature of problem: The QL package performs variational Monte Carlo calculations for liquid helium-4 with the Aziz II interaction potential and a Jastrow pair-product trial wavefunction. Sampling is performed with a Metropolis scheme applied to single-particle updates. With minimal changes, the package can be applied to other bosonic fluids, given a pairwise interaction potential and a wavefunction in the form of a product of one- and two-body correlation factors.

Solution method: The program is parallelized for execution on Nvidia GPUs. By design, new configurations are generated with shared memory persistence, and asynchronous execution allows the CPU load to be masked.

Restrictions: The code is limited to variational Monte Carlo. Because of the limited shared memory of the GPU, only systems of fewer than 2000 particles can be treated on Fermi-generation cards, and up to 10000 particles on Kepler cards.

Running time: Because of the statistical nature of Monte Carlo calculations, computations may be chained indefinitely to improve statistical accuracy. As an example, using the QL package, the energy of a liquid helium system with 1952 atoms can be computed to within 1 mK per atom in less than 20 min. This corresponds to a relative error of 10⁻⁴. Higher accuracy is unlikely to be needed.