Abstract
Strong superlinear speedup has been observed in large-scale parallel 3D discrete element method (DEM) simulations of complex-shaped particles. The code is based on a spatial domain decomposition algorithm and exhibits “high-CPU-low-memory” characteristics. Interpreting this phenomenon requires a careful examination of speedup theory and practice in parallel computing. The superlinear speedup is investigated from three perspectives: (i) memory footprint per process, (ii) miss rates of the L1, L2 and L3 caches, and (iii) uniprocessor performance, using a wide range of problem sizes (spanning five orders of magnitude in the number of particles) and numbers of compute nodes (1–2048 nodes) on DoD supercomputers. The Performance API (PAPI) is called from the source code to measure cache miss rates and FLOPS. The strong scaling measurements show that the cache miss rate is sensitive to the shrinking memory consumption per processor, and that the last level cache (LLC) contributes most to the strong superlinear speedup among the three cache levels; the weak scaling measurements show the same. The findings are associated with the inherently perfect scalability of 3D DEM: its memory scalability function is a nonlinearly decreasing function of the number of processors. In addition, a constant (non-increasing) uniprocessor FLOPS performance with respect to problem size can also contribute to the superlinear speedup. Superlinear speedup is a common phenomenon in large-scale 3D DEM simulations of complex-shaped particles, and the larger the scale, the stronger the superlinear speedup. DEM researchers should take advantage of this effect to speed up their parallel simulations.
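As a minimal illustration of the terminology, strong-scaling speedup on p processes is commonly defined as S(p) = T(1)/T(p), and it is called superlinear when S(p) > p, that is, when parallel efficiency S(p)/p exceeds 1. The Python sketch below computes these quantities from wall-clock timings. The function name `strong_scaling` and the timing values are hypothetical and are not taken from the measurements reported here.

```python
def strong_scaling(timings):
    """Compute speedup and efficiency from {process_count: wall_time}.

    The run with the fewest processes is used as the baseline, so the
    speedup is relative to that run (assumption: fixed total problem size).
    """
    base_p = min(timings)
    base_t = timings[base_p]
    rows = []
    for p in sorted(timings):
        speedup = base_t / timings[p]   # S(p) = T(base) / T(p)
        ideal = p / base_p              # linear (ideal) speedup
        efficiency = speedup / ideal    # > 1 means superlinear
        rows.append((p, speedup, efficiency, speedup > ideal))
    return rows


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    example = {1: 1000.0, 8: 100.0, 64: 10.0}
    for p, s, e, superlinear in strong_scaling(example):
        print(f"p={p:3d}  speedup={s:7.2f}  efficiency={e:5.2f}  superlinear={superlinear}")
```

An efficiency above 1 in such a table only signals the effect; attributing it to cache behavior would still require hardware-counter measurements such as those obtained with PAPI.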