Abstract

Recent studies have investigated the performance of Parallel Discrete Event Simulation (PDES) on Intel Xeon Phi many-core processors, but have generally reported underwhelming results, especially at high scale, when all cores and thread contexts are fully loaded. While the lack of scalability in an earlier study on a Knights Corner (KC) processor is an artifact of physical limitations of the KC system, the performance challenges on a Knights Landing (KNL) system stem partially from the slower global virtual time (GVT) computation algorithm used in that study. In this paper, we re-examine PDES performance on KNL under more efficient GVT algorithms to alleviate the GVT bottleneck. Specifically, we compare a synchronous GVT algorithm based on barrier synchronization with two asynchronous GVT implementations: a modified Mattern's algorithm for shared-memory systems and a recently proposed wait-free algorithm. Using the ROSS simulator, we demonstrate that minimizing the GVT bottleneck significantly improves scalability, allowing simulation performance to scale all the way to 250 threads (per chip). Interestingly, we observe that while the wait-free algorithm is a clear winner for balanced models, barrier-based GVT provides significantly better results for imbalanced models executed at high scale. We also perform detailed simulation profiling to understand the underlying reasons for these performance trends.
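
To make the synchronous variant concrete, the following is a minimal sketch of a barrier-based GVT reduction in Python. It is not the ROSS implementation or the algorithm evaluated in the paper. The function name `barrier_gvt` and its input layout are hypothetical, and the sketch assumes that no messages are in flight at the barrier, that is, that each thread's reported timestamp already covers its unprocessed events and any messages it has sent.

```python
import threading


def barrier_gvt(local_times_per_round):
    """Illustrative barrier-based GVT (hypothetical helper, not ROSS code).

    local_times_per_round[t][r] is the local virtual time that thread t
    reports in GVT round r. Assumption: no messages are in flight when the
    threads reach the barrier, so GVT is simply the minimum reported time.
    """
    n_threads = len(local_times_per_round)
    n_rounds = len(local_times_per_round[0])
    slots = [0.0] * n_threads   # one shared slot per thread
    gvt_history = []            # GVT value computed in each round

    def reduce_min():
        # Runs in exactly one thread, after all threads have arrived
        # and before any of them is released from the barrier.
        gvt_history.append(min(slots))

    barrier = threading.Barrier(n_threads, action=reduce_min)

    def worker(tid):
        for r in range(n_rounds):
            # Publish this thread's local virtual time for the round.
            slots[tid] = local_times_per_round[tid][r]
            # Block until every thread has published; the barrier action
            # then computes the new GVT for everyone.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return gvt_history


if __name__ == "__main__":
    print(barrier_gvt([[5.0, 9.0], [3.0, 12.0], [7.0, 8.0]]))
```

Every thread stalls at the barrier until the slowest one arrives. The asynchronous alternatives named in the abstract are designed to avoid that stall.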
