Abstract

The primary objective of this project is to perform general research into queuing network models of the performance of high-end computing systems. A related objective is to investigate and predict how an increase in the number of nodes of a supercomputer will decrease the running time of a user's software package, which is often referred to as the strong scaling problem. We investigate the large, MPI-based Linux cluster MCR at LLNL, running the well-known NAS Parallel Benchmark (NPB) applications. Data is collected directly from NPB and also from the low-overhead LLNL profiling tool mpiP. For each run, we break the wall clock execution time of the benchmark into four components: switch delay, MPI contention time, MPI service time, and non-MPI computation time. Switch delay is estimated from message statistics. MPI service time and non-MPI computation time are calculated directly from measurement data. MPI contention is estimated by means of a queuing network model (QNM), based in part on MPI service time. The execution time predicted by this model validates reasonably well against the measured execution time, usually within 10%. Since the number of nodes used to run the application is a major input to the model, we can use the model to predict application execution times for various numbers of nodes. We also investigate how the four components of execution time scale individually as the number of nodes increases. Switch delay and MPI service time scale regularly. MPI contention, as estimated by the QNM submodel, also follows a fairly regular pattern. However, non-MPI compute time has a somewhat irregular pattern, possibly due to caching effects in the memory hierarchy. In contrast to some other performance modeling methods, this method is relatively fast to set up, fast to calculate, simple for data collection, and yet accurate enough to be quite useful.
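
The four-component decomposition can be illustrated with a minimal Python sketch. The abstract does not specify the structure of the QNM, so the contention estimate below uses a textbook stand-in rather than the authors' model: exact Mean Value Analysis of a closed network with one delay station (non-MPI compute between MPI calls) and one shared queueing station (MPI service). The function names, the per-call parameters, and the single-station topology are all assumptions made for illustration.

```python
def execution_time(switch_delay, mpi_contention, mpi_service, non_mpi_compute):
    """Wall clock time as the sum of the four components named in the abstract."""
    return switch_delay + mpi_contention + mpi_service + non_mpi_compute


def estimate_contention(n_nodes, compute_per_call, service_per_call, calls):
    """Illustrative contention estimate (NOT the report's QNM).

    Assumes a closed network with n_nodes customers, a delay station with
    mean time compute_per_call, and one queueing station with mean service
    time service_per_call. Exact Mean Value Analysis gives the response
    time at the queueing station; the waiting part is counted as contention.
    """
    queue_len = 0.0
    response = service_per_call
    for n in range(1, n_nodes + 1):
        # An arriving customer sees the queue left by a network of n - 1 customers.
        response = service_per_call * (1.0 + queue_len)
        throughput = n / (compute_per_call + response)
        queue_len = throughput * response
    wait_per_call = response - service_per_call
    return calls * wait_per_call


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    calls = 10
    contention = estimate_contention(2, 1.0, 1.0, calls)
    total = execution_time(0.5, contention, calls * 1.0, calls * 1.0)
    print(contention, total)
```

In this toy setting a single node sees no contention, and the estimate grows as nodes are added, which is the kind of dependence on node count that lets such a model be used for strong scaling predictions.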

