Abstract

Titan was the flagship supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). It was deployed in late 2012, became the fastest supercomputer in the world, and was retired on August 2, 2019. With Titan’s mission complete, this paper provides a first-order examination of the usage of its critical resources (CPU, memory, GPU, and I/O) over a five-year production period (2015–2019). In particular, we show quantitatively that the majority of CPU time was spent on large-scale jobs, which is consistent with the policy of driving ground-breaking science through leadership computing. We also corroborate the general observation of low CPU-memory usage, with 95% of jobs utilizing 15% or less of the available memory. Additionally, we correlate the increase in total job submissions and the decrease in GPU-enabled jobs during 2016 with the GPU reliability issue that impacted large-scale runs. We further show a surprising read/write ratio over the five-year period, which contradicts the common assumption that large-scale simulation machines are “write-heavy”. This understanding may influence how next-generation large-scale storage systems are designed. We believe that our analyses and findings will be of great interest to the high-performance computing (HPC) community at large.
