Improving the efficiency of HPC data movement on container-based virtual cluster

Dan Huang, Yutong Lu

doi:10.1007/s42514-020-00025-w

Abstract

Lightweight virtualization technologies are now widely deployed in data centers and on HPC clusters to provide highly efficient and elastic resource provisioning. Virtualization has also been extended to the I/O stack of the operating system. For example, the virtual switch has become the primary provider of I/O services for data movement among lightweight virtual instances, such as those managed by Docker and Kubernetes. However, I/O stack virtualization introduces performance degradation and scalability bottlenecks into the data movements of HPC computing frameworks, such as MPI-based collective data movements and bursty asynchronous data movements. To study these bottlenecks, we quantify and analyze the performance degradation of HPC data movements on virtual clusters. We then design a set of two-stage methods that proactively adapt the virtual network and the data movement procedures, which improves the performance of HPC collective data movements by up to 3×. In addition, we design a cross-layer middleware to improve the performance and scalability of bursty asynchronous data movements. Our evaluation shows that it improves the performance of a real scientific application by 34.6%.

Full Text