Abstract

The advent of cloud computing has made computing infrastructure accessible to millions of users who face resource constraints. In the context of high performance computing (HPC), public cloud resources have emerged as a cost‐effective alternative to expensive on‐premises clusters. However, adopting this approach involves several challenges and limitations. This paper proposes HPC@Cloud, a provider‐agnostic open‐source software toolkit that facilitates the migration, testing, and execution of HPC applications in public clouds. The toolkit takes advantage of various fault tolerance technologies to enable the use of inexpensive transient cloud infrastructure, commonly known as “spot” instances. It also integrates with Singularity containers, allowing users to run complex applications on virtual HPC clusters in a portable and reproducible way. Finally, it provides a data‐based empirical approach to estimating cloud infrastructure costs for HPC workloads. The results obtained on two public cloud providers (AWS and Vultr) show that: (i) HPC@Cloud can efficiently build virtual HPC clusters on the cloud; (ii) the new adaptive fault tolerance strategy outperforms existing strategies based on blocking restoration; (iii) the integration of Singularity containers into HPC@Cloud improves the portability of HPC applications to public clouds with a negligible performance penalty; (iv) the proposed cost prediction approach can estimate the cost of running the applications on AWS and Vultr with up to 93% accuracy on average.