Abstract
Volunteer PC hosts represent massive computation capacity at a low cost but are challenging to employ for general parallel computing. This paper presents the design, execution model, implementation, and evaluation of the Volpex framework for robust execution of parallel codes on volunteer PC grids characterized by system and network heterogeneity, varying availability, and frequent failures. The communication model is based on one-sided Put/Get calls to an abstract global shared space, enhanced to support multiple autonomous instances of the same process at different stages of execution. Our approach customizes and combines replication, checkpointing, and host selection. Doing so presents formidable challenges that are addressed in this work: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and application-specific host selection. The integrated runtime system is shown to effectively execute moderate-size, coarse-grain, communicating codes on a worldwide distributed volunteer environment, a new milestone in volunteer computing. Extensive evaluation is conducted with example scientific codes on a pool of around 600 volunteer hosts. The results demonstrate the trade-offs in deploying checkpointing, redundancy, and host selection, and how these methods combine to provide application performance that is close to the ideal failure-free performance.