Resilient parallel computing on volunteer PC grids

Jaspal Subhlok, Hien Nguyen, Mohammad Tanvir Rahman, Edgar Gabriel

doi:10.1002/cpe.4478

https://doi.org/10.1002/cpe.4478

Journal: Concurrency and Computation: Practice and Experience
Publication Date: Apr 24, 2018
Citations: 1
License type: publisher-specific, author manuscript

Affiliation: University of Houston

Abstract

Volunteer PC hosts represent massive computation capacity at a low cost but are challenging to employ for general parallel computing. This paper presents the design, execution model, implementation, and evaluation of the Volpex framework for robust execution of parallel codes on volunteer PC grids characterized by system and network heterogeneity, varying availability, and frequent failures. The communication model is based on one-sided Put/Get calls to an abstract global shared space enhanced to support multiple autonomous instances of the same process at different stages of execution. Our approach customizes and combines the use of replication, checkpointing, and host selection. This presents formidable challenges that are addressed in this work: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and application-specific host selection. The integrated runtime system is shown to effectively execute moderate-size, coarse-grain, communicating codes on a worldwide distributed volunteer environment, a new milestone in volunteer computing. Extensive evaluation is conducted with example scientific codes on a pool of around 600 volunteer hosts. The results demonstrate the trade-offs in deploying checkpointing, redundancy, and host selection, and how these methods combine to provide application performance that is close to the ideal failure-free performance.
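
To make the communication model concrete, the following is a minimal sketch of how a Put/Get shared space might tolerate several replicas of the same process running at different speeds. It is not taken from the paper: the class `SharedSpace`, its method signatures, and the use of per-process call sequence numbers are all assumptions made for illustration. The idea shown is that a Put repeated by a slower replica is ignored, and a Get repeated by a slower replica returns the value recorded when the fastest replica first issued it.

```python
class SharedSpace:
    """Toy in-memory tag -> value store (hypothetical; illustration only)."""

    def __init__(self):
        self._values = {}     # tag -> most recent value
        self._last_put = {}   # logical process id -> highest Put sequence number applied
        self._get_log = {}    # (process id, Get sequence number) -> value first returned

    def put(self, proc_id, seq, tag, value):
        """Apply the seq-th Put of logical process proc_id.

        Returns False when a faster replica has already performed this Put.
        """
        if seq <= self._last_put.get(proc_id, 0):
            return False
        self._last_put[proc_id] = seq
        self._values[tag] = value
        return True

    def get(self, proc_id, seq, tag):
        """Return the value for the seq-th Get of logical process proc_id.

        The first replica to issue the call fixes the result; later replicas
        receive the logged value even if the tag has since been overwritten.
        Raises KeyError if the tag was never written (a real system would
        more likely block until the value arrives).
        """
        key = (proc_id, seq)
        if key not in self._get_log:
            self._get_log[key] = self._values[tag]
        return self._get_log[key]


# Usage: process 0 writes, two replicas of process 1 read at different times.
space = SharedSpace()
space.put(0, 1, "x", 10)            # fast replica of process 0
first = space.get(1, 1, "x")        # fast replica of process 1
space.put(0, 2, "x", 20)            # process 0 moves on and overwrites x
repeat = space.put(0, 1, "x", 10)   # slow replica of process 0 repeats its first Put
second = space.get(1, 1, "x")       # slow replica of process 1 repeats its first Get
```

Under these assumptions both replicas of process 1 observe the same value for their first Get, so replicas stay consistent with one another without synchronizing directly.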
