Abstract
As supercomputer performance grows, reliability becomes an increasingly serious problem. Checkpoint/Restart (C/R) methods are widely used to complete an application with low fault-recovery overhead. The mainstream C/R methods either assume the fail-stop fault model or require the system (or program) to perform error detection before storing a checkpoint, so they can guarantee the correctness of every checkpoint. However, the faults that occur in real-world systems are better described by the Byzantine fault model, and, in pursuit of higher practical performance, neither the system nor the program implements any fault detection mechanism. Consequently, checkpoints may contain errors. This paper studies the checkpoint selection problem under the Byzantine fault model: which checkpoint should be selected as the rollback target after a system failure. We design a checkpoint selection framework and, based on it, propose three checkpoint selection strategies: a conservative strategy, an aggressive strategy, and a statistical strategy. The simulation results show that the conservative strategy is superior when the error latent period is long, while the aggressive strategy shows the opposite behavior. The statistical strategy has stable efficiency, incurring only 50% more overhead than ideal checkpoint selection when the checkpoint period is half the mean time between faults.
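The abstract names the selection strategies without defining them. As a rough illustration of the checkpoint selection problem only, the sketch below assumes one plausible reading: an aggressive selector tries the newest checkpoint first and steps back one checkpoint per failed retry, while a conservative selector skips every checkpoint taken within an assumed bound on the error latent period before the failure was detected. The function names, the `latency_bound` parameter, and both definitions are assumptions made for illustration, not the paper's own.

```python
from typing import List, Optional


def select_aggressive(checkpoints: List[float], failed_retries: int) -> Optional[float]:
    """Assumed 'aggressive' rule: newest checkpoint first, one step back per failed retry.

    `checkpoints` holds checkpoint timestamps in ascending order.
    Returns None when no checkpoint is left (restart from the beginning).
    """
    idx = len(checkpoints) - 1 - failed_retries
    return checkpoints[idx] if idx >= 0 else None


def select_conservative(
    checkpoints: List[float], detect_time: float, latency_bound: float
) -> Optional[float]:
    """Assumed 'conservative' rule: newest checkpoint older than the latency window.

    Any checkpoint taken after `detect_time - latency_bound` might already
    contain the latent error, so it is skipped.
    Returns None when no checkpoint is old enough.
    """
    cutoff = detect_time - latency_bound
    safe = [t for t in checkpoints if t <= cutoff]
    return max(safe) if safe else None


if __name__ == "__main__":
    cps = [10.0, 20.0, 30.0, 40.0]
    print(select_aggressive(cps, failed_retries=0))
    print(select_conservative(cps, detect_time=45.0, latency_bound=20.0))
```

Under these assumptions the trade-off in the abstract is visible: the aggressive rule loses little work when the error is recent but may need several retries when the error has been latent for a long time, while the conservative rule discards more work up front in exchange for a cleaner rollback target.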