A virtual memory translation mechanism to support checkpoint and rollback recovery

Nicholas S Bowen, Dhiraj K Pradhan

doi:10.1145/125826.126719

Abstract

Checkpoint and rollback recovery is a technique that allows a system to tolerate a failure by periodically saving the entire state and, if an error occurs, rolling back to the prior checkpoint. This technique is particularly suited to applications with long execution times such as those typically found in supercomputer environments. This paper presents a technique that embeds the support for checkpoint and rollback recovery directly into the virtual memory translation hardware. The scheme is general enough to be implemented on various scopes of data such as a portion of an address space, a single address space or multiple address spaces. A basic model is developed which measures the amount of work required by the scheme as a function of the checkpoint interval size. Using this model the degree to which the overhead decreases as the interval size increases is shown.

Full Text