Abstract
Scientific workflows are data and compute intensive thus may run for days or even for weeks on parallel and distributed infrastructures such as HPC systems and cloud. In HPC environment the number of failures that can arise during scientific workflow enactment can be high so the use of fault tolerance techniques is unavoidable. The most frequently used fault tolerance techniques are job replication and checkpointing. While job replication is based on the assumption that the probability of single failures is much higher than of simultaneous failures, the checkpointing saves certain states and the execution can be restarted from that point later on. The effectiveness of the checkpointing method depends on the checkpointing interval. Common technique is to dynamically adapt the checkpointing interval. In this work we give a brief overview of the different checkpointing techniques and propose a new provenance based dynamic checkpointing method.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.