Abstract

Fault tolerance is a major issue for parallel applications. Approaches on application-level are gaining increasing attention because they may be more efficient than system-level ones. In this paper, we present a generic reusable framework for fault-tolerant parallelization with the task pool pattern. Users of this framework can focus on coding sequential tasks for their problem, while respecting some framework contracts. The framework is written in Java and deploys the APGAS library as well as Hazelcast's distributed and fault-tolerant IMap. Our fault-tolerance scheme uses two system-wide maps, in which it stores, e.g., backups of local task pools. Framework users may configure the number of backup copies to control how many simultaneous failures are tolerated. The algorithm is correct in the sense that the computed result is the same as in non-failure case, or the program aborts with an error message. In experiments with up to 128 workers, we compared the framework's performance with that of a non-fault-tolerant variant during failure-free operation. For the UTS and BC benchmarks, the overhead was at most 35%. Measured values were similar as for a related, but less flexible fault-tolerant X10 framework, without a clear winner. Raising the number of backup copies to six only marginally improved the overhead.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call