Abstract
Summary form only given. We study whether fault tolerance can be made simpler and more efficient by exploiting the structure of the application. More specifically, we study divide-and-conquer parallelism, which is a popular and effective paradigm for writing parallel Grid applications. We have designed a novel fault tolerance mechanism for divide-and-conquer applications that reduces the amount of redundant computation by storing results of discarded jobs in a global (replicated) table. These results can later be reused, thereby minimizing the amount of work lost as a result of a crash. The execution time overhead of our mechanism is close to zero. Our mechanism can handle crashes of multiple processors or entire clusters at the same time. It can also handle crashes of the root node that initially started the parallel computation. We have incorporated our fault tolerance mechanism in Satin, which is a Java-based divide-and-conquer system. Satin is implemented on top of the Ibis communication library. The core of Ibis is implemented in pure Java, without using any native libraries. The Satin runtime system and our fault tolerance extension are also written entirely in Java. The resulting system is therefore highly portable, allowing the software to run unmodified on a heterogeneous Grid. We evaluated the performance of our fault tolerance scheme on a cluster of the Distributed ASCI Supercomputer 2 (DAS-2). In the first part of our tests, we show that the execution time overhead of our mechanism is close to zero. The results of the second part of our tests show that our algorithm salvages most of the work done by surviving processors. Finally, we carried out tests on the European GridLab testbed. We ran one of our applications on a set of six heterogeneous parallel machines (four different operating systems, four different architectures) located in four different European countries. After manually killing one of the sites, the program recovered and finished normally.
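The idea of reusing stored results after a crash can be illustrated with a minimal sketch. It is not Satin's API or implementation (Satin is written in Java); it is a single-process Python toy in which the names `ResultTable`, `solve`, and `demo`, and the use of a `(lo, hi)` range as the job identifier, are all assumptions made for illustration. The "global (replicated) table" is modeled as a plain dictionary, and the crash is simulated by computing one sub-job separately and then restarting the root job.

```python
from typing import Dict, Optional, Tuple

# Hypothetical job identifier: sum the integers in the half-open range [lo, hi).
Job = Tuple[int, int]


class ResultTable:
    """Stand-in for the global (replicated) table; here just a local dict."""

    def __init__(self) -> None:
        self._results: Dict[Job, int] = {}

    def store(self, job: Job, value: int) -> None:
        self._results[job] = value

    def lookup(self, job: Job) -> Optional[int]:
        return self._results.get(job)


def solve(job: Job, table: ResultTable, stats: Dict[str, int]) -> int:
    """Divide-and-conquer sum that consults the table before doing any work."""
    saved = table.lookup(job)
    if saved is not None:
        stats["reused"] += 1  # the whole subtree below this job is skipped
        return saved

    stats["computed"] += 1
    lo, hi = job
    if hi - lo == 1:
        return lo  # leaf job: a single integer
    mid = (lo + hi) // 2
    return solve((lo, mid), table, stats) + solve((mid, hi), table, stats)


def demo() -> Tuple[int, Dict[str, int]]:
    table = ResultTable()

    # Before the simulated crash: a surviving worker finishes sub-job (0, 4).
    # Its parent is lost, so the result is saved in the table, not discarded.
    salvaged = solve((0, 4), table, {"computed": 0, "reused": 0})
    table.store((0, 4), salvaged)

    # After the simulated crash: the root job (0, 8) is restarted from scratch.
    stats = {"computed": 0, "reused": 0}
    total = solve((0, 8), table, stats)
    return total, stats


if __name__ == "__main__":
    print(demo())
```

In this toy run the restarted root job recomputes only its own node and the right half of the tree; the left half is answered by a single table lookup. A real system would additionally need to replicate the table across processors and detect crashes, which this sketch does not attempt.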