Abstract

Distributed parallel computing platforms contribute for a large part to some of the most powerful computers. Such architectures are typically based on accelerators (General Purpose computing on Graphics Processing Units, Many Integrated Cores e.g. Xeon Phi co-processors) and/or a large number of interconnected computing nodes. Obviously, they raise new challenges, typically in terms of scalability, robustness, adaptability and security. At the advent of the quest for Ultra scale Computing Systems, this paper addresses the issue of fault tolerance toward Byzantine failures over's such platforms. Indeed, the inherent unpredictable nature of these errors render their detection, not speaking of their correction, hard or even impossible to perform at large-scale. At this level, Algorithm-Based Fault Tolerance (ABFT) techniques where the fault tolerance scheme is tailored to the algorithm performed, seems the most promising approaches to deal with such failures. In this context, Evolutionary Algorithms (EAs), especially panmictic global parallel EAs, exhibit a remarkable resilience against byzantine failures modeled as cheating faults as demonstrated either empirically or theoretically in previous studies. In this paper, we extend this analysis to the case of distributed EAs based on the cellular model leading to distributed Cellular Evolutionary Algorithms (dCEAs). Our empirical study over a set or reference optimization problem confirm the ABFT nature of dCEAs. To our knowledge, this is the first study of dCEAs under the perspective of cheating issues and crash faults in a domain of distributed computations, thus opening new insights and perspectives for the design of competitive ultra-scale system based on evolutionary programming models.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call