Abstract

Failure recovery is one of the most essential problems in Internet of Things (IoT) systems, especially in critical scenarios such as traffic control and healthcare. Meanwhile, driven by the ever-increasing demand for IoT applications and by latency and security considerations, more and more IoT applications are being migrated to large clusters that consist of both cloud and edge servers. However, as the scale of edge-cloud collaborative clusters continues to expand, the risk of system errors and failures also increases. The conventional snapshot/rollback method is a powerful way to address this problem and is widely used in cloud computing scenarios. But when it is transplanted to edge-cloud collaborative clusters, which are distributed and heterogeneous by nature, it introduces serious network interruption and degrades guest performance. To address these problems, we propose Phalanx, a duration-aware cluster snapshot system that can take live snapshots of edge-cloud collaborative clusters with low performance overhead. Phalanx adopts the low-overhead pre-copy model. We first propose a VM snapshot duration prediction method that can accurately predict the snapshot duration of each individual VM. Then, based on the prediction results, we coordinate the snapshot process so that the whole cluster follows a consistency-friendly schedule, thereby solving the network interruption problem and minimizing the adverse performance impact on guest IoT applications. We implement a prototype of Phalanx on the QEMU/KVM platform and conduct several experiments. The experimental results show that Phalanx introduces negligible network interruption while incurring 10.68%-20.9% less performance impact than existing solutions.
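
To make the idea of duration-aware coordination concrete, the following is a minimal sketch and not Phalanx's actual predictor or scheduler, neither of which is specified in this abstract. It assumes a textbook pre-copy model with a constant dirty rate and a constant copy bandwidth, and it assumes that a "consistency-friendly schedule" can be approximated by delaying each VM's snapshot start so that all snapshots finish at about the same time. The function names `precopy_duration` and `aligned_start_offsets` and all parameters are hypothetical.

```python
def precopy_duration(mem_bytes, dirty_rate, bandwidth, stop_threshold, max_rounds=30):
    """Estimate the duration (seconds) of an iterative pre-copy snapshot.

    Assumed model (not taken from the paper): each round copies the data that
    is currently outstanding at `bandwidth` bytes/s, while the guest dirties
    memory at a constant `dirty_rate` bytes/s. Iteration stops once the
    outstanding data is at most `stop_threshold` bytes or `max_rounds` is
    reached; a final copy of the remainder is then added.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    remaining = float(mem_bytes)
    total = 0.0
    for _ in range(max_rounds):
        round_time = remaining / bandwidth
        total += round_time
        # Memory dirtied during this round must be copied in the next one,
        # capped at the total memory size.
        remaining = min(dirty_rate * round_time, float(mem_bytes))
        if remaining <= stop_threshold:
            break
    # Final copy of whatever is still outstanding.
    total += remaining / bandwidth
    return total


def aligned_start_offsets(durations):
    """Return per-VM start delays so that all snapshots end together.

    `durations` maps a VM name to its predicted snapshot duration in seconds.
    The VM with the longest predicted duration starts immediately (offset 0).
    """
    longest = max(durations.values())
    return {vm: longest - d for vm, d in durations.items()}


if __name__ == "__main__":
    # Hypothetical VMs: a small edge VM and a larger, busier cloud VM.
    predicted = {
        "edge-vm": precopy_duration(1000, 0, 100, 10),
        "cloud-vm": precopy_duration(1000, 50, 100, 100),
    }
    for vm, offset in aligned_start_offsets(predicted).items():
        print(f"{vm}: predicted {predicted[vm]:.3f}s, start after {offset:.3f}s")
```

Under these assumptions, VMs with short predicted durations start later, so that no VM finishes its snapshot long before its peers; a real system would have to predict durations from measured dirty rates and bandwidth rather than assume they are constant.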
