Abstract

Driven by increasing computational demands, cluster management systems (e.g., Mesos) are now widely used to deploy applications. Unfortunately, despite much effort, existing systems still struggle to meet the stringent requirements of critical applications (e.g., trading and military applications), which need both high availability and low performance overhead in deployment. Existing systems typically replicate their job controllers so that the controllers remain highly available and can handle application failures. However, the applications themselves are often still a single point of failure, leaving them exposed to arbitrarily long windows of unavailability. This paper proposes the design of Tripod, a cluster management system that automatically provides high availability to general applications. The key to Tripod's efficiency is a new Paxos replication protocol that leverages RDMA (Remote Direct Memory Access). Tripod runs replicas of the same job alongside a set of replicated controllers, and the controllers use this protocol to agree on job requests efficiently. Evaluation shows that Tripod incurs low performance overhead in both throughput and response time compared to an application's unreplicated execution.
