Abstract
In the Sprite environment, tolerating faults means recovering from them quickly. Our position is that performance and availability are the desired features of the typical locally-distributed office/engineering environment, and that very fast server recovery is the most cost-effective way of providing such availability. Mechanisms used for reliability are often inappropriate in systems with the primary goal of performance, and some availability-oriented methods using replicated hardware or processes cost too much for these systems. In contrast, availability via fast recovery need not slow down a system, and our experience in Sprite shows that in some cases the same techniques that provide high performance also provide fast recovery. In our first attempt to reduce server recovery times to less than a minute, we take advantage of the distributed state already present in our file system, and a high-performance log-structured file system currently under implementation.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.