Abstract

Erasure-code-based distributed storage systems are increasingly used by storage providers for big-data storage, since they offer the same reliability as replication with a significant decrease in the amount of storage required. However, in a storage system with data nodes spread across a very large geographical area, the code's recovery performance is affected by several factors, both network- and computation-related. In this paper, we propose an XOR-based code, supplemented with parity duplication and rack awareness, that can be adopted in such storage clusters to improve recovery performance during node failures. We implemented these techniques in the erasure-code module of the XORBAS version of the Hadoop Distributed File System (HDFS) and evaluated them on a geo-diverse cluster on the NeCTAR research cloud. The experimental results show that the techniques reduce the data read for repair by 85% and the repair duration by 57% during node failures, at the cost of a 21% increase in storage compared to the traditional Reed-Solomon codes used in HDFS. Together, these ideas offer a better solution for a code-based storage system spanning a wide geographical area, where storage constraints make triple replication unaffordable while recovery time must be kept minimal.
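The abstract does not give the paper's actual code construction or parameters, but the core idea of XOR-based repair can be illustrated with a minimal sketch: with k data blocks and a single XOR parity block, any one lost block in the stripe can be rebuilt by XOR-ing the surviving blocks. All function names below are illustrative, not from the paper.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(data_blocks):
    """Append a single XOR parity block to the k data blocks."""
    return data_blocks + [xor_blocks(data_blocks)]

def repair(stripe, lost_index):
    """Rebuild the block at lost_index from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# Example: a stripe of 3 data blocks plus one parity block.
stripe = encode([b"ab", b"cd", b"ef"])
assert repair(stripe, 1) == b"cd"          # rebuild a lost data block
assert repair(stripe, 3) == stripe[3]      # rebuild the lost parity block
```

This sketch only handles a single failure per stripe; the paper's techniques (parity duplication, rack awareness) concern where such parity blocks are placed and copied across a geo-distributed cluster so that repairs read less remote data.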
