Abstract

In the big data era, more and more enterprises use the Hadoop Distributed File System (HDFS) to manage and store big data for upper-layer applications. However, the default three-replica strategy of HDFS imposes a tremendous storage cost on a data center, as the volume of big data keeps growing, especially the cold data. Moreover, in heterogeneous Hadoop clusters, the rack-aware data placement of HDFS ignores the differences among nodes, so blocks with high reliability requirements may be placed on nodes with poor reliability, and the reliability of the data cannot be guaranteed effectively. To solve these problems, this paper presents a theoretical data placement model and designs a double sort exchange algorithm (DSEC) that guarantees the reliability of cold data while lowering storage cost. Specifically, for cold data protected by erasure coding, the algorithm uses node information to select an initial result set. Then, by double-sorting the result set and the remaining set, elements of the two sets are exchanged until the lowest-cost placement that satisfies the reliability requirement is found. Finally, experiments show that DSEC not only guarantees reliability but also achieves the lowest storage cost compared with other data placement strategies.
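
A minimal sketch of the double-sort-and-exchange idea described above, in Python. This is not the paper's implementation: the Node fields, the (n, k) erasure-code survival model, the sort keys, and the exchange rule are all assumptions made for illustration.

```python
# Illustrative DSEC-style placement sketch (assumed model, not the paper's code).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cost: float        # storage cost of placing one block on this node (assumed)
    fail_prob: float   # probability the node fails within the target period (assumed)


def survival_probability(fail_probs: List[float], max_failures: int) -> float:
    """P(at most max_failures of the selected nodes fail), via a Poisson-binomial DP."""
    dp = [1.0] + [0.0] * len(fail_probs)   # dp[j] = P(exactly j failures so far)
    for p in fail_probs:
        for j in range(len(fail_probs), 0, -1):
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= (1.0 - p)
    return sum(dp[: max_failures + 1])


def dsec_placement(nodes: List[Node], n: int, k: int,
                   target_reliability: float) -> Optional[List[Node]]:
    """Place n erasure-coded blocks (any k of which can rebuild the data) on a
    low-cost node set whose survival probability meets target_reliability."""
    def reliability(selected: List[Node]) -> float:
        # Data survives if at most n - k of the n chosen nodes fail.
        return survival_probability([nd.fail_prob for nd in selected], n - k)

    by_cost = sorted(nodes, key=lambda nd: nd.cost)
    result, remaining = by_cost[:n], by_cost[n:]   # start from the n cheapest nodes

    # Double sort: result set by descending failure probability (weakest first),
    # remaining set by ascending cost (cheapest replacement first).
    result.sort(key=lambda nd: nd.fail_prob, reverse=True)
    remaining.sort(key=lambda nd: nd.cost)

    i = 0
    while reliability(result) < target_reliability and i < len(remaining):
        candidate = remaining[i]
        if candidate.fail_prob < result[0].fail_prob:
            # Exchange the weakest selected node for the cheapest stronger one.
            result[0], remaining[i] = candidate, result[0]
            result.sort(key=lambda nd: nd.fail_prob, reverse=True)
        i += 1

    return result if reliability(result) >= target_reliability else None


if __name__ == "__main__":
    # Hypothetical heterogeneous cluster of 12 DataNodes.
    cluster = [Node(f"dn{i}", cost=1.0 + 0.1 * i, fail_prob=0.02 * (i % 5 + 1))
               for i in range(12)]
    placement = dsec_placement(cluster, n=9, k=6, target_reliability=0.9999)
    print([nd.name for nd in placement] if placement else "no feasible placement")
```

The sketch captures only the general pattern suggested by the abstract: seed the result set with the cheapest nodes, then repeatedly trade its least reliable member for an inexpensive, more reliable node from the remaining set until the reliability requirement is met.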
