Abstract

Enterprises increasingly use the Hadoop Distributed File System (HDFS) to manage and store big data for many applications. However, HDFS relies on triple replication, leading to staggering data center storage costs. As big data grows in volume and its heat levels grow more varied, there comes a point where storing so much cold data makes the data both less accessible and more expensive. Meanwhile, as data centers expand, the heterogeneity of nodes also becomes an issue: the rack-aware data placement adopted by HDFS results in unbalanced load and uneven resource allocation because it ignores the heterogeneity of data nodes. Here, we attempt to resolve these problems by proposing a hotness-aware data placement strategy named HaDaap. HaDaap first applies a hotness-aware data clustering algorithm to determine each datum's degree of heat. Cold data, stored with erasure-coded redundancy, are then placed through a Double Sort Exchange algorithm to reduce storage costs and increase data availability. Finally, hot data are placed via a dynamic replication placement mechanism that comprehensively factors in availability, load, and storage cost. Experimental results show that with these enhancements, HaDaap uses resources rationally and substantially reduces storage costs by accounting for differences in data hotness across heterogeneous Hadoop clusters.
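To make the routing idea concrete, the following is a minimal, hypothetical sketch of the hotness-based decision the abstract describes: blocks classified as cold are stored with erasure-coded redundancy, while hot blocks keep (dynamically sized) replication. The hotness score, threshold, and all class and method names here are illustrative placeholders, not the paper's actual clustering or Double Sort Exchange algorithms.

```java
import java.util.List;

/**
 * Illustrative sketch of hotness-based redundancy routing, assuming a
 * simple frequency/recency score. HaDaap's real clustering and
 * placement algorithms are defined in the full paper, not here.
 */
public class HotnessRouter {

    enum Redundancy { ERASURE_CODED, REPLICATED }

    record Block(String id, double accessesPerDay, double daysSinceLastAccess) {}

    // Toy hotness score: frequent, recent access => hotter.
    static double hotness(Block b) {
        return b.accessesPerDay() / (1.0 + b.daysSinceLastAccess());
    }

    // Cold blocks go to erasure coding (e.g., RS(6,3) in HDFS terms);
    // hot blocks remain replicated for availability and load balance.
    static Redundancy route(Block b, double coldThreshold) {
        return hotness(b) < coldThreshold ? Redundancy.ERASURE_CODED
                                          : Redundancy.REPLICATED;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(
                new Block("blk-001", 120.0, 0.5),  // hot: frequent, recent
                new Block("blk-002", 0.2, 90.0));  // cold: rare, stale
        for (Block b : blocks) {
            System.out.printf("%s hotness=%.3f -> %s%n",
                    b.id(), hotness(b), route(b, 1.0));
        }
    }
}
```

In this sketch the cold path trades write-time encoding cost for roughly half the storage overhead of triple replication, which is the motivation the abstract gives for reserving replication for hot data.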
