Abstract

Data replication is the main fault tolerance mechanism implemented by the Apache Hadoop Distributed File System (HDFS). The placement of data across the cluster directly affects replica balancing and data locality. The HDFS Balancer is the native solution for rebalancing the data distribution by moving blocks from over-utilized to under-utilized nodes. Nevertheless, its current balancing policy does not address the characteristics and specific needs of applications during data rearrangement. In this work, we present the PRBP, a customized replica balancing policy for the HDFS Balancer. The PRBP is based on a system of priorities that can be adapted and configured according to different usage demands, whether these relate to heterogeneous environments or focus on improving data reliability and availability. The priorities define whether system metrics or aspects of the cluster topology should be considered during the execution of the HDFS Balancer, thus making replica balancing in HDFS more flexible. Based on the priority system, we determine association rules that allow multiple priorities to be used simultaneously. Along with these rules, we present guidelines for using the PRBP as a specialized solution in scenarios that can benefit from reactive replica balancing. In addition, we conducted a practical experiment to highlight the behavior and applicability of the PRBP guidelines for prioritizing replica rearrangement in the file system.