Abstract

The Hadoop Distributed File System (HDFS) is a reliable storage engine designed to run on commodity hardware. To provide reliability and read performance, HDFS uses a storage model based on data replication and works best when file blocks are evenly spread across the cluster. The HDFS Balancer is an Apache Hadoop daemon created to balance replicas across the file system. However, the tool is not optimized to preserve reliability and availability during data redistribution, and it must be configured and triggered manually. In this work, we present a solution for replica balancing that combines a proactive and a reactive approach. The former is addressed through active monitoring of the computational environment by an agent-server structure; the latter is based on customizing the default operation policy of the HDFS Balancer. As the evaluation results show, the solution automates the use of the HDFS Balancer and allows it to execute according to the reliability of the racks and the availability of the data stored in the cluster.
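The manual configuration and triggering mentioned above refers to the standard HDFS Balancer command-line invocation; a typical run (the flag values below are illustrative, not prescribed by this work) looks like:

```shell
# Manually trigger the HDFS Balancer (values are illustrative).
# -threshold: maximum allowed deviation, in percent, of each DataNode's
#             utilization from the cluster's average utilization.
# -policy:    balancing granularity, "datanode" or "blockpool".
hdfs balancer -threshold 10 -policy datanode
```

The daemon then moves block replicas from over-utilized to under-utilized DataNodes until every node is within the threshold or no further moves are possible; the solution described here automates when and how such runs are launched.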
