Abstract

With the growing demands of big data analysis and data security, traditional storage can no longer meet requirements. As a result, more and more companies and individuals move their data to distributed file systems, which expand storage capacity for massive data and provide reliable file access in case of hardware or software failure. The Hadoop Distributed File System (HDFS), a well-known distributed file system, has been widely adopted to store data and serve as a basic storage layer. Since the release of HDFS 2.0, HDFS has supported new features such as appending to files and reading while writing. However, HDFS implements the read operation with a strongly consistent policy: a file under construction cannot be accessed until the requested data are available on all DataNodes. Consequently, a read operation cannot proceed until every DataNode has received all of the requested data. In this paper, we present ERP, which introduces slight changes to HDFS to obtain higher availability. ERP loosens the constraints of the original read policy to allow data access even when only a subset of the replicas has stored the required data. To evaluate the performance of HDFS with ERP, we measure the time consumed on the last block of a file under construction, from the start of writing the last block to the end of reading it. In our experiments, ERP outperforms the original HDFS by 163%.
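The consistency difference the abstract describes can be illustrated with a minimal sketch, which is not HDFS code: under the original policy, a byte of the last block becomes readable only once every replica in the write pipeline holds it, whereas an ERP-style relaxed policy makes it readable once any replica does. The function names and byte counts below are illustrative assumptions, not from the paper.

```python
# Simplified model of read visibility for the last, under-construction block.
# replica_bytes[i] = number of bytes DataNode i has received so far.
# (Hypothetical values; not taken from the paper's experiments.)

def strong_visible(replica_bytes):
    """Original HDFS policy: readable length is bounded by the
    slowest replica, since ALL replicas must hold the data."""
    return min(replica_bytes)

def relaxed_visible(replica_bytes):
    """ERP-style policy: readable length is bounded by the
    fastest replica, since ANY replica holding the data suffices."""
    return max(replica_bytes)

replicas = [4096, 2048, 1024]  # bytes acked by each of three DataNodes
print(strong_visible(replicas))   # 1024: reader waits on the slowest replica
print(relaxed_visible(replicas))  # 4096: reader can proceed much earlier
```

The gap between the two values is exactly the window in which a reader of a file under construction would block under the original policy but could make progress under ERP.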
