Abstract

The Hadoop Distributed File System (HDFS) is a popular cloud storage platform, owing to its scalable, reliable, and low-cost storage capability. However, it is mainly designed for batch processing of large files, which means that small files cannot be handled efficiently by HDFS. In this paper, we propose a mechanism for storing small files in HDFS. In our approach, the size of a file is checked before it is uploaded to HDFS. If the file is smaller than the HDFS block size, all correlated small files are merged into a single file and an index is built for each small file. Furthermore, prefetching and caching mechanisms are used to improve the reading efficiency of small files. New small files are handled by appending them to an existing merged file. Experimental results show that, compared with the original HDFS, the storage efficiency of small files is improved.
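A minimal sketch of the size-check and merge-with-index step described above, written against the Hadoop Java API. The class name SmallFileMerger, the in-memory index map, and the merged-file layout are illustrative assumptions for this sketch, not the paper's actual implementation; appending to an existing HDFS file also assumes append support is enabled on the cluster.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileMerger {

    // Maps a small file's name to its (offset, length) inside the merged file.
    private final Map<String, long[]> index = new HashMap<>();

    public void upload(FileSystem fs, File localFile, Path mergedPath, Path largeDir)
            throws IOException {
        Configuration conf = fs.getConf();
        long blockSize = fs.getDefaultBlockSize(mergedPath);

        if (localFile.length() >= blockSize) {
            // Large files follow the normal HDFS upload path.
            fs.copyFromLocalFile(new Path(localFile.getPath()),
                                 new Path(largeDir, localFile.getName()));
            return;
        }

        // Small file: append to the merged file (requires HDFS append support)
        // and record an index entry of the form (offset, length).
        FSDataOutputStream out = fs.exists(mergedPath)
                ? fs.append(mergedPath)
                : fs.create(mergedPath);
        try (FileInputStream in = new FileInputStream(localFile)) {
            long offset = out.getPos();
            IOUtils.copyBytes(in, out, conf, false);
            index.put(localFile.getName(),
                      new long[] { offset, localFile.length() });
        } finally {
            out.close();
        }
    }
}
```

On a read request, the index entry would give the offset and length needed to seek into the merged file, and correlated entries could be prefetched into a cache; those parts are omitted here.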
