Abstract
The Hadoop Distributed File System (HDFS) was originally designed to store large files and has been widely used across the big-data ecosystem. However, it can suffer from serious performance issues when handling a large number of small files. In this paper, we propose a novel archive system, referred to as Small File Merger (SFM), to solve the small file problem in HDFS. The key idea is to combine small files into large ones and build an index for accessing the original files. Unlike traditional archive systems such as Hadoop Archives (HAR), SFM allows archived files to be modified directly without re-archiving. Considering that most reads in HDFS are sequential, we design an adaptive readahead strategy based on the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm to maximize read performance. Furthermore, our system provides an HDFS-compatible interface, which can be used directly without recompiling or redeploying the existing HDFS cluster, hence facilitating convenient deployment in practice. Preliminary experimental results show that our system achieves better performance than existing methods.
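To make the SPSA-based tuning idea concrete, the sketch below shows how a scalar readahead parameter (e.g., a readahead window size) could be adjusted with one standard SPSA iteration. This is only an illustrative sketch, not the paper's actual implementation: the function name spsa_step, the measure_cost callback, and the gain constants are assumptions introduced here for exposition.

```python
import random

def spsa_step(theta, k, measure_cost,
              a=0.5, c=0.1, A=10, alpha=0.602, gamma=0.101):
    """One SPSA iteration on a scalar parameter theta (hypothetical
    readahead window size). measure_cost(theta) is assumed to return an
    observed cost, e.g., average read latency under that setting."""
    a_k = a / (k + 1 + A) ** alpha      # decaying step-size gain
    c_k = c / (k + 1) ** gamma          # decaying perturbation size
    delta = random.choice([-1.0, 1.0])  # Rademacher perturbation

    # Two noisy cost measurements at symmetrically perturbed settings.
    cost_plus = measure_cost(theta + c_k * delta)
    cost_minus = measure_cost(theta - c_k * delta)

    # Simultaneous-perturbation gradient estimate, then gradient descent.
    grad_est = (cost_plus - cost_minus) / (2.0 * c_k * delta)
    return theta - a_k * grad_est
```

Under these assumptions, each tuning step needs only two cost observations regardless of how many parameters are tuned, which is the usual motivation for choosing SPSA in an online setting such as readahead adaptation.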