Abstract

MapReduce is a powerful distributed processing model for large datasets. Hadoop is an open-source framework that implements MapReduce, and the Hadoop Distributed File System (HDFS) has become very popular for building large-scale, high-performance distributed data processing systems. HDFS is designed mainly to handle large files, so processing massive numbers of small files is a challenge for native HDFS. This paper introduces an approach to optimize the performance of processing massive small files on HDFS. We design a new HDFS structure model whose main idea is to merge small files and write them directly into the merged file at the source. Experimental results show that the proposed scheme effectively improves the storage and access efficiency of massive small files on HDFS.

Keywords: MapReduce, HDFS, Big data, Cluster
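The abstract does not detail the merging mechanism itself, but the general idea of packing many small files into one large HDFS file at the source can be illustrated with Hadoop's standard SequenceFile container format: each small file becomes one key/value record, keyed by its original name. The sketch below is a minimal illustration under that assumption; the class name SmallFileMerger and the HDFS path are hypothetical and do not reflect the paper's actual structure model.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.nio.file.Files;

public class SmallFileMerger {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical destination for the merged container file on HDFS.
        Path mergedPath = new Path("hdfs:///data/merged/batch-0001.seq");

        // One writer per merged file: small files are appended as records
        // instead of being stored as individual HDFS files.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedPath),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // args[0]: local directory of small files to merge at the source.
            for (File small : new File(args[0]).listFiles()) {
                byte[] content = Files.readAllBytes(small.toPath());
                // key = original file name, value = raw file contents.
                writer.append(new Text(small.getName()),
                              new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

Merging at the source in this way reduces the number of entries the NameNode must track and lets MapReduce jobs read the small files sequentially from one block-aligned file, which is the storage and access benefit the abstract refers to.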
