Abstract

Hadoop is an open-source framework written in Java and used for big data processing. It consists of two main components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is used to store data, while MapReduce is used to distribute and process application tasks in a distributed fashion. Recently, several researchers have employed Hadoop for processing big data. The results indicate that Hadoop performs well with large files (files larger than the DataNode block size). Nevertheless, Hadoop performance decreases with small files, i.e., files smaller than its block size. This is because small files consume the memory of both the DataNode and the NameNode and increase the execution time of applications (i.e., they decrease MapReduce performance). In this paper, the problem of small files in Hadoop is defined, and the existing approaches to solving this problem are classified and discussed. In addition, some open points that must be considered when devising a better approach to improve Hadoop performance when processing small files are presented.

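To make the small-file problem concrete, the sketch below shows one commonly cited mitigation from the small-files literature: packing many small files into a single Hadoop SequenceFile so that the NameNode tracks one large file instead of thousands of tiny ones. This is an illustrative example, not the specific approach surveyed in the paper; the class name, output path, and (filename, contents) record layout are assumptions chosen for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.nio.file.Files;

// Illustrative sketch: merge small local files into one SequenceFile on HDFS.
// Each small file becomes a single (filename, contents) record, so HDFS stores
// one large container file rather than one block and one NameNode entry per small file.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("packed-small-files.seq"); // hypothetical output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (String name : args) {
                byte[] data = Files.readAllBytes(new File(name).toPath());
                // Key = original file name, value = raw bytes of the small file.
                writer.append(new Text(name), new BytesWritable(data));
            }
        }
    }
}
```

A MapReduce job can then read the packed records back with SequenceFileInputFormat, which avoids launching one map task per small file and reduces pressure on NameNode memory.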