Abstract

The volume of data handled by today's enterprises has been growing enormously over the last few years, and with it the need to process and analyze these large volumes of data. The Hadoop Distributed File System (HDFS) is an open-source Apache implementation designed to run on commodity hardware and to serve applications with very large datasets (terabytes to petabytes). The HDFS architecture is built around a single master (the NameNode), which manages the metadata for a large number of slave nodes. For maximum efficiency, the NameNode keeps all of this metadata in RAM. Consequently, when dealing with a huge number of small files, the NameNode often becomes a bottleneck for HDFS, as it may run out of memory. Apache Hadoop provides Hadoop ARchive (HAR) to cope with small files, but HAR is inefficient in a multi-NameNode environment, which requires automatic scaling of metadata. In this paper, we design a hashtable-based architecture, Hadoop ARchive Plus (HAR+), a modification of the existing HAR that uses SHA-256 as the key. HAR+ is designed to provide greater reliability and automatic scaling of metadata: instead of storing the metadata on one NameNode, HAR+ distributes it across multiple NameNodes. Our results show that HAR+ reduces the load on a single NameNode by a significant amount, making the cluster more scalable, more robust, and less prone to failure than Hadoop ARchive.
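The abstract does not spell out how the SHA-256 keys map metadata entries to NameNodes, so the sketch below is only an illustrative guess at the idea: it assumes the key is the SHA-256 digest of a file's path and that a simple modulo over the digest selects one of N NameNodes. The class and method names (`NameNodeSelector`, `selectNameNode`) are hypothetical, not part of HAR+ or Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Hypothetical sketch of the hashtable-based dispatch the HAR+ abstract
 * describes: hash a small file's path with SHA-256 and use the digest
 * to pick which of N NameNodes holds that file's metadata entry.
 */
public class NameNodeSelector {

    private final int nameNodeCount; // number of NameNodes in the cluster (assumed)

    public NameNodeSelector(int nameNodeCount) {
        this.nameNodeCount = nameNodeCount;
    }

    /** Returns the index of the NameNode responsible for this file path. */
    public int selectNameNode(String filePath) {
        byte[] digest = sha256(filePath);
        // Fold the first four digest bytes into an int, then reduce it
        // (as an unsigned value) modulo the number of NameNodes.
        int h = ((digest[0] & 0xFF) << 24)
              | ((digest[1] & 0xFF) << 16)
              | ((digest[2] & 0xFF) << 8)
              |  (digest[3] & 0xFF);
        return Integer.remainderUnsigned(h, nameNodeCount);
    }

    private static byte[] sha256(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be available", e);
        }
    }

    public static void main(String[] args) {
        NameNodeSelector selector = new NameNodeSelector(4);
        System.out.println(selector.selectNameNode("/user/data/small-file-0001.txt"));
    }
}
```

A plain modulo like this remaps most keys when a NameNode is added or removed; a consistent-hashing scheme would avoid that, but the abstract does not say which scheme HAR+ actually uses.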
