Abstract

Big Data refers to the continuous generation and updating of large volumes of data around the clock by users across the globe. Handling such large volumes of data in a real-time environment is a challenging task. A distributed file system is one strategy for handling large volumes of data in real time: it is a collection of independent computers that appears to its users as a single coherent system. In a distributed file system, common files can be shared between nodes, but its drawbacks include limited scalability, replication, and availability, as well as the high cost of dedicated server hardware. The Hadoop Distributed File System (HDFS) was developed to overcome these issues. HDFS runs on clusters of commodity hardware such as personal computers and laptops, and provides scalable, fault-tolerant, cost-efficient storage for Big Data. HDFS supports data duplication to achieve high data reliability; however, this duplication strategy requires additional storage space. HDFS storage space can be managed more efficiently by implementing de-duplication techniques. The objective of this research is to eliminate file duplication by implementing a de-duplication strategy, and a novel and efficient de-duplication system for HDFS is introduced in this work. To implement the de-duplication strategy, hash values are computed for files using the MD5 and SHA-1 algorithms. The generated hash value of a file is compared against those of existing files to detect duplication. If a duplicate is found, the system prevents the user from uploading the duplicate copy to HDFS. Hence, storage utilization in HDFS is handled efficiently.
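A minimal sketch of the hash-based duplicate check described above, assuming an in-memory index of fingerprints for previously uploaded files (the abstract does not specify how the lookup or the HDFS upload is performed, so those details are placeholders):

```python
import hashlib

def file_digests(path, chunk_size=8192):
    """Compute MD5 and SHA-1 digests of a file by streaming its contents."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

def should_upload(path, known_hashes):
    """Return True if the file's (MD5, SHA-1) pair is not already recorded.

    `known_hashes` is a hypothetical set of (md5, sha1) tuples for files
    already stored in HDFS; a real system would persist this index.
    """
    digests = file_digests(path)
    if digests in known_hashes:
        return False           # duplicate found: reject the upload
    known_hashes.add(digests)  # record the new file's fingerprint
    return True
```

In a real deployment, the fingerprint index would be kept in persistent metadata rather than in memory, and a file passing the check would then be written to HDFS (for example with the standard `hdfs dfs -put` command).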

