Abstract

Storage optimization has emerged as one of the most active research areas in big data, yielding solutions such as data compression that largely converge toward the deduplication technique. Deduplication finds and eliminates duplicate content by storing only unique copies of data; its efficiency is judged by the amount of duplicate content it removes from the data source. Because deduplication is a well-established storage optimization technique, various refinements have been proposed over time, yet it still has limitations: it cannot detect the small changes that occur between similar contents, and the chunks generated by segmenting and hashing the data are highly sensitive to change, so every small modification produces a new chunk and undermines the goal of storage optimization. To tackle this, content deduplication with granularity tweak (CDGT) on the Hadoop architecture is proposed for large text datasets. CDGT aims to improve the efficiency of deduplication by utilizing the Reed-Solomon technique, which extracts more duplicate content by verifying both intracontent and intercontent similarity, thereby enhancing performance. The system also incorporates cluster-based indexing to reduce the time involved in data management activities.
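To illustrate the limitation the abstract describes, the following is a minimal sketch (not the paper's CDGT implementation) of conventional chunk-level deduplication: documents are split into fixed-size chunks, fingerprinted with SHA-256, and only unique chunks are stored. The function names (`chunk_hashes`, `dedup_store`) and the chunk size are hypothetical choices for the example; the point is that a single small edit changes a chunk's hash, so near-duplicate content goes undetected.

```python
# Minimal sketch of fixed-size chunking + hash-based deduplication,
# showing why ordinary chunk dedup is sensitive to tiny edits.
import hashlib

CHUNK_SIZE = 64  # bytes; chosen arbitrarily for this example


def chunk_hashes(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split text into fixed-size chunks and return their SHA-256 digests."""
    data = text.encode("utf-8")
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def dedup_store(documents: list[str]) -> dict[str, bytes]:
    """Store only unique chunks, keyed by their fingerprint."""
    store: dict[str, bytes] = {}
    for doc in documents:
        data = doc.encode("utf-8")
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog " * 4
    edited = original.replace("lazy", "idle", 1)  # one small, same-length edit
    shared = set(chunk_hashes(original)) & set(chunk_hashes(edited))
    print(f"chunks shared after a tiny edit: {len(shared)} of {len(chunk_hashes(original))}")
    # The chunk containing the edit gets a brand-new hash, so the two
    # near-identical documents are not recognized as duplicates at that
    # position -- the sensitivity that CDGT's granularity tweak targets.
```

Running the sketch shows that all chunks except the edited one are shared; the edited chunk is stored again in full even though only four characters changed.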
