Abstract

Background: As the volume of collected data surges, deduplication has become one of the key problems facing researchers. Eliminating coarse-grained redundant data offers significant advantages in reducing storage and network bandwidth consumption and in improving system scalability. Conventional approaches to removing duplicate data, such as hash comparison and binary differential incremental methods, lead to several bottlenecks when processing large-scale data. Moreover, the traditional Simhash similarity method gives little consideration to the natural similarity of text in certain specific domains and cannot be parallelized efficiently for large-scale text processing. This paper first examines several of the most important patents in the area of duplicate data detection, and then focuses on large-scale data deduplication based on MapReduce and HDFS.

Keywords: Large-scale data sets, deduplication, MapReduce, HDFS, Simhash, shared nearest neighbor.
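To make the Simhash technique referenced above concrete, the following is a minimal, illustrative sketch of classic Simhash fingerprinting and near-duplicate detection, not the improved method proposed in the paper. The whitespace tokenization, the MD5-based 64-bit token hash, and the example distance threshold are assumptions chosen for brevity.

```python
import hashlib

def simhash(text, bits=64):
    """Compute a Simhash fingerprint from whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.split():
        # Hash each token to a 64-bit integer (MD5 truncated, for illustration only).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            # Each token bit votes +1 if set, -1 if not.
            weights[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps a 1 wherever the aggregate vote is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# Near-duplicate texts yield fingerprints with a small Hamming distance, so a
# threshold (e.g. <= 3 bits, an assumed value) can flag candidate duplicates.
doc1 = "large scale data deduplication based on mapreduce and hdfs"
doc2 = "large scale data deduplication built on mapreduce and hdfs"
print("hamming distance:", hamming_distance(simhash(doc1), simhash(doc2)))
```

In a MapReduce setting such as the one the paper targets, mappers would typically emit per-document fingerprints and reducers would group and compare them, but the exact partitioning scheme is specific to the paper's design and is not reproduced here.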

