Resemblance and mergence based indexing for high performance data deduplication

Panfeng Zhang, Hua Wang, Ping Huang, Ke Zhou, Xubin He

doi:10.1016/j.jss.2017.02.039

Abstract

A new deduplication scheme based on a two-level index structure and a dynamic Bloom filter. A fast resemblance approach to index duplicate data based on segments. A novel resemblance mergence strategy that groups segments into bins. A new frequency-based cleanup method to avoid storing low-frequency fingerprints. A thorough evaluation of our approach to demonstrate its effectiveness.

Data deduplication, a data redundancy elimination technique, has been widely employed in many application environments to reduce data storage space. However, it is challenging to provide a fast and scalable key-value fingerprint index, particularly for large datasets, even though index performance is critical to overall deduplication performance. This paper proposes RMD, a resemblance and mergence based deduplication scheme, which aims to provide quick responses to fingerprint queries. The key idea of RMD is to leverage a Bloom filter array and a data resemblance algorithm to dramatically reduce the query range. At ingest time, RMD uses a resemblance algorithm to detect resembling data segments and places them in the same bin. As a result, at query time, it only needs to search the corresponding bin to detect duplicate content, which significantly speeds up the query process. Moreover, RMD uses a mergence strategy to accumulate resembling segments into the relevant bins, and exploits a frequency-based fingerprint retention policy to cap bin capacity, improving both query throughput and the deduplication ratio. Extensive experiments with real-world datasets show that RMD achieves high query performance and outperforms several well-known deduplication schemes.
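The bin-based lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which resemblance algorithm RMD uses, so this sketch assumes the smallest SHA-1 chunk fingerprint of a segment serves as its resemblance feature. The names `BloomFilter`, `Bin`, and `ResemblanceIndex`, the filter sizes, and the eviction rule (drop the least frequently seen fingerprints once a bin exceeds its capacity) are likewise hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Bin:
    """Fingerprints of resembling segments, capped at `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # fingerprint -> number of times it was seen
        self.bloom = BloomFilter()

    def merge(self, fingerprints):
        """Merge a segment into the bin; return the fingerprints that were new."""
        new = []
        for fp in fingerprints:
            # The Bloom filter rejects most misses cheaply; the dict confirms hits.
            if fp in self.bloom and fp in self.counts:
                self.counts[fp] += 1
            else:
                self.counts[fp] = 1
                self.bloom.add(fp)
                new.append(fp)
        self._evict()
        return new

    def _evict(self):
        # Assumed retention policy: keep only the most frequent fingerprints.
        if len(self.counts) <= self.capacity:
            return
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        self.counts = dict(ranked[: self.capacity])
        self.bloom = BloomFilter()  # Bloom filters cannot delete, so rebuild
        for fp in self.counts:
            self.bloom.add(fp)


class ResemblanceIndex:
    def __init__(self, bin_capacity=1024):
        self.bin_capacity = bin_capacity
        self.bins = {}  # resemblance feature -> Bin

    def ingest(self, chunks):
        """Route a segment (a non-empty list of byte chunks) to one bin."""
        fps = [hashlib.sha1(chunk).digest() for chunk in chunks]
        feature = min(fps)  # assumed feature: the smallest chunk fingerprint
        if feature not in self.bins:
            self.bins[feature] = Bin(self.bin_capacity)
        return self.bins[feature].merge(fps)
```

Under these assumptions, ingesting a segment a second time returns an empty list, because every fingerprint is found in the single bin selected by the segment's feature and no other bin is searched. Two segments that share their smallest fingerprint land in the same bin, so only the chunks that differ between them are reported as new.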

Full Text