Abstract

I/O deduplication is a key technique for improving the space and I/O efficiency of storage systems. Among the various deduplication techniques, content-defined chunking (CDC)-based deduplication is the most desirable for its high deduplication ratio. However, CDC is compute-intensive and time-consuming, and has been recognized as a major performance bottleneck of CDC-based deduplication systems. In this paper we leverage a property of duplicate data, named duplicate locality, which reveals that multiple duplicate chunks are likely to occur together: one duplicate chunk is likely to be immediately followed by a sequence of contiguous duplicate chunks. The longer the sequence, the stronger the locality. After a quantitative analysis of duplicate locality in real-world data, we propose a suite of chunking techniques that exploit the locality to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The resulting deduplication method, named RapidCDC, has two salient features. One is that its efficiency is positively correlated with the deduplication ratio: RapidCDC can be as fast as a fixed-size chunking method on data sets with high data redundancy. The other is that its high efficiency does not rely on high duplicate locality strength. These features make RapidCDC's effectiveness almost guaranteed for datasets with a high deduplication ratio. Our experimental results with synthetic and real-world datasets show that RapidCDC's chunking speed can be up to 33x higher than that of regular CDC, while it maintains (nearly) the same deduplication ratio.
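
To make the duplicate-locality idea concrete, the sketch below is a toy, locality-aware chunker in the spirit of the approach the abstract describes; it is not the paper's implementation. The CDC parameters, the rolling hash, and the next_size hint table are all illustrative assumptions: each stored fingerprint remembers the size of the chunk that followed it, so after a duplicate hit the chunker can propose the next boundary directly and accept it with a single fingerprint lookup instead of a byte-by-byte rolling-hash scan.

    import hashlib

    # Assumed CDC parameters for illustration (min size, boundary mask, max size).
    MIN_SIZE, MASK, MAX_SIZE = 2048, 0x1FFF, 65536

    def fingerprint(chunk: bytes) -> bytes:
        """SHA-1 fingerprint, used both for dedup lookup and boundary acceptance."""
        return hashlib.sha1(chunk).digest()

    def cdc_boundary(data: bytes, start: int) -> int:
        """Regular CDC slow path: scan byte by byte with a toy rolling hash
        until the boundary condition fires, or max size / end of data is hit."""
        h = 0
        limit = min(start + MAX_SIZE, len(data))
        for i in range(start, limit):
            h = ((h << 1) ^ data[i]) & 0xFFFFFFFF
            if i - start + 1 >= MIN_SIZE and (h & MASK) == 0:
                return i + 1
        return limit

    def chunk_stream(data: bytes, next_size: dict) -> list:
        """next_size maps a chunk's fingerprint to the size of the chunk that
        followed it last time -- the hint that duplicate locality makes useful."""
        chunks, pos, prev_fp, prev_dup = [], 0, None, False
        while pos < len(data):
            end = None
            hint = next_size.get(prev_fp) if prev_dup else None
            # Fast path: after a duplicate, try the recorded next-chunk size and
            # accept the boundary if the candidate is itself a known duplicate,
            # skipping the rolling-hash scan entirely.
            if hint and pos + hint <= len(data):
                fp = fingerprint(data[pos:pos + hint])
                if fp in next_size:
                    end = pos + hint
            if end is None:  # slow path: fall back to regular CDC
                end = cdc_boundary(data, pos)
                fp = fingerprint(data[pos:end])
            is_dup = fp in next_size
            next_size.setdefault(fp, None)          # register new fingerprints
            if prev_fp is not None and next_size[prev_fp] is None:
                next_size[prev_fp] = end - pos      # attach the size hint
            chunks.append((pos, end - pos, is_dup))
            pos, prev_fp, prev_dup = end, fp, is_dup
        return chunks

In this toy version, a first pass over fresh data takes the slow path throughout and populates the hint table; a second pass over mostly unchanged data rides the fast path through each run of contiguous duplicate chunks, which is the behavior behind the locality-driven speedup the abstract reports.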
