Abstract

Content-Defined Chunking (CDC) is a key stage of data deduplication because it strongly affects both the throughput and the deduplication efficiency of a deduplication system. However, existing CDC algorithms suffer from high computation overhead, weak stability, and a poor ability to handle low-entropy strings. In this paper, we propose UltraCDC, a fast and stable CDC algorithm for deduplication-based storage systems that also handles low-entropy strings efficiently. UltraCDC rests on four key techniques: rolling computation of boundary conditions, skipping of sub-minimum chunk sizes, normalized chunking, and jumping to detect low-entropy strings. Computing boundary conditions over a sliding window in a rolling fashion not only accelerates the chunking stage but also makes it more resistant to boundary shift. Skipping sub-minimum chunk sizes and normalized chunking complement each other to speed up chunking without sacrificing much deduplication ratio. The jumping detection finds more low-entropy strings than AE-opt2 without affecting chunking speed. We implemented UltraCDC in Destor, and the experimental results show that, with these four techniques, its chunking speed is 1.5–10× faster than state-of-the-art CDC approaches, while its deduplication ratio is comparable to or even higher than that of classic Rabin-based CDC. UltraCDC is also the CDC approach with the highest capability to detect low-entropy strings: 10²× and 2× higher than Rabin-based CDC and AE-opt2, respectively.
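To make two of the techniques named above more concrete, the following is a minimal sketch of a generic content-defined chunker that skips the sub-minimum region and applies normalized chunking with two masks. It is not UltraCDC itself: the abstract does not specify UltraCDC's boundary condition or its low-entropy jump, so a Gear-style rolling hash is swapped in as a stand-in and the jump is not modeled. All names and values here (`MIN_SIZE`, `NORMAL_SIZE`, `MAX_SIZE`, `MASK_STRICT`, `MASK_LOOSE`, `GEAR`, `next_cut`, `chunk`) are assumptions chosen for illustration.

```python
import random

# Hypothetical parameters for illustration; none are taken from the paper.
MIN_SIZE, NORMAL_SIZE, MAX_SIZE = 2048, 8192, 65536
MASK_STRICT = (1 << 15) - 1  # harder to match: used before NORMAL_SIZE
MASK_LOOSE = (1 << 11) - 1   # easier to match: used from NORMAL_SIZE on

# Fixed pseudo-random table mapping each byte value to a 64-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def next_cut(data: bytes, start: int) -> int:
    """Return the exclusive end offset of the chunk that begins at start."""
    if len(data) - start <= MIN_SIZE:
        return len(data)  # the tail is too short to split further
    end = min(start + MAX_SIZE, len(data))
    normal = min(start + NORMAL_SIZE, end)
    h = 0
    # Skip the sub-minimum region: no cut point may fall there,
    # so its bytes are never hashed.
    for i in range(start + MIN_SIZE, end):
        # Gear-style rolling hash (a stand-in, not UltraCDC's condition).
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF
        # Normalized chunking: strict mask early, loose mask later,
        # which pulls chunk sizes toward NORMAL_SIZE.
        mask = MASK_STRICT if i < normal else MASK_LOOSE
        if h & mask == 0:
            return i + 1
    return end  # forced cut at the maximum chunk size


def chunk(data: bytes) -> list:
    """Split data into content-defined chunks."""
    chunks, pos = [], 0
    while pos < len(data):
        cut = next_cut(data, pos)
        chunks.append(data[pos:cut])
        pos = cut
    return chunks
```

Calling `chunk(data)` on a byte string returns chunks that concatenate back to `data`, with every chunk except possibly the last longer than `MIN_SIZE` and no chunk longer than `MAX_SIZE`. Because cut points depend on content rather than on absolute offsets, an insertion near the start of the input tends to disturb only nearby chunks, which is the boundary-shift resistance the abstract refers to.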
