Abstract

Content-Defined Chunking (CDC) is the key stage of data deduplication because it has a significant impact on both the throughput and the deduplication efficiency of a deduplication system. However, existing CDC algorithms suffer from high computation overhead, weak stability, and a poor ability to handle low-entropy strings. In this paper, we propose UltraCDC, a fast and stable CDC algorithm for deduplication-based storage systems that also handles low-entropy strings efficiently. Four key techniques underlie UltraCDC: rolling computation of boundary conditions, skipping the sub-minimum chunk size region, normalized chunking, and jumping to detect low-entropy strings. Using a sliding window to compute boundary conditions in a rolling manner not only accelerates the chunking stage but also makes it more resistant to boundary shift. Skipping the sub-minimum chunk size region and normalized chunking complement each other, speeding up chunking without sacrificing much deduplication ratio. Jumping detection finds more low-entropy strings than AE-opt2 without affecting chunking speed. We implemented UltraCDC in Destor, and the experimental results show that, with these four techniques, its chunking speed is 1.5–10× faster than state-of-the-art CDC approaches, while its deduplication ratio is comparable to or even higher than that of classic Rabin-based CDC. UltraCDC also has the highest capability to detect low-entropy strings among CDC approaches: 10²× and 2× higher than Rabin-based CDC and AE-opt2, respectively.
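
To make two of the techniques named above concrete, the following is a minimal Python sketch of content-defined chunking with sub-minimum skipping and normalized chunking. It is not UltraCDC itself: the abstract does not specify UltraCDC's boundary condition, so a Gear-style rolling hash is swapped in as a stand-in, and low-entropy jumping detection is not modeled. The function name, the size parameters, and the mask widths are all hypothetical and chosen only for illustration.

```python
import random

# Assumption: a Gear-style rolling hash stands in for UltraCDC's own
# boundary condition, which the abstract does not describe.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk_boundaries(data, min_size=2048, normal_size=8192, max_size=65536,
                     strict_bits=15, loose_bits=11):
    """Return the end offset of every chunk in `data` (hypothetical parameters)."""
    strict_mask = (1 << strict_bits) - 1  # harder to satisfy: used before normal_size
    loose_mask = (1 << loose_bits) - 1    # easier to satisfy: used after normal_size
    boundaries = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + max_size, n)
        # Skip the sub-minimum region: no boundary test in the first min_size bytes.
        pos = start + min_size
        h = 0
        cut = end  # fall back to max_size (or end of data) if no boundary matches
        while pos < end:
            # Rolling update: each new byte shifts the hash and adds a table entry.
            h = ((h << 1) + GEAR[data[pos]]) & MASK64
            # Normalized chunking: switch masks around the target chunk size.
            mask = strict_mask if pos - start < normal_size else loose_mask
            if h & mask == 0:
                cut = pos + 1
                break
            pos += 1
        boundaries.append(cut)
        start = cut
    return boundaries


if __name__ == "__main__":
    rng = random.Random(1)
    sample = bytes(rng.getrandbits(8) for _ in range(200_000))
    ends = chunk_boundaries(sample)
    print(len(ends), "chunks; last boundary at", ends[-1])
```

Every chunk except possibly the last is longer than `min_size` and no chunk exceeds `max_size`. On a long run of identical bytes, a hash condition like this one typically never matches and falls through to `max_size`, which is the low-entropy case that this sketch leaves unhandled.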
