A High-performance Post-deduplication Delta Compression Scheme for Packed Datasets

Yucheng Zhang, Nan Jiang, Mengtian Shi, Xinyun Wu, Chunzhi Wang, Hong Jiang

doi:10.1109/iccd53106.2021.00078

Abstract

Data deduplication has become a standard feature in most backup storage systems to reduce storage costs. In real-world deduplication-based backup products, small files are grouped into larger packed files prior to deduplication. As part of this grouping, the backup product inserts a metadata block immediately before the contents of each file. Since the contents of these metadata blocks vary with every backup, different backup streams of packed files built from the same or highly similar small files will contain chunks that conventional deduplication considers mostly unique. That is, apart from the metadata blocks, most of the content of these unique chunks is identical across backups. Delta compression is able to remove this redundancy but cannot be applied to backup storage because the extra I/Os required to retrieve the base chunks significantly decrease backup throughput. If there are many grouped small files in the backup datasets, some duplicate chunks, called persistent fragmented chunks (PFCs), may be rewritten repeatedly. We observe that PFCs are often surrounded by a substantial number of unique chunks containing metadata blocks. In this paper, we propose a PFC-inspired delta compression scheme to efficiently perform delta compression for unique chunks surrounding identical PFCs. During deduplication, containers holding previous copies of the chunks being considered for storage are accessed to prefetch metadata, which accelerates the detection of duplicates. The main idea behind our scheme is to identify containers holding PFCs and, when they are accessed during deduplication, prefetch the chunks in those containers by piggybacking on the reads that prefetch metadata. Base chunks for delta compression are then detected from the prefetched chunks, thus eliminating extra I/Os for retrieving the base chunks. Experimental results show that PFC-inspired delta compression attains an additional data reduction of about 2x on top of data deduplication and accelerates the restore speed by 8.6%-49.3%, while moderately sacrificing the backup throughput by 0.5%-11.9%.
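The piggybacking idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the `Store` class, the precomputed `pfc_containers` set (how PFC containers are identified is not modeled), the `max_ratio` threshold, the two-pass treatment of a stream segment, the exhaustive search for a base chunk, and the use of DEFLATE with a preset dictionary as a stand-in for a real delta encoder.

```python
import hashlib
import zlib


def fingerprint(chunk: bytes) -> str:
    """Content hash used as the chunk's deduplication fingerprint."""
    return hashlib.sha1(chunk).hexdigest()


def delta_encode(target: bytes, base: bytes) -> bytes:
    """Stand-in delta encoder: DEFLATE primed with the base chunk as a dictionary."""
    comp = zlib.compressobj(zdict=base)
    return comp.compress(target) + comp.flush()


def delta_decode(delta: bytes, base: bytes) -> bytes:
    """Inverse of delta_encode; needs the same base chunk."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()


class Store:
    """Toy container store; `pfc_containers` is a hypothetical, precomputed set."""

    def __init__(self):
        self.containers = {}         # container id -> {fingerprint: chunk bytes}
        self.index = {}              # fingerprint -> container id
        self.pfc_containers = set()  # ids of containers assumed to hold PFCs
        self.container_reads = 0     # counts container accesses during backup

    def add_container(self, cid, chunks, holds_pfc=False):
        self.containers[cid] = {fingerprint(c): c for c in chunks}
        for fp in self.containers[cid]:
            self.index[fp] = cid
        if holds_pfc:
            self.pfc_containers.add(cid)

    def backup(self, stream, max_ratio=0.5):
        """Return ('dup' | 'delta' | 'unique', ...) entries for one stream segment (a list)."""
        fp_cache, base_cache = set(), {}
        # Pass 1: deduplicate. Each container access prefetches fingerprints and,
        # for PFC containers, piggybacks the chunk data on that same read.
        is_dup = []
        for chunk in stream:
            fp = fingerprint(chunk)
            if fp not in fp_cache and fp in self.index:
                cid = self.index[fp]
                self.container_reads += 1
                fp_cache.update(self.containers[cid])
                if cid in self.pfc_containers:
                    base_cache.update(self.containers[cid])
            is_dup.append(fp in fp_cache)
        # Pass 2: delta-compress unique chunks against prefetched chunks only,
        # so no extra container read is issued to fetch a base.
        recipe = []
        for chunk, dup in zip(stream, is_dup):
            fp = fingerprint(chunk)
            if dup:
                recipe.append(("dup", fp))
                continue
            candidates = [(delta_encode(chunk, b), bfp) for bfp, b in base_cache.items()]
            delta, base_fp = min(candidates, key=lambda c: len(c[0]), default=(None, None))
            if delta is not None and len(delta) < max_ratio * len(chunk):
                recipe.append(("delta", base_fp, delta))
            else:
                recipe.append(("unique", fp, chunk))
        return recipe
```

In this sketch, a chunk whose metadata block changed between backups fails the fingerprint lookup, but once the neighboring PFC triggers a read of its old container, the previous version of that chunk is already in memory and can serve as the base. A production system would presumably replace the exhaustive base search with a resemblance-detection step; that substitution is left out here.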

Full Text