Abstract

Many application programs in data-intensive science read and write large files. Such large data consume significant memory because file data is loaded into the page cache. Since memory is a critically valuable resource in data-intensive computing, reducing the memory footprint consumed by file data is essential. In this paper, we propose a cache deduplication mechanism with content-defined chunking (CDC) for the Gfarm distributed file system. CDC divides a file into variable-size blocks (chunks) based on the contents of the file. The client stores the chunks in the local file system as cache files and reuses them during subsequent file accesses. Deduplication of chunks reduces the amount of data transmitted between clients and servers, as well as storage and memory requirements. The experimental results demonstrate that the proposed mechanism significantly improves the performance of file-read operations and that the introduction of parallelism reduces the overhead of file-write operations.
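The chunking step described above can be sketched as follows. This is a minimal illustration of content-defined chunking using a polynomial rolling hash over a sliding window: a chunk boundary is declared wherever the low bits of the hash are zero, so boundaries depend on content rather than on fixed offsets. The function name `cdc_chunks` and all parameters (window size, boundary mask, minimum and maximum chunk sizes) are illustrative assumptions, not values from the paper.

```python
def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3FF,
               min_size: int = 256, max_size: int = 4096):
    """Split `data` into variable-size chunks at content-defined cut points.

    A Rabin-Karp-style rolling hash is computed over the last `window`
    bytes; a boundary is placed when (hash & mask) == 0, subject to
    minimum and maximum chunk-size limits.
    """
    B = 257                     # hash base
    M = (1 << 31) - 1           # hash modulus (Mersenne prime)
    BW = pow(B, window, M)      # B**window mod M, for removing old bytes
    chunks, start, h = [], 0, 0
    buf = []                    # last `window` bytes of the current chunk
    for i, b in enumerate(data):
        h = (h * B + b) % M
        buf.append(b)
        if len(buf) > window:
            old = buf.pop(0)
            h = (h - old * BW) % M   # slide the window forward
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
            buf.clear()
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks
```

Because boundaries are derived from local content, inserting or deleting bytes near the start of a file shifts only nearby chunk boundaries; later chunks typically realign, so their hashes match previously cached chunks and can be deduplicated.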
