Abstract

Deduplication in storage systems has recently gained momentum for its ability to reduce data footprint. However, deduplication complicates storage management because storage objects (e.g., files) are no longer independent of each other once they share content. In this paper, we present a graph-based framework to address the storage management challenges introduced by deduplication. Specifically, we model content sharing among storage objects with content sharing graphs (CSG) and apply graph-based algorithms to two real-world storage management use cases for deduplication-enabled storage systems. First, we developed a quasi-linear algorithm to partition deduplication domains in commercial deduplication-enabled storage systems with a minimal amount of deduplication loss (i.e., data replicated across the partitioned domains), whereas the general partitioning problem is NP-complete. For a real-world trace of 3 TB of data with 978 GB of removable duplicates, the proposed algorithm partitions the data into 15 balanced partitions with only 54 GB of deduplication loss, that is, a deduplication loss of about 5%. Second, we developed a quick and accurate method to query the deduplicated size of a subset of objects in a deduplicated storage system. For the same 3 TB trace, the optimized graph-based algorithm completes the query in 2.6 s, less than 1% of the time required by the traditional approach that works directly on the deduplication metadata.
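To make the notion of deduplication loss concrete, the sketch below measures it for a given assignment of files to partitions, following the abstract's definition (data replicated across partitioned domains). The chunk layout and partition assignment are made up for illustration; this is not the paper's partitioning algorithm, only the loss metric it seeks to minimize.

```python
# Illustrative sketch (hypothetical data): deduplication loss of a partitioning,
# i.e., the bytes that must be replicated because files sharing a chunk are
# assigned to different deduplication domains.
from collections import defaultdict

# file -> list of (chunk fingerprint, chunk size in bytes); made-up layout
files = {
    "f1": [("a", 4096), ("b", 4096), ("c", 8192)],
    "f2": [("a", 4096), ("d", 4096)],
    "f3": [("c", 8192), ("d", 4096), ("e", 4096)],
}
partition_of = {"f1": 0, "f2": 0, "f3": 1}  # a candidate partitioning

def deduplication_loss(files, partition_of):
    """Each chunk is stored once per partition that references it, so the
    replicated bytes are (number of referencing partitions - 1) * chunk size."""
    chunk_partitions = defaultdict(set)
    chunk_size = {}
    for f, chunks in files.items():
        for fp, size in chunks:
            chunk_partitions[fp].add(partition_of[f])
            chunk_size[fp] = size
    return sum((len(parts) - 1) * chunk_size[fp]
               for fp, parts in chunk_partitions.items())

print(deduplication_loss(files, partition_of))  # chunks "c" and "d" cross partitions -> 12288
```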

Highlights

  • Motivated by the current data explosion, data reduction methods like deduplication and compression have become popular features increasingly supported in primary storage systems

  • To illustrate the k-core decomposition and k-core size computation, let us take as an example the 19 files described in Table 1 and say that we want to partition them into two partitions in such a way that (1) we have a minimal loss in deduplication, (2) each file belongs to only one partition, and (3) the partitions have sizes relatively close to each other (a generic k-core sketch follows this list)

  • We propose a relatively small metadata layer, extracted from the raw deduplication metadata, to help efficiently answer on-demand size queries in a deduplicated storage system (an assumed sketch of such a query appears after this list)
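
The first highlight above refers to k-core decomposition; a generic sketch of computing core numbers by iterative peeling is shown below on a small hypothetical graph (the 19 files of Table 1 are not reproduced here, and this is not the paper's full partitioning procedure).

```python
# Generic k-core decomposition by peeling: repeatedly remove a vertex of
# minimum remaining degree; the running maximum of the degrees observed at
# removal time gives the core number of each removed vertex.
def core_numbers(adj):
    degree = {v: len(neigh) for v, neigh in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=degree.get)  # vertex with the smallest current degree
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# Hypothetical content-sharing graph: a triangle (A, B, C) plus a pendant vertex D.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
print(core_numbers(adj))  # D is only in the 1-core; A, B, C form the 2-core
```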

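The exact layout of the proposed metadata layer is not reproduced here; the sketch below only assumes per-file lists of chunk fingerprints and chunk sizes, and shows how an on-demand size query for a subset of files could be answered by counting each referenced chunk exactly once.

```python
# Assumed sketch of a deduplicated-size query over a subset of files, given a
# compact per-file index of (chunk fingerprint, chunk size); names are illustrative.
def deduplicated_size(subset, file_chunks):
    """Physical size of the subset after deduplication: every unique chunk
    referenced by at least one file in the subset is counted exactly once."""
    seen = {}
    for f in subset:
        for fingerprint, size in file_chunks[f]:
            seen[fingerprint] = size
    return sum(seen.values())

file_chunks = {  # hypothetical chunk layout
    "a.vmdk": [("c1", 4096), ("c2", 4096), ("c3", 4096)],
    "b.vmdk": [("c1", 4096), ("c4", 4096)],
}
print(deduplicated_size(["a.vmdk", "b.vmdk"], file_chunks))  # 16384, not 5 * 4096
```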

Introduction

Motivated by the current data explosion, data reduction methods like deduplication and compression have become popular features increasingly supported in primary storage systems. Representing the shared content only once provides another important property of the SCSG graphs: the deduplicated size of a folder (file system) is the sum of the vertex (raw file) sizes minus the sum of the connecting edge weights. Some applications need more detailed, chunk-level content sharing information for a set of deduplicated files. In row 4, “Deduplication Ratio” means the ratio of the amount of unique chunks to the amount of raw data.
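Under the stated property (shared content is represented only once in the SCSG), the deduplicated size of a folder can be read directly off the graph; the short example below uses made-up sizes and also computes the deduplication ratio as defined above.

```python
# Illustration of the stated SCSG property with made-up numbers:
# deduplicated size = sum of vertex (raw file) sizes - sum of connecting edge weights,
# where an edge weight is the amount of content its two files share (stored once).
raw_size = {"f1": 100, "f2": 80, "f3": 60}        # vertex sizes, e.g., in MB
shared = {("f1", "f2"): 30, ("f2", "f3"): 10}     # edge weights: shared content

deduplicated = sum(raw_size.values()) - sum(shared.values())
print(deduplicated)                               # 240 - 40 = 200
print(deduplicated / sum(raw_size.values()))      # deduplication ratio: unique data / raw data ~ 0.83
```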

