Abstract

Deduplication in storage systems has recently gained momentum for its ability to reduce data footprint. However, deduplication complicates storage management because storage objects (e.g., files) are no longer independent of each other once they share content. In this paper, we present a graph-based framework to address the storage management challenges introduced by deduplication. Specifically, we model content sharing among storage objects with content sharing graphs (CSG) and apply graph-based algorithms to two real-world storage management use cases for deduplication-enabled storage systems. First, we developed a quasi-linear algorithm to partition deduplication domains in commercial deduplication-enabled storage systems with a minimal amount of deduplication loss (i.e., data replicated across the partitioned domains), whereas the general partitioning problem is NP-complete. For a real-world trace of 3 TB of data with 978 GB of removable duplicates, the proposed algorithm partitions the data into 15 balanced partitions with only 54 GB of deduplication loss, that is, a deduplication loss of about 5%. Second, we developed a quick and accurate method to query the deduplicated size of a subset of objects in a deduplicated storage system. For the same 3 TB trace, the optimized graph-based algorithm completes the query in 2.6 s, less than 1% of the time required by the traditional approach that works directly on the deduplication metadata.
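To make the notion of deduplication loss concrete, the sketch below measures it for a given assignment of files to partitions, following the abstract's definition (data replicated across partitioned domains). The chunk layout and partition assignment are made up for illustration; this is not the paper's partitioning algorithm, only the loss metric it seeks to minimize.

```python
# Illustrative sketch (hypothetical data): deduplication loss of a partitioning,
# i.e., the bytes that must be replicated because files sharing a chunk are
# assigned to different deduplication domains.
from collections import defaultdict

# file -> list of (chunk fingerprint, chunk size in bytes); made-up layout
files = {
    "f1": [("a", 4096), ("b", 4096), ("c", 8192)],
    "f2": [("a", 4096), ("d", 4096)],
    "f3": [("c", 8192), ("d", 4096), ("e", 4096)],
}
partition_of = {"f1": 0, "f2": 0, "f3": 1}  # a candidate partitioning

def deduplication_loss(files, partition_of):
    """Each chunk is stored once per partition that references it, so the
    replicated bytes are (number of referencing partitions - 1) * chunk size."""
    chunk_partitions = defaultdict(set)
    chunk_size = {}
    for f, chunks in files.items():
        for fp, size in chunks:
            chunk_partitions[fp].add(partition_of[f])
            chunk_size[fp] = size
    return sum((len(parts) - 1) * chunk_size[fp]
               for fp, parts in chunk_partitions.items())

print(deduplication_loss(files, partition_of))  # chunks "c" and "d" cross partitions -> 12288
```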

Highlights

  • Motivated by the current data explosion, data reduction methods like deduplication and compression have become popular features increasingly supported in primary storage systems

  • To illustrate the k-core decomposition and k-core size computation, let us take as an example the 19 files described in Table 1 and say that we want to partition them into two partitions in such a way that (1) we have a minimal loss in deduplication, (2) each file belongs to only one partition, and (3) the partitions have sizes relatively close to each other (a generic k-core sketch follows this list)

  • We propose a relatively small metadata layer, extracted from the raw deduplication metadata, to help efficiently answer on-demand size queries in a deduplicated storage system (an assumed sketch of such a query appears after this list)
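
The first highlight above refers to k-core decomposition; a generic sketch of computing core numbers by iterative peeling is shown below on a small hypothetical graph (the 19 files of Table 1 are not reproduced here, and this is not the paper's full partitioning procedure).

```python
# Generic k-core decomposition by peeling: repeatedly remove a vertex of
# minimum remaining degree; the running maximum of the degrees observed at
# removal time gives the core number of each removed vertex.
def core_numbers(adj):
    degree = {v: len(neigh) for v, neigh in adj.items()}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=degree.get)  # vertex with the smallest current degree
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# Hypothetical content-sharing graph: a triangle (A, B, C) plus a pendant vertex D.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
print(core_numbers(adj))  # D is only in the 1-core; A, B, C form the 2-core
```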

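The exact layout of the proposed metadata layer is not reproduced here; the sketch below only assumes per-file lists of chunk fingerprints and chunk sizes, and shows how an on-demand size query for a subset of files could be answered by counting each referenced chunk exactly once.

```python
# Assumed sketch of a deduplicated-size query over a subset of files, given a
# compact per-file index of (chunk fingerprint, chunk size); names are illustrative.
def deduplicated_size(subset, file_chunks):
    """Physical size of the subset after deduplication: every unique chunk
    referenced by at least one file in the subset is counted exactly once."""
    seen = {}
    for f in subset:
        for fingerprint, size in file_chunks[f]:
            seen[fingerprint] = size
    return sum(seen.values())

file_chunks = {  # hypothetical chunk layout
    "a.vmdk": [("c1", 4096), ("c2", 4096), ("c3", 4096)],
    "b.vmdk": [("c1", 4096), ("c4", 4096)],
}
print(deduplicated_size(["a.vmdk", "b.vmdk"], file_chunks))  # 16384, not 5 * 4096
```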

Introduction

Motivated by the current data explosion, data reduction methods like deduplication and compression have become popular features increasingly supported in primary storage systems. Representing the shared content only once provides another important property of the SCSG graphs: the deduplicated size of a folder (file system) is the sum of the vertex (raw file) sizes minus the sum of the connecting edge weights. Some applications need more detailed, chunk-level content sharing information for a set of deduplicated files. In row 4, “Deduplication Ratio” means the ratio of the amount of unique chunks to the amount of raw data.
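Under the stated property (shared content is represented only once in the SCSG), the deduplicated size of a folder can be read directly off the graph; the short example below uses made-up sizes and also computes the deduplication ratio as defined above.

```python
# Illustration of the stated SCSG property with made-up numbers:
# deduplicated size = sum of vertex (raw file) sizes - sum of connecting edge weights,
# where an edge weight is the amount of content its two files share (stored once).
raw_size = {"f1": 100, "f2": 80, "f3": 60}        # vertex sizes, e.g., in MB
shared = {("f1", "f2"): 30, ("f2", "f3"): 10}     # edge weights: shared content

deduplicated = sum(raw_size.values()) - sum(shared.values())
print(deduplicated)                               # 240 - 40 = 200
print(deduplicated / sum(raw_size.values()))      # deduplication ratio: unique data / raw data ~ 0.83
```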

