Abstract

Deduplication has been principally employed in distributed storage systems to improve storage space efficiency. Traditional deduplication research ignores the design specifications of shared-nothing distributed storage systems, such as the absence of a central metadata bottleneck, scalability, and storage rebalancing. Likewise, inline deduplication integration poses serious threats to storage system read/write performance, consistency, and scalability. This is mainly due to ineffective and error-prone deduplication metadata, duplicate-lookup I/O redirection, and the placement of content fingerprints and data chunks. Further, transaction failures after deduplication integration often leave inconsistencies in data chunks and deduplication metadata, as well as garbage data chunks. In this paper, we propose Grate, a high-performance inline cluster-wide data deduplication scheme that complies with the design constraints of shared-nothing storage systems. In particular, Grate eliminates duplicate copies across the cluster for high storage space efficiency without jeopardizing performance. We employ a distributed deduplication metadata shard, which provides high-performance deduplication metadata and duplicate-fingerprint lookup I/Os without introducing a single point of failure. Data and deduplication metadata are placed cluster-wide based on the content fingerprints of chunks. We decouple the deduplication metadata shard from the read I/O path and replace it with a read manifestation object to further speed up read performance. To guarantee deduplication-enabled transaction consistency and efficient garbage identification, we design a flag-based asynchronous consistency scheme capable of repairing missing data chunks on duplicate arrival. We design and implement Grate in Ceph. The evaluation shows an average of 18% bandwidth improvement over the content-addressable deduplication approach at smaller chunk sizes (less than 128 KB) while maintaining high storage space savings.
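To make the placement idea concrete, the following is a minimal sketch of content-fingerprint-based placement of chunks and their deduplication metadata. The chunk size, hash choice, shard count, and helper names (fingerprint, shard_of, write_object) are illustrative assumptions, not the actual Grate implementation in Ceph.

```python
import hashlib

NUM_SHARDS = 8          # assumed number of deduplication metadata shards
CHUNK_SIZE = 64 * 1024  # assumed fixed 64 KB chunking, for illustration only

def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-256 hex digest)."""
    return hashlib.sha256(chunk).hexdigest()

def shard_of(fp: str) -> int:
    """Map a fingerprint to a shard; the chunk and its dedup metadata
    entry are both placed on the shard derived from the chunk's content."""
    return int(fp, 16) % NUM_SHARDS

def write_object(data: bytes, shards: list) -> list:
    """Split an object into chunks, store each unique chunk once per
    cluster keyed by fingerprint, and bump a reference count on duplicates."""
    manifest = []  # recipe of fingerprints needed to rebuild the object
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = fingerprint(chunk)
        shard = shards[shard_of(fp)]            # each shard: fp -> (chunk, refs)
        if fp in shard:
            stored, refs = shard[fp]
            shard[fp] = (stored, refs + 1)      # duplicate: only add a reference
        else:
            shard[fp] = (chunk, 1)              # unique: store chunk plus metadata
        manifest.append(fp)
    return manifest
```

Under these assumptions, reads would consult only the per-object manifest (loosely analogous to the read manifestation object described above) rather than the deduplication metadata shard.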

Highlights

  • Shared-nothing storage systems (SN-SS) accommodate a large number of storage servers for high performance, scalability, availability, and fault tolerance [1], [2]

  • SN-SS such as GlusterFS [2], Sorento [3], and Ceph Object Storage [1] are widely employed in cloud storage due to these properties

  • The evaluation assesses the proposed cluster-wide data deduplication framework

Introduction

Shared-nothing storage systems (SN-SS) accommodate a large number of storage servers for high performance, scalability, availability, and fault tolerance [1], [2]. The key characteristics of such systems include i) high performance and scalability, ii) no centralized metadata bottleneck, iii) no single point of failure, and iv) addition and removal of storage servers on the go. Distributed storage systems following the shared-nothing architecture, such as Ceph [1], [43] and GlusterFS [2], avoid a single-point performance bottleneck: neither employs a centralized metadata server, and both instead use a Distributed Hash Table (DHT) for data placement.
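As a rough illustration of DHT-style placement without a central metadata server, the sketch below hashes object names onto a ring of storage servers so that any client can compute an object's location locally. The hash function, virtual-node count, and server names are assumptions for illustration; this does not reproduce Ceph's CRUSH or GlusterFS's elastic hashing.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Deterministic position on a 32-bit hash ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class HashRing:
    """Simplified consistent-hashing DHT: object locations are computed
    from the name alone, so no centralized metadata server is consulted."""

    def __init__(self, servers, vnodes=64):
        self._ring = sorted(
            (ring_hash(f"{srv}#{v}"), srv)
            for srv in servers
            for v in range(vnodes)   # virtual nodes smooth the distribution
        )
        self._points = [p for p, _ in self._ring]

    def locate(self, obj_name: str) -> str:
        """Return the server responsible for obj_name."""
        idx = bisect.bisect(self._points, ring_hash(obj_name)) % len(self._ring)
        return self._ring[idx][1]

# Example: adding or removing a server only remaps keys adjacent to its ring points,
# which is what enables storage rebalancing without central coordination.
ring = HashRing(["osd-0", "osd-1", "osd-2"])
print(ring.locate("volume-42/object-7"))
```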
