Abstract

Every second, massive amounts of data are generated through the use of emerging technologies, and storing and handling such volumes of data is very challenging. Data deduplication addresses this problem: it eliminates duplicate data and stores only a single copy, reducing storage utilization and the cost of maintaining redundant data. Content-defined chunking (CDC) plays an important role in data deduplication systems because of its ability to detect high redundancy. In this paper, we focus on optimizing the deduplication system by tuning the relevant factors in CDC that identify chunk cut-points and by introducing an efficient fingerprint based on a new hash function. We propose a novel bytes frequency-based chunking (BFBC) algorithm and a new low-cost hashing function. To evaluate the efficiency of the proposed system, extensive experiments were conducted on two different datasets. In all experiments, the proposed system consistently outperformed the common CDC algorithms, achieving a better storage gain ratio and higher chunking and hashing throughput. In practice, our experiments show that BFBC is 10 times faster than the basic sliding window (BSW) algorithm and approximately three times faster than the two thresholds two divisors (TTTD) algorithm. The proposed triple hash function algorithm is five times faster than SHA-1 and MD5 and achieves a better deduplication elimination ratio (DER) than other CDC algorithms. The symmetry of our work lies in the balance between the proposed system's performance parameters and their reflection on system efficiency compared with other deduplication systems.
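
As a concrete illustration of the single-copy idea and the deduplication elimination ratio (DER) mentioned above, the sketch below shows a generic fingerprint-indexed chunk store. It is not the authors' pipeline: SHA-1 stands in for the paper's faster triple hash, the class and method names are hypothetical, and DER is computed with one common definition (logical bytes over physically stored bytes).

```python
import hashlib


class DedupStore:
    """Minimal fingerprint-indexed chunk store: each unique chunk is kept once."""

    def __init__(self):
        self.index = {}          # fingerprint -> stored chunk
        self.logical_bytes = 0   # total bytes submitted by clients
        self.stored_bytes = 0    # bytes actually kept after deduplication

    def put(self, chunk: bytes) -> str:
        # SHA-1 is only a placeholder; the paper proposes a low-cost triple hash instead.
        fp = hashlib.sha1(chunk).hexdigest()
        self.logical_bytes += len(chunk)
        if fp not in self.index:         # only previously unseen chunks consume storage
            self.index[fp] = chunk
            self.stored_bytes += len(chunk)
        return fp

    def der(self) -> float:
        """Deduplication elimination ratio: logical size / physically stored size."""
        return self.logical_bytes / self.stored_bytes if self.stored_bytes else 0.0
```

Feeding the store with the chunks produced by a CDC algorithm and reading off `der()` gives the kind of storage gain that the experiments report.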

Highlights

  • The amount of digital data is rising explosively, and the forecasted amount of data to be generated by the end of 2020 is about 44 zettabytes

  • We propose the bytes frequency-based chunking (BFBC) technique, a fast and efficient chunking algorithm that significantly reduces the number of computing operations by using multiple dynamically selected optimum divisors (D) together with the best threshold value, exploits the multi-operational nature of BFBC to reduce chunk-size variance, and maximizes chunking throughput while improving the deduplication elimination ratio (DER); a minimal sketch of this chunking loop is given after this list

  • We developed a content-defined chunking (CDC) algorithm based on the frequency of byte occurrences and a new hashing algorithm based on a mathematical triple hashing function
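
To make the chunking highlight concrete, the following is a minimal sketch of a BFBC-style chunking loop. It assumes the divisors D are chosen as the most frequent byte pairs in a training sample and that minimum/maximum thresholds bound the chunk size; the helper names, default values, and pair-based divisor choice are illustrative assumptions rather than the paper's exact parameters.

```python
from collections import Counter
from typing import Iterator, Set


def select_divisors(sample: bytes, num_divisors: int = 3) -> Set[bytes]:
    """Choose the most frequent byte pairs in a sample as the divisor set D (assumed)."""
    pairs = Counter(sample[i:i + 2] for i in range(len(sample) - 1))
    return {pair for pair, _ in pairs.most_common(num_divisors)}


def bfbc_chunks(data: bytes, divisors: Set[bytes],
                t_min: int = 2048, t_max: int = 16384) -> Iterator[bytes]:
    """Cut a chunk when the current byte pair matches a divisor and the chunk has
    reached t_min bytes; force a cut at t_max to limit chunk-size variance."""
    start, i = 0, 1
    while i < len(data):
        size = i - start + 1
        if (size >= t_min and data[i - 1:i + 1] in divisors) or size >= t_max:
            yield data[start:i + 1]
            start, i = i + 1, i + 2
        else:
            i += 1
    if start < len(data):
        yield data[start:]  # trailing bytes form the final chunk
```

Because cut-point detection reduces to a set-membership test on the current byte pair, the per-byte work stays constant, which is the kind of saving the highlight above attributes to BFBC.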


Introduction

The amount of digital data is rising explosively, and the amount of data forecast to be generated by the end of 2020 is about 44 zettabytes. Because of this “data flood,” storing and maintaining backups for such data efficiently and cost-effectively has become one of the most challenging and essential tasks in the big data domain [1,2,3]. Enterprises, IT companies, and industries need to store and operate on enormous amounts of data, and the central issue is how to manage them. Data deduplication techniques are used to manage these data properly.
