Abstract

Data chunking algorithms divide data into many small chunks in a certain way, thus transforming an operation on the data into operations on multiple small chunks. Data chunking algorithms have been widely used in duplicate data detection, parallel computing, and other fields, but they are seldom used in incremental data synchronization. Targeting the characteristics of incremental data synchronization, this paper proposes a novel data chunking algorithm, PCI. By dividing the two pieces of data that need to be synchronized into small chunks and comparing the contents of these chunks, the chunks that differ are the incremental data to be found. The new algorithm decides whether to set a cut-point based on the number of 1-bits in the binary representation of all bytes in an interval. It thereby improves resistance to the byte-shifting problem at the expense of chunk-size stability, which makes it more suitable for incremental data synchronization. Comparing this algorithm with several classical and state-of-the-art algorithms, experiments show that the incremental data found by this algorithm is 32%–57% smaller than that of the others under the same changes between two pieces of data. Experimental results on real-world datasets show that PCI improves the calculation speed of the classic Rsync algorithm by up to 70%, at the cost of increasing the transmission compression rate by up to 11.8%.
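The synchronization workflow described above — chunk both versions, fingerprint the chunks, and treat the chunks of the new version with no match in the old version as the incremental data — can be sketched as follows. The function names are hypothetical, and fixed-size splitting is used here purely for brevity; PCI itself uses content-defined boundaries.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 256) -> list:
    # Fixed-size splitting for illustration only; a content-defined
    # chunker such as PCI would pick boundaries from the byte content.
    return [data[i:i + size] for i in range(0, len(data), size)]


def incremental_chunks(old: bytes, new: bytes) -> list:
    """Return the chunks of `new` whose content does not appear in `old`.

    Both versions are chunked and fingerprinted; chunks of `new` with
    no matching fingerprint in `old` are the incremental data that
    must be transferred during synchronization.
    """
    old_fps = {hashlib.sha256(c).digest() for c in chunk_fixed(old)}
    return [c for c in chunk_fixed(new)
            if hashlib.sha256(c).digest() not in old_fps]
```

With identical inputs the result is empty; a localized change surfaces only the chunks that cover it, which is what keeps the transferred delta small.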

Highlights

  • Data chunking algorithm reads the data as a byte stream

  • We show analytically that PCI is more suitable for incremental data synchronization

  • In incremental data synchronization, chunks are used to locate changed data rather than to be stored, so the instability of chunk size has little impact

Summary

INTRODUCTION

Data chunking algorithms read the data as a byte stream. During reading, a single byte or several bytes are selected as a chunk boundary when certain conditions are met. The CDC algorithm, also known as variable-size chunking, decides whether a position acts as a boundary based on the content of the bytes read: the data is consumed as a byte stream while a sliding data window is maintained. Because this kind of algorithm is content-based, when byte shifting occurs, the window contents that satisfied the preset condition still satisfy it and are again set as boundaries. In Section 5, we experimentally compare PCI with five state-of-the-art CDC algorithms on test items including chunking speed, chunk size distribution, and incremental data discovery.
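The popcount-based cut-point condition described in the abstract can be sketched as below. The window size, threshold, and chunk-size bounds are illustrative assumptions, not the parameters evaluated in the paper, and the function name is hypothetical.

```python
def pci_chunk_boundaries(data: bytes, window: int = 5, threshold: int = 21,
                         min_chunk: int = 64, max_chunk: int = 8192) -> list:
    """Yield cut-point offsets using a PCI-style popcount condition.

    A boundary is declared when the total number of 1-bits in the
    current `window`-byte interval reaches `threshold` (subject to
    minimum and maximum chunk sizes). All parameter values here are
    assumptions for illustration.
    """
    boundaries = []
    start = 0  # offset where the current chunk began
    for i in range(window, len(data) + 1):
        if i - start < min_chunk:
            continue  # enforce the minimum chunk size
        # Count 1-bits over the trailing `window` bytes.
        ones = sum(bin(b).count("1") for b in data[i - window:i])
        if ones >= threshold or i - start >= max_chunk:
            boundaries.append(i)
            start = i
    return boundaries
```

Because the condition depends only on the bytes inside the window, inserting or deleting a byte shifts boundaries only locally: windows downstream of the edit still contain the same bytes and are cut at the same content positions, which is the byte-shift resistance the paper targets.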

BACKGROUND
TIME AND SPACE COMPLEXITY
EXPERIMENTS OF CHUNKING ALGORITHMS
OBJECTIVE
Findings
CONCLUSION
