Abstract

Data chunking algorithms divide data into many small chunks in a certain way, thus transforming an operation on the data into operations on multiple small chunks. Data chunking algorithms have been widely used in duplicate data detection, parallel computing, and other fields, but they are seldom used in incremental data synchronization. Targeting the characteristics of incremental data synchronization, this paper proposes a novel data chunking algorithm, PCI. By dividing the two pieces of data that need to be synchronized into small chunks and comparing the contents of these chunks, the chunks that differ are the incremental data to be found. The new algorithm decides whether to set a cut-point based on the number of 1-bits in the binary representation of all bytes in an interval. It thereby improves resistance to the byte-shifting problem at the expense of chunk-size stability, which makes it more suitable for incremental data synchronization. Comparing this algorithm with several classical and state-of-the-art algorithms, experiments show that the incremental data found by this algorithm is 32%–57% smaller than that of the others under the same changes between two pieces of data. Experimental results on real-world datasets show that PCI improves the calculation speed of the classic Rsync algorithm by up to 70%, at the cost of increasing the transmission compression rate by up to 11.8%.
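The synchronization workflow described above — chunk both versions, fingerprint the chunks, and treat the chunks of the new version with no match in the old version as the incremental data — can be sketched as follows. The function names are hypothetical, and fixed-size splitting is used here purely for brevity; PCI itself uses content-defined boundaries.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 256) -> list:
    # Fixed-size splitting for illustration only; a content-defined
    # chunker such as PCI would pick boundaries from the byte content.
    return [data[i:i + size] for i in range(0, len(data), size)]


def incremental_chunks(old: bytes, new: bytes) -> list:
    """Return the chunks of `new` whose content does not appear in `old`.

    Both versions are chunked and fingerprinted; chunks of `new` with
    no matching fingerprint in `old` are the incremental data that
    must be transferred during synchronization.
    """
    old_fps = {hashlib.sha256(c).digest() for c in chunk_fixed(old)}
    return [c for c in chunk_fixed(new)
            if hashlib.sha256(c).digest() not in old_fps]
```

With identical inputs the result is empty; a localized change surfaces only the chunks that cover it, which is what keeps the transferred delta small.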

Highlights

  • Data chunking algorithm reads the data as a byte stream

  • We show analytically that PCI is more suitable for incremental data synchronization

  • In incremental data synchronization, chunks are used to locate changed data rather than to be stored, so the instability of chunk size has little impact

Summary

INTRODUCTION

Data chunking algorithms read the data as a byte stream. During reading, a single byte or several bytes are selected as a chunk boundary when certain conditions are met. The CDC algorithm, also known as variable-size chunking, decides whether a position acts as a boundary based on the content of the bytes read: the data is consumed as a byte stream while a sliding data window is maintained. Because this kind of algorithm is content-based, when byte shifting occurs, the window contents that satisfied the preset condition still satisfy it and are again set as boundaries. In Section 5, we experimentally compare PCI with five state-of-the-art CDC algorithms on test items including chunking speed, chunk size distribution, and incremental data discovery.
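The popcount-based cut-point condition described in the abstract can be sketched as below. The window size, threshold, and chunk-size bounds are illustrative assumptions, not the parameters evaluated in the paper, and the function name is hypothetical.

```python
def pci_chunk_boundaries(data: bytes, window: int = 5, threshold: int = 21,
                         min_chunk: int = 64, max_chunk: int = 8192) -> list:
    """Yield cut-point offsets using a PCI-style popcount condition.

    A boundary is declared when the total number of 1-bits in the
    current `window`-byte interval reaches `threshold` (subject to
    minimum and maximum chunk sizes). All parameter values here are
    assumptions for illustration.
    """
    boundaries = []
    start = 0  # offset where the current chunk began
    for i in range(window, len(data) + 1):
        if i - start < min_chunk:
            continue  # enforce the minimum chunk size
        # Count 1-bits over the trailing `window` bytes.
        ones = sum(bin(b).count("1") for b in data[i - window:i])
        if ones >= threshold or i - start >= max_chunk:
            boundaries.append(i)
            start = i
    return boundaries
```

Because the condition depends only on the bytes inside the window, inserting or deleting a byte shifts boundaries only locally: windows downstream of the edit still contain the same bytes and are cut at the same content positions, which is the byte-shift resistance the paper targets.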

BACKGROUND
TIME AND SPACE COMPLEXITY
EXPERIMENTS OF CHUNKING ALGORITHMS
OBJECTIVE
Findings
CONCLUSION
