Double Sliding Window Chunking Algorithm for Data Deduplication in Ocean Observation

Shuai Guo, Xiaodong Mao, Shuang Wang, Meng Sun

doi:10.1109/access.2023.3276785

Open Access

https://doi.org/10.1109/access.2023.3276785

Journal: IEEE Access	Publication Date: Jan 1, 2023
Citations: 3	License type: CC BY-NC-ND 4.0

Affiliation: Qingdao University of Technology

Abstract

Data deduplication is an essential means of eliminating redundant data, and it plays a significant role in today's era of massive data growth. In recent years, the rapid development of related industries such as marine monitoring has caused the volume of marine monitoring data to explode, raising storage costs for marine observation stations. Faced with this surge in data size, a natural first step is to use data deduplication to reduce the amount of stored data and thereby save storage costs. Many deduplication techniques are available, however. Because block-level deduplication is better suited to this task, and its core problem is how to cut data into blocks, this paper proposes a chunking technique based on double sliding windows. The double sliding window structure makes the sizes of the resulting data blocks more uniform, which reduces the memory consumed by the fingerprint table. We also add a prediction algorithm to the deduplication system that predicts the cut points of data blocks to improve chunking efficiency. In addition, we propose a more accurate method for calculating the deduplication ratio, which allows algorithm performance to be compared more precisely; the final experimental results of this paper are obtained with this method. Moreover, we propose a model based on Markov prediction for storing massive ocean data, which saves further resources. Finally, we compare commonly used chunking algorithms through careful experiments, including experiments on a public dataset that compare their deduplication ratios.
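To make the idea of block-level deduplication concrete, the following is a minimal sketch of generic content-defined chunking with a fingerprint table. It is not the double sliding window algorithm proposed in the paper, which the abstract does not specify: it uses a single sliding window with a simple rolling byte sum, omits the cut-point prediction step, and computes the deduplication ratio with the conventional definition (original size divided by stored size) rather than the paper's refined method. All function names and parameter values (`window`, `mask`, `min_size`, `max_size`) are illustrative assumptions.

```python
import hashlib


def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F,
                     min_size: int = 32, max_size: int = 256) -> list[int]:
    """Return chunk end offsets using one sliding window (illustrative only)."""
    boundaries = []
    start = 0      # offset where the current chunk begins
    rolling = 0    # sum of the last `window` bytes of the current chunk
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        length = i - start + 1
        # Cut when the window content matches the mask (content-defined cut),
        # or when the chunk reaches the maximum size (forced cut).
        content_cut = length >= min_size and (rolling & mask) == mask
        if content_cut or length >= max_size:
            boundaries.append(i + 1)
            start = i + 1
            rolling = 0
    if start < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries


def deduplicate(data: bytes, **kw) -> tuple[dict[str, bytes], list[str]]:
    """Store each unique chunk once, keyed by its SHA-256 fingerprint."""
    store: dict[str, bytes] = {}  # fingerprint table: fingerprint -> chunk
    recipe: list[str] = []        # ordered fingerprints needed to rebuild data
    prev = 0
    for end in chunk_boundaries(data, **kw):
        chunk = data[prev:end]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
        prev = end
    return store, recipe


def restore(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Rebuild the original byte stream from the chunk store and recipe."""
    return b"".join(store[fp] for fp in recipe)


def dedup_ratio(data: bytes, store: dict[str, bytes]) -> float:
    """Conventional ratio: original size / stored size (not the paper's formula)."""
    stored = sum(len(chunk) for chunk in store.values())
    return len(data) / stored if stored else 1.0
```

In this sketch the `min_size` and `max_size` bounds are what limit how far chunk sizes can vary. The abstract attributes more uniform chunk sizes, and hence a smaller in-memory fingerprint table, to the double-window structure; the size bounds here are only a simple stand-in for that effect.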

Full Text