Abstract

The amount of data that needs to be stored worldwide is growing significantly. More and more companies are turning to deduplication systems, which increase the effective capacity of data storage and reduce storage costs. Deduplication not only reduces the overall amount of information in storage but also reduces the load on networks by eliminating the need to retransmit duplicate data. In this work, we considered the stages that any deduplication system includes, namely chunking, hashing and indexing, and mapping. The effectiveness of a deduplication system depends primarily on the method chosen to divide the data stream at the chunking stage. We considered the classic Two Threshold Two Divisor (TTTD) method, which is widely used in modern deduplication systems. This method uses Rabin's fingerprint to compute the hash of a substring; we give the formulas for computing the hash of the first substring and of each subsequent substring. The other method we investigated is Content Based Two Threshold Two Divisor (CB-TTTD), which uses new hash functions to fragment the data stream; the corresponding formulas for the first and each subsequent substring are also given. To test the effectiveness of these two methods, we developed a test deduplication system, implemented both fragmentation methods, and tested their performance on two sets of text data. We also modified both methods by adding a new string-splitting condition based on the content specifics of the test data, and we report a comparison of the classical and modified methods. Using metrics for comparing the efficiency of data fragmentation methods, we obtained experimental data from which conclusions can be drawn about the feasibility of using CB-TTTD as an alternative to TTTD in new deduplication systems.
The obtained data can be used to develop new, highly efficient data deduplication systems and to improve existing solutions.
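
As an illustration of the stages named above (chunking, hashing and indexing, mapping), the following is a minimal sketch of TTTD-style chunking followed by a hash index. It is not the test system described in this work. A simple polynomial rolling hash stands in for Rabin's fingerprint, and the function names, window size, thresholds, and divisors are all assumptions chosen only to keep the example small.

```python
import hashlib

BASE = 257                # assumed base of the stand-in polynomial rolling hash
MOD = (1 << 31) - 1       # assumed modulus of the rolling hash


def tttd_chunks(data, window=16, t_min=64, t_max=512, d_main=128, d_backup=64):
    """Split bytes into chunks using the two-threshold, two-divisor rules.

    All default parameter values are illustrative assumptions.
    """
    chunks = []
    start = 0             # index where the current chunk begins
    backup = -1           # most recent backup breakpoint (exclusive end), -1 if none
    h = 0                 # rolling hash of the last `window` bytes of the chunk
    top = pow(BASE, window - 1, MOD)
    i = 0
    n = len(data)
    while i < n:
        # Slide the window: drop the oldest byte once the window is full.
        if i - start >= window:
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + data[i]) % MOD
        length = i - start + 1
        i += 1
        if length < t_min:
            continue      # never cut before the minimum threshold
        if h % d_backup == d_backup - 1:
            backup = i    # remember a fallback breakpoint (second divisor)
        if h % d_main == d_main - 1:
            cut = i       # main divisor matched: cut here
        elif length >= t_max:
            # Maximum threshold reached: prefer the backup breakpoint.
            cut = backup if backup != -1 else i
        else:
            continue
        chunks.append(data[start:cut])
        start = cut
        backup = -1
        h = 0
        i = cut           # rescan from the cut when the backup point was used
    if start < n:
        chunks.append(data[start:])   # trailing chunk may be shorter than t_min
    return chunks


def deduplicate(data, **params):
    """Hash each chunk, store unique chunks in an index, and record the mapping."""
    index = {}            # digest -> chunk bytes (unique chunks only)
    recipe = []           # ordered digests needed to rebuild the input
    for chunk in tttd_chunks(data, **params):
        digest = hashlib.sha256(chunk).hexdigest()
        index.setdefault(digest, chunk)
        recipe.append(digest)
    return index, recipe


def restore(index, recipe):
    """Rebuild the original bytes from the index and the recipe."""
    return b"".join(index[d] for d in recipe)
```

Calling `deduplicate(data)` returns the unique chunks and the ordered list of digests, and `restore(index, recipe)` reproduces the input. Comparing the total size of the stored chunks with `len(data)` gives one simple measure of how well a chunking method exposes duplicates.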

