Abstract

Every second, massive amounts of data are generated by emerging technologies, and storing and handling such volumes is very challenging. Data deduplication addresses this problem: it eliminates duplicate data and stores only a single copy, reducing storage utilization and the cost of maintaining redundant data. Content-defined chunking (CDC) plays an important role in data deduplication systems due to its ability to detect high redundancy. In this paper, we focus on optimizing the deduplication system by tuning the factors that govern how CDC identifies chunk cut-points and by introducing efficient fingerprinting based on a new hash function. We propose a novel bytes frequency-based chunking (BFBC) algorithm and a new low-cost hashing function. To evaluate the efficiency of the proposed system, extensive experiments were conducted on two different datasets. In all experiments, the proposed system consistently outperformed the common CDC algorithms, achieving a better storage gain ratio and higher chunking and hashing throughput. In practice, our experiments show that BFBC is 10 times faster than basic sliding window (BSW) and approximately three times faster than two thresholds two divisors (TTTD). The proposed triple hash function algorithm is five times faster than SHA1 and MD5 and achieves a better deduplication elimination ratio (DER) than other CDC algorithms. The symmetry of our work lies in the balance between the proposed system's performance parameters and their effect on system efficiency compared to other deduplication systems.
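
To make the workflow the abstract describes more concrete, the following sketch shows a generic deduplication loop: split the input into chunks, fingerprint each chunk, and store only chunks whose fingerprints have not been seen before. This is a minimal illustration rather than the paper's implementation; fixed-size chunking and SHA1 are used only as placeholders for the proposed BFBC algorithm and triple hash function, and the 4 KB chunk size is an arbitrary assumption.

    import hashlib

    # Minimal deduplication sketch: fingerprint each chunk and keep a single copy
    # per distinct fingerprint. Fixed-size chunking and SHA1 are placeholders for
    # the paper's BFBC chunking and triple hash function.

    def chunk_fixed(data: bytes, size: int = 4096):
        """Split data into fixed-size chunks (stand-in for content-defined chunking)."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def deduplicate(data: bytes):
        """Return (store, recipe): unique chunks keyed by fingerprint, plus the
        ordered list of fingerprints needed to rebuild the original data."""
        store = {}   # fingerprint -> chunk bytes (only unique chunks are kept)
        recipe = []  # ordered fingerprints for reconstruction
        for chunk in chunk_fixed(data):
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in store:
                store[fp] = chunk  # new chunk: store one copy
            recipe.append(fp)      # duplicates only add a reference
        return store, recipe

    def restore(store, recipe) -> bytes:
        """Rebuild the original data from the unique-chunk store and the recipe."""
        return b"".join(store[fp] for fp in recipe)

    # Highly redundant input deduplicates down to a single stored chunk.
    data = b"A" * 4096 * 8
    store, recipe = deduplicate(data)
    assert restore(store, recipe) == data
    assert len(store) == 1  # one unique chunk, eight references

In this framing, the deduplication elimination ratio (DER) cited in the abstract can be read, roughly, as the ratio of the original data size to the total size of the unique chunks actually stored.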

Highlights

  • The amount of digital data is rising explosively, and the forecasted amount of data to be generated by the end of 2020 is about 44 zettabytes

  • We propose the bytes frequency-based chunking (BFBC) technique as a fast and efficient chunking algorithm: we substantially reduce the number of computing operations by using multiple dynamic optimal divisors (D) with the best threshold value, exploit the multi-operational nature of BFBC to reduce chunk-size variance, and maximize chunking throughput with an improved deduplication elimination ratio (DER)

  • We developed a content-defined chunking (CDC) algorithm based on the frequency of byte occurrences and a new hashing algorithm based on a mathematical triple hashing function (see the sketch after this list)
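
As a rough illustration of the byte-frequency idea behind BFBC, the sketch below derives chunking divisors from the most frequent adjacent byte pairs in the data and declares a cut-point when the current pair matches a divisor and the chunk has reached a minimum size, with a maximum size as a safety bound. The number of divisors and the minimum/maximum thresholds here are illustrative assumptions, not the tuned parameters reported in the paper.

    from collections import Counter

    # Illustrative BFBC-style chunking: divisors are the most frequent adjacent
    # byte pairs in the data; a cut-point is declared when the current pair is a
    # divisor and the chunk has reached a minimum size (t_min), with a forced cut
    # at a maximum size (t_max). Parameter values are assumptions for illustration.

    def frequent_pair_divisors(data: bytes, num_divisors: int = 3):
        """Return the num_divisors most frequent adjacent byte pairs in data."""
        pairs = Counter(zip(data, data[1:]))
        return {pair for pair, _ in pairs.most_common(num_divisors)}

    def bfbc_style_chunking(data: bytes, divisors, t_min: int = 1024, t_max: int = 8192):
        """Split data at divisor pairs, respecting minimum and maximum chunk sizes."""
        chunks = []
        start = 0
        for i in range(1, len(data)):
            size = i - start + 1
            pair = (data[i - 1], data[i])
            if (size >= t_min and pair in divisors) or size >= t_max:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])  # trailing chunk
        return chunks

    # Example usage on a hypothetical backup payload.
    # payload = open("backup.img", "rb").read()
    # chunks = bfbc_style_chunking(payload, frequent_pair_divisors(payload))

Because the divisors are drawn from the data's own byte statistics, frequent pairs supply plenty of candidate cut-points, which keeps chunk sizes close to the minimum threshold and is consistent with the reduced chunk-size variance claimed in the highlights.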


Introduction

The amount of digital data is rising explosively, and the forecasted amount of data to be generated by the end of 2020 is about 44 zettabytes. Because of this “data flood,” storing and maintaining backups of such data efficiently and cost-effectively has become one of the most challenging and essential tasks in the big data domain [1,2,3]. Enterprises, IT companies, and industries need to store and operate on enormous amounts of data, and the central issue is how to manage them. Data deduplication techniques are used to manage these data properly.


