Abstract

A cluster deduplication system can coordinate the work of multiple nodes, which better alleviates the disk index bottleneck found in large-scale data backup systems. However, the nodes act as isolated islands of information during data deduplication. When the servers route data in query mode, a large amount of system overhead is required to ensure a high deduplication rate, and the throughput rate is low. Conversely, the servers cannot obtain a high deduplication rate if they adopt the stateless routing method. The data routing strategy can therefore greatly affect the overall performance of the system. This paper proposes the concept of data frequency and designs a classified routing strategy based on it. The metadata server maintains a byte-based Bloom filter that records the occurrence frequency of data blocks, and the values in the Bloom filter are counted. The frequency of each data block is then compared with a configured threshold value to determine whether the data is “hot data”. We use stateful routing to send “cold data” to the storage nodes and stateless routing to send hot data to the storage nodes. Experimental results show that the classified routing algorithm based on data frequency greatly reduces system overhead while preserving the deduplication rate of the deduplication system, and that it improves system throughput and real-time processing capability. Compared with the fully stateful routing scheme, our method loses less than 2% of the deduplication rate, reduces the communication query overhead by more than 25%, and improves the real-time processing capability of the storage system.
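The frequency check described in the abstract can be illustrated with a minimal sketch. It assumes one saturating 8-bit counter per slot and SHA-1-derived slot indexes. The class name `ByteCountingBloomFilter`, the function `choose_routing`, the slot count, the number of hash functions, and the threshold are all hypothetical choices for illustration and are not taken from the paper.

```python
import hashlib


class ByteCountingBloomFilter:
    """Counting Bloom filter with one saturating byte counter per slot (assumed layout)."""

    def __init__(self, num_slots=1 << 16, num_hashes=4):
        self.counters = bytearray(num_slots)  # every counter starts at 0
        self.num_slots = num_slots
        self.num_hashes = num_hashes

    def _slots(self, fingerprint):
        # Derive num_hashes slot indexes by salting SHA-1 with the hash number.
        slots = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + fingerprint).digest()
            slots.add(int.from_bytes(digest[:8], "big") % self.num_slots)
        return slots

    def add(self, fingerprint):
        # Increment each distinct slot once; a byte counter saturates at 255.
        for slot in self._slots(fingerprint):
            if self.counters[slot] < 255:
                self.counters[slot] += 1

    def frequency(self, fingerprint):
        # The smallest counter is an upper bound on the true occurrence count.
        return min(self.counters[slot] for slot in self._slots(fingerprint))


def choose_routing(bloom, fingerprint, hot_threshold):
    """Record one occurrence, then pick a routing mode as the abstract describes."""
    bloom.add(fingerprint)
    if bloom.frequency(fingerprint) >= hot_threshold:
        return "stateless"  # hot data
    return "stateful"  # cold data
```

With a threshold of 3, the first two occurrences of a block fingerprint are classified as cold and the third as hot. Because counters can be shared between fingerprints, the estimate may overstate the true frequency but never understates it until a counter saturates.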

Highlights

  • With the rapid development of Internet technology and the advent of big data, the total amount of digital information in the world is growing exponentially

  • Cluster data deduplication technology based on cloud storage systems has emerged as a new research field in recent years [3]–[6]

  • The deduplication rate of deduplication clusters was tested with different numbers of nodes and without feature fingerprint sampling on the superblock. The comparison covered the classification routing algorithm based on data frequency (DRDF), the stateful routing of the EMC scheme, stateless routing, and other methods, with 1, 3, 7, 15, 31, 63, and 127 nodes in the deduplication cluster


Summary

INTRODUCTION

With the rapid development of Internet technology and the advent of big data, the total amount of digital information in the world is growing exponentially. Cluster data deduplication technology based on cloud storage systems has emerged as a new research field in recent years [3]–[6]. This technology builds a data deduplication system in the form of a cluster, which can use multiple nodes working in coordination to better alleviate the large-scale disk index bottleneck in data backup systems. The EMC proposal uses query-style stateful routing: before data is routed, the storage nodes are queried so that the data to be stored is sent to the most appropriate node for processing. This kind of stateful routing has a large overhead and a low throughput rate, which affects the deduplication performance of the cluster. Methods that reduce calculation and query overhead and increase throughput would therefore improve the overall performance of the system.
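The trade-off between the two routing modes can be sketched as follows. This is an illustrative simplification, not the EMC scheme itself: the function names, the use of in-memory sets as per-node fingerprint indexes, and the overlap-count criterion for "most appropriate node" are all assumptions made for the example.

```python
import hashlib


def stateless_route(fingerprint, num_nodes):
    """Pick a node from the fingerprint alone; no storage node is queried."""
    digest = hashlib.sha1(fingerprint).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def stateful_route(fingerprints, node_indexes):
    """Query every node and pick the one that already holds the most fingerprints.

    node_indexes is a list of sets, one per storage node (a stand-in for
    each node's real fingerprint index). Returns (chosen node, query count).
    """
    queries = 0
    best_node, best_overlap = 0, -1
    for node_id, index in enumerate(node_indexes):
        queries += 1  # one round trip per node in this simplified model
        overlap = sum(1 for fp in fingerprints if fp in index)
        if overlap > best_overlap:
            best_node, best_overlap = node_id, overlap
    return best_node, queries


nodes = [set(), {b"a", b"b"}, {b"a"}]
chosen, cost = stateful_route([b"a", b"b", b"c"], nodes)
```

In this model the stateful call costs one query per node, so its overhead grows with cluster size, while the stateless call costs no queries but ignores what each node already stores.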
