Abstract

Deduplication is a popular data reduction technology in storage systems. It finds and eliminates duplicate data, which reduces the storage capacity required, increases resource utilization, and saves storage costs. File features are a key factor in calculating the similarity between files, but similarity calculated from a single feature has limitations, especially for similar files. The storage node feature reflects the load condition of the node, which is a key factor to consider in data routing. This paper introduces a multifeature data routing strategy (DRMF). The routing strategy is based on the features of the cluster and comprises routing communication, file similarity calculation, and determination of the target node. Routing communication provides the mutual exchange of information between routing servers and storage nodes. Each storage node calculates similarity against the files it stores, and the file is then routed according to the information provided by the routing server. The routing server determines the target node of the route according to the similarity results and the node load features. A system prototype is designed and implemented; we also develop a system to process the cluster features and determine the specific parameters of the various features used in the experiments. Finally, we simulate multifeature and single-feature data routing and compare the deduplication rate and data skew of the two strategies. The experimental results show that, compared with the single-feature-based routing strategy MCS, DRMF improves the deduplication rate of the cluster and maintains a lower data skew rate.
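To make the target-node decision more concrete, the following is a minimal sketch of how a routing server might weigh file similarity against node load. The `NodeInfo` record, the linear score, and the `load_weight` parameter are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    """Hypothetical per-node report held by the routing server."""
    name: str
    similarity: float    # similarity of the incoming file to data on this node, in [0, 1]
    used_bytes: int      # current storage usage
    capacity_bytes: int  # total storage capacity


def choose_target(nodes, load_weight=0.5):
    """Pick the node with the highest similarity-minus-load score.

    The linear score and load_weight are assumptions for illustration,
    not the scoring rule used by DRMF.
    """
    def score(node):
        load = node.used_bytes / node.capacity_bytes
        return node.similarity - load_weight * load

    return max(nodes, key=score)


nodes = [
    NodeInfo("node-a", similarity=0.80, used_bytes=90, capacity_bytes=100),
    NodeInfo("node-b", similarity=0.70, used_bytes=30, capacity_bytes=100),
    NodeInfo("node-c", similarity=0.10, used_bytes=10, capacity_bytes=100),
]
target = choose_target(nodes)
```

With these sample values, a similarity-only choice (`load_weight=0.0`) would pick the nearly full `node-a`, while the combined score prefers the slightly less similar but lightly loaded `node-b`. This is the kind of trade-off between deduplication rate and data skew that the abstract describes.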

Highlights

  • According to the International Data Corporation (IDC) report, the global datasphere will grow from 33 ZB in 2018 to 175 ZB in 2025 [1]

  • The director node first selects a group of storage nodes for each file using file type as the criterion; the client node then chooses a representative chunk for each super chunk and broadcasts it to all the selected nodes, and the node containing a matching chunk is chosen as the destination. The communication cost of this system is high because it floods the network with representative chunk fingerprints for every super chunk

  • DRMF drastically reduces communication overhead because it uses a multifeature-based data routing strategy. The strategy represents a whole data stream and is sent to the network only for similarity computation
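The broadcast-based scheme in the second highlight can be sketched as follows, mainly to show where its communication cost comes from. The choice of representative (smallest fingerprint), the fallback rule, and all names are assumptions made for illustration; they are not details of the cited system or of DRMF.

```python
import hashlib


def fingerprint(chunk):
    """SHA-1 hex digest used as the chunk fingerprint."""
    return hashlib.sha1(chunk).hexdigest()


def route_super_chunk(super_chunk, node_indexes):
    """Return (destination, messages_sent) for one super chunk.

    node_indexes maps a node name to the set of fingerprints it already stores.
    """
    # Assumption: the representative is the chunk with the smallest fingerprint.
    representative = min(fingerprint(c) for c in super_chunk)
    messages = len(node_indexes)  # one broadcast query per candidate node
    for name, index in node_indexes.items():
        if representative in index:
            return name, messages
    # Fallback assumption: with no match, pick the node holding the fewest fingerprints.
    fallback = min(node_indexes, key=lambda n: len(node_indexes[n]))
    return fallback, messages


super_chunk = [b"alpha", b"beta", b"gamma"]
node_indexes = {
    "node-a": {fingerprint(b"zeta")},
    "node-b": {fingerprint(c) for c in super_chunk},
    "node-c": set(),
}
destination, messages = route_super_chunk(super_chunk, node_indexes)
```

In this sketch every super chunk costs one message per candidate node, so the total number of queries grows with the product of super chunks and nodes. That growth is the flooding behavior the highlight refers to.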

Summary

Introduction

According to the International Data Corporation (IDC) report, the global datasphere will grow from 33 ZB in 2018 to 175 ZB in 2025 [1]. Duplicates are replaced with links to old data, while data belonging to unique hashes are stored [3]. These backup systems with data deduplication can usually remove duplicate data only on a single node, and because the storage capacity of a single node is low, it is difficult for them to meet large-scale storage requirements [3]. Although the deduplication cluster improves the data operation throughput of the storage backup system, it reduces the deduplication rate of a single node in the cluster. Using a suitable data routing strategy that routes files containing duplicate data to the same node can ensure a high deduplication rate for the cluster nodes. (3) The experiment simulated the multicharacteristic data routing strategy and the popular single-characteristic data routing strategy. The experiment showed that multicharacteristic data routing performed better in cluster deduplication rate and load balancing.
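As a minimal illustration of hash-based deduplication on a single node, the sketch below stores each distinct chunk once and keeps an ordered list of fingerprints as the links to stored data. The fixed sample chunks, the use of SHA-1, and the ratio definition (logical bytes over stored bytes) are assumptions chosen for the example, not choices documented in the paper.

```python
import hashlib


def deduplicate(chunks):
    """Store each distinct chunk once; represent the stream as fingerprints."""
    store = {}   # fingerprint -> chunk bytes (unique data)
    recipe = []  # ordered fingerprints acting as links to stored chunks
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in store:
            store[fp] = chunk
        recipe.append(fp)
    return store, recipe


def dedup_ratio(chunks, store):
    """Logical bytes divided by physically stored bytes (one common definition)."""
    logical = sum(len(c) for c in chunks)
    physical = sum(len(c) for c in store.values())
    return logical / physical


chunks = [b"aaaa", b"bbbb", b"aaaa", b"cccc", b"aaaa", b"bbbb"]
store, recipe = deduplicate(chunks)
restored = [store[fp] for fp in recipe]  # follow the links to rebuild the stream
```

Here six 4-byte chunks reduce to three stored chunks, and the original stream can be rebuilt from the recipe. If the same chunks were spread across several nodes without a suitable routing strategy, each node would see fewer repeats and store more data, which is the per-node loss in deduplication rate described above.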

Related Work
DRMF: Multifeature-Based Data Routing Strategy
Experimental Evaluation
Test Results and Analysis