Abstract

The continued growth of data and the demand for high application continuity have created a critical and mounting need for storage-efficient, high-performance data protection. New technologies, especially D2D (Disk-to-Disk) deduplication storage, have therefore received wide attention in both academia and industry in recent years. Existing deduplication systems mainly rely on duplicate locality within the backup workload to achieve high throughput, but their read performance degrades when duplicate locality is poor. This paper presents the design and performance evaluation of a D2D-based de-duplication file backup system, which employs caching techniques to improve write throughput while encoding files as graphs called BP-DAGs (Bi-pointer-based Directed Acyclic Graphs). BP-DAGs not only satisfy the ‘unique’ chunk storing policy of de-duplication, but also help improve file read performance for workloads with poor duplicate locality. Evaluation results show that, under representative workloads, the system achieves read performance comparable to that of non-de-duplication backup systems such as Bacula, and that the metadata storage overhead of BP-DAGs is reasonably low.

Highlights

  • Data explosion [1] has been forcing backup systems to expand storage capacity, leaving modern enterprises facing significant cost pressures and data management challenges

  • This paper focuses mainly on data de-duplication and BP-DAGs rather than on backup job management, so the rest of the section is dedicated to the workflow of the backup agent and the storage server

  • If the index chunk is new, it is stored in the container and its new address is recorded in the BP-DAG; otherwise, it is discarded and its address pointer is copied from the fingerprint cache into the BP-DAG


Summary

INTRODUCTION

Data explosion [1] has been forcing backup systems to expand storage capacity, leaving modern enterprises facing significant cost pressures and data management challenges. The key challenge for enterprise data protection is to build storage-efficient backup systems that deliver high throughput for both data writes and reads. Most existing de-duplication systems use caching techniques that exploit duplicate locality within the backup stream to avoid the disk index bottleneck and achieve high de-duplication throughput [9, 10]. In existing de-duplication systems, file chunks are indexed by their fingerprints (i.e., hash pointers), an approach known as Content-Addressed Storage (CAS) [14]. To maintain high read throughput under various workloads, the system presented here encodes files as graphs called Bi-Pointer-based Directed Acyclic Graphs (BP-DAGs), whose nodes hold variable-sized chunks of data and whose edges are hash plus address pointers.
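The idea of an edge that carries both a hash pointer and an address pointer can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `BiPointer`, `ChunkStore`, `encode_file` and `restore` are hypothetical, SHA-1 is an assumed fingerprint function, the `length` field is an added convenience, an in-memory `bytearray` and `dict` stand in for the on-disk container and the fingerprint cache, and the file is encoded as a flat list of pointers rather than a multi-level graph.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BiPointer:
    """Hypothetical edge type: a hash pointer plus an address pointer."""
    fingerprint: bytes  # hash pointer: identifies the chunk by its content
    address: int        # address pointer: offset of the chunk in the store
    length: int         # assumption: chunk size is kept alongside the address


class ChunkStore:
    """In-memory stand-in for a chunk container and its fingerprint cache."""

    def __init__(self) -> None:
        self.log = bytearray()  # append-only chunk container
        self.cache = {}         # fingerprint -> BiPointer

    def put(self, data: bytes) -> BiPointer:
        fingerprint = hashlib.sha1(data).digest()
        pointer = self.cache.get(fingerprint)
        if pointer is None:
            # New chunk: append it and record its new address.
            pointer = BiPointer(fingerprint, len(self.log), len(data))
            self.log.extend(data)
            self.cache[fingerprint] = pointer
        # Duplicate chunk: the data is discarded and the cached pointer reused.
        return pointer

    def get(self, pointer: BiPointer) -> bytes:
        # Read directly by address; no fingerprint index lookup is needed.
        end = pointer.address + pointer.length
        data = bytes(self.log[pointer.address:end])
        if hashlib.sha1(data).digest() != pointer.fingerprint:
            raise ValueError("chunk does not match its fingerprint")
        return data


def encode_file(store: ChunkStore, chunks: Sequence[bytes]) -> list:
    """Encode a file as an ordered list of bi-pointers (a one-level graph)."""
    return [store.put(chunk) for chunk in chunks]


def restore(store: ChunkStore, pointers: Sequence[BiPointer]) -> bytes:
    """Rebuild the file by following the address pointer of each edge."""
    return b"".join(store.get(pointer) for pointer in pointers)
```

In this sketch the hash pointer is used for duplicate detection on write and for integrity checking on read, while the address pointer lets a restore fetch each chunk without consulting a fingerprint index, which is the property the text associates with better read performance when duplicate locality is poor.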

THE STORAGE-EFFICIENT FILE BACKUP
System Architecture
De-duplication Backup Process
Write-Once Storage Policy
BI-POINTER-BASED DIRECTED ACYCLIC GRAPHS
The Structure of BP-DAGs
BP-DAGs Building
Restoring Files from BP-DAGs
EXPERIMENTAL EVALUATION
System Setup
Results and Discussions
CONCLUSIONS
