Abstract

Storage systems today hold vast amounts of duplicated or redundant data. Data deduplication eliminates multiple copies of the same file as well as duplicated segments, or chunks, within those files; it has therefore become an active research area in storage environments, particularly for persistent storage in data centers. Many deduplication mechanisms have been proposed to save storage space efficiently. A key open issue is avoiding full-chunk indexing when determining whether incoming data is new, since scanning the entire index is time-consuming. In this paper, we propose an efficient indexing mechanism for this problem that exploits the properties of the B+ tree. Our proposed system first splits each file into variable-length chunks using the Two Thresholds Two Divisors (TTTD) chunking algorithm. ChunkIDs are then obtained by applying a hash function to the chunks, and the resulting ChunkIDs are used as indexing keys in a B+ tree-like index structure. The search time for duplicate file chunks is thus reduced from O(n) to O(log n), avoiding the cost of full-chunk indexing.
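The Python sketch below illustrates the pipeline the abstract describes, under stated assumptions: the TTTD parameters (Tmin = 460, Tmax = 2800, main divisor 540, backup divisor 270) are commonly cited values rather than this paper's configuration, a simple multiplicative fingerprint stands in for a true rolling hash, SHA-1 is an assumed choice of ChunkID hash, and a sorted Python list with binary search stands in for the B+ tree, which offers the same O(log n) key lookup.

```python
# Minimal sketch of chunk-level deduplication with TTTD chunking and a
# logarithmic-time ChunkID index. Parameter values and the fingerprint
# function are illustrative assumptions, not the paper's exact design.
import hashlib
from bisect import bisect_left

T_MIN, T_MAX = 460, 2800      # chunk-size thresholds (assumed values)
D_MAIN, D_BACKUP = 540, 270   # main and backup divisors (assumed values)

def tttd_chunks(data: bytes):
    """Split data into variable-length chunks, TTTD-style: cut at a
    main-divisor match once T_MIN bytes are seen; if T_MAX is reached
    first, fall back to the last backup-divisor breakpoint."""
    start, n = 0, len(data)
    while start < n:
        backup, h, end = -1, 0, -1
        i = start
        while i < n:
            h = (h * 31 + data[i]) & 0xFFFFFFFF  # toy content fingerprint
            size = i - start + 1
            if size >= T_MIN:
                if h % D_BACKUP == D_BACKUP - 1:
                    backup = i + 1               # remember backup breakpoint
                if h % D_MAIN == D_MAIN - 1:
                    end = i + 1                  # main-divisor breakpoint
                    break
                if size >= T_MAX:
                    end = backup if backup != -1 else i + 1
                    break
            i += 1
        if end == -1:                            # ran off the end of the data
            end = n
        yield data[start:end]
        start = end

def chunk_id(chunk: bytes) -> str:
    """ChunkID = cryptographic hash of the chunk's contents (SHA-1 assumed)."""
    return hashlib.sha1(chunk).hexdigest()

def deduplicate(data: bytes, index: list) -> list:
    """Store only chunks whose ChunkID is not already indexed. Binary
    search over the sorted index mirrors the B+ tree's O(log n) lookup."""
    unique_chunks = []
    for chunk in tttd_chunks(data):
        cid = chunk_id(chunk)
        pos = bisect_left(index, cid)
        if pos == len(index) or index[pos] != cid:
            index.insert(pos, cid)               # new ChunkID: index it
            unique_chunks.append(chunk)
    return unique_chunks

if __name__ == "__main__":
    store = []
    first = deduplicate(b"some file contents ... " * 300, store)
    second = deduplicate(b"some file contents ... " * 300, store)
    assert second == []   # identical data: every chunk already indexed
```

Note that the sorted list only reproduces the B+ tree's logarithmic search; its insertions still shift elements in O(n). A production index would use an actual B+ tree so that insertions, too, stay logarithmic and the index can live partly on disk.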
