Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections

Christopher Hoobin, Justin Zobel, Simon J Puglisi

doi:10.14778/2078331.2078341

Abstract

Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-size blocks and compress each block with a general-purpose adaptive algorithm such as gzip. Random access to a specific document then requires decompression of a whole block. The choice of block size is critical: it trades compression effectiveness against document retrieval time. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods and in general gives much better compression.
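
The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`build_dictionary`, `factorize`, `decode`), the evenly spaced fixed-length sampling, the literal encoding as `(char, 0)`, and the naive `str.find` search are all assumptions made for illustration. A practical system would likely use an index such as a suffix array over the dictionary and a compact encoding of the factors.

```python
def build_dictionary(collection: str, sample_len: int, num_samples: int) -> str:
    """Concatenate fixed-length samples taken at evenly spaced offsets (assumed sampling policy)."""
    step = max(1, len(collection) // num_samples)
    return "".join(collection[k * step:k * step + sample_len] for k in range(num_samples))


def factorize(doc: str, dictionary: str) -> list:
    """Greedily split doc into (position, length) references into the dictionary.

    A character that never occurs in the dictionary is emitted as a literal (char, 0).
    """
    factors = []
    i = 0
    while i < len(doc):
        pos, length = -1, 0
        # Extend the match one character at a time while it still occurs in the dictionary.
        while i + length < len(doc):
            found = dictionary.find(doc[i:i + length + 1])
            if found < 0:
                break
            pos, length = found, length + 1
        if length == 0:
            factors.append((doc[i], 0))  # literal
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(factors: list, dictionary: str) -> str:
    """Rebuild one document from its factors; only the dictionary is needed."""
    return "".join(a if n == 0 else dictionary[a:a + n] for a, n in factors)


if __name__ == "__main__":
    dictionary = "the quick brown fox"
    doc = "quick fox, brown the"
    factors = factorize(doc, dictionary)
    print(factors)
    print(decode(factors, dictionary) == doc)
```

Because each document is encoded only against the shared dictionary, it can be decoded on its own, without decompressing neighbouring documents as a block-based scheme would require.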

Journal: Proceedings of the VLDB Endowment
Publication Date: Nov 1, 2011
Citations: 70

Similar Papers

Effective Construction of Relative Lempel-Ziv Dictionaries
Kewen Liao ... Alistair Moffat
11 Apr 2016

Analyzing the Interplay Between Random Shuffling and Storage Devices for Efficient Machine Learning
Zhi-Lin Ke ... Chia-Lin Yang
01 Mar 2021

RLZAP: Relative Lempel-Ziv with Adaptive Pointers
Anthony J Cox ... Travis Gagie
01 Jan 2015

Overlapping tiling for fast random access of low-dimensional data from high-dimensional datasets
Zihong Fan ... Antonio Ortega
18 Jan 2009
