Compressed data structures: Dictionaries and data-aware measures

Ankur Gupta,Wing-Kai Hon,Rahul Shah,Jeffrey Scott Vitter

doi:10.1016/j.tcs.2007.07.042

Abstract

In this paper, we propose measures for compressed data structures, in which space usage is measured in a data-aware manner. In particular, we consider the fundamental dictionary problem on set data, where the task is to construct a data structure for representing a set S of n items out of a universe U = { 0 , … , u − 1 } and supporting various queries on S . We use a well-known data-aware measure for set data called gap to bound the space of our data structures. We describe a novel dictionary structure that requires gap + O ( n log ( u / n ) / log n ) + O ( n log log ( u / n ) ) bits. Under the RAM model, our dictionary supports membership, rank, and predecessor queries in nearly optimal time, matching the time bound of Andersson and Thorup’s predecessor structure [A. Andersson, M. Thorup, Tight(er) worst-case bounds on dynamic searching and priority queues, in: ACM Symposium on Theory of Computing, STOC, 2000], while simultaneously improving upon their space usage. We support select queries even faster in O ( log log n ) time. Our dictionary structure uses exactly gap bits in the leading term (i.e., the constant factor is 1) and answers queries in near-optimal time. When seen from the worst-case perspective, we present the first O ( n log ( u / n ) ) -bit dictionary structure that supports these queries in near-optimal time under the RAM model. We also build a dictionary which requires the same space and supports membership, select, and partial rank queries even more quickly in O ( log log n ) time. We go on to show that for many (real-world) datasets, data-aware methods lead to a worthwhile compression over combinatorial methods. To the best of our knowledge, these are the first results that achieve data-aware space usage and retain near-optimal time.

Full Text