Data Cube Compression Techniques

Alfredo Cuzzocrea

doi:10.4018/9781605660103.ch058

Abstract

OnLine Analytical Processing (OLAP) research issues (Gray, Chaudhuri, Bosworth, Layman, Reichart & Venkatrao, 1997) such as data cube modeling, representation, indexing and management have traditionally attracted a lot of attention from the Data Warehousing research community. In this respect, as a fundamental issue, the problem of efficiently compressing the data cube plays a leading role, and it has involved a vibrant research activity in the academic as well as industrial world during the last fifteen years. Basically, this problem consists in dealing with massive-in-size data cubes that make access and query evaluation costs prohibitive. The widely accepted solution consists in generating a compressed representation of the target data cube, with the goal of reducing its size (given an input space bound) while admitting loss of information and approximation that are considered irrelevant for OLAP analysis goals (e.g., see (Cuzzocrea, 2005a)). Compressed representations are also referred-in-literature under the term “synopsis data structures”, i.e. succinct representations of original data cubes introducing a limited loss of information. Benefits deriving from the data cube compression approach can be summarized in a relevant reduction of computational overheads needed to both represent the data cube and evaluate resource-intensive OLAP queries, which constitute a wide query class allowing us to extract useful knowledge from huge amounts of multidimensional data repositories in the vest of summarized information (e.g., aggregate information based on popular SQL aggregate operators such as SUM, COUNT, AVG etc), otherwise infeasible by means of traditional OLTP approaches. Among such queries, we recall: (i) range-queries, which extract a sub-cube bounded by a given range; (ii) top-k queries, which extract the k data cells (such that k is an input parameter) having the highest aggregate values; (iii) iceberg queries, which extract the data cells having aggregate values above a given threshold. This evidence has given raise to the proliferation of a number of Approximate Query Answering (AQA) techniques, which, based on data cube compression, aim at providing approximate answers to resource-intensive OLAP queries instead of computing exact answers, as decimal precision is usually negligible in OLAP query and report activities (e.g., see (Cuzzocrea, 2005a)). Starting from these considerations, in this article we first provide a comprehensive, rigorous survey of main data cube compression techniques, i.e. histograms, wavelets, and sampling. Then, we complete our analytical contribution with a detailed theoretical review focused on spatio-temporal complexities of these techniques.

Full Text