Trend analysis is a fundamental type of analytical query in online analytical processing (OLAP) systems. A key step in trend analysis is to identify, for further investigation, the k attributes whose distributions differ most between two subsets selected by different predicates, where the difference is measured by a metric function. The exact solution requires scanning all records, which is prohibitively expensive on large datasets. To reduce unnecessary data access, the state-of-the-art solution TopKAttr uses sampling to avoid the full scan. However, it has two main drawbacks. First, it is tailored to only two metric functions, the Earth Mover's distance and the Euclidean distance, and cannot be generalized to more complicated metrics. Second, it still aims to return the exact top-k answers from the samples, which incurs high running costs, as shown in our experiments. Motivated by these limitations, we propose a general approximation framework for attribute recommendation that efficiently returns the top-k attributes with theoretical guarantees while supporting a wide range of metric functions, including the Kolmogorov-Smirnov test (KS-test), the Chebyshev distance, the Earth Mover's distance, and the Euclidean distance, with the potential to extend to further metrics. The key to our framework is a new bound estimation strategy that applies to this wide spectrum of metrics. Building on this estimation strategy, we further devise an efficient approximation algorithm with theoretical guarantees to answer top-k queries, which are widely used in attribute recommendation. Extensive experiments on four large real-world datasets show that our framework achieves up to an order-of-magnitude speed-up over TopKAttr while maintaining consistently high accuracy, providing a promising alternative for attribute recommendation in OLAP systems.
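To make the problem statement concrete, the following is a minimal sketch of the exact full-scan baseline that the abstract describes as prohibitively expensive. It is neither the proposed approximation framework nor TopKAttr. All names (`histogram`, `exact_top_k`, the sample records) are hypothetical, attributes are assumed to be categorical or discretized, and the KS statistic is computed over the sorted distinct values of each attribute.

```python
from collections import Counter
import heapq


def histogram(rows, attr):
    """Normalized frequency distribution of `attr` over `rows`."""
    counts = Counter(r[attr] for r in rows)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}


def euclidean(p, q):
    """Euclidean distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys) ** 0.5


def chebyshev(p, q):
    """Largest per-value difference in probability mass."""
    keys = set(p) | set(q)
    return max((abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys), default=0.0)


def ks_statistic(p, q):
    """Largest gap between the two cumulative distributions."""
    cum_p = cum_q = gap = 0.0
    for k in sorted(set(p) | set(q)):
        cum_p += p.get(k, 0.0)
        cum_q += q.get(k, 0.0)
        gap = max(gap, abs(cum_p - cum_q))
    return gap


def exact_top_k(records, attrs, pred_a, pred_b, k, metric):
    """Scan every record, score each attribute, return the k highest-scoring."""
    subset_a = [r for r in records if pred_a(r)]
    subset_b = [r for r in records if pred_b(r)]
    scores = {
        a: metric(histogram(subset_a, a), histogram(subset_b, a)) for a in attrs
    }
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical toy data: `region` shifts between the two years, `tier` does not.
records = [
    {"year": 2020, "region": "N", "tier": 1},
    {"year": 2020, "region": "N", "tier": 2},
    {"year": 2021, "region": "S", "tier": 1},
    {"year": 2021, "region": "S", "tier": 2},
]
top = exact_top_k(
    records,
    ["region", "tier"],
    lambda r: r["year"] == 2020,
    lambda r: r["year"] == 2021,
    k=1,
    metric=chebyshev,
)
print(top)  # ['region']
```

Every call touches all records once per predicate and once per attribute, which is the cost that sampling-based approaches aim to avoid.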