ProbSky: Efficient Computation of Probabilistic Skyline Queries over Distributed Data

Ai-Te Kuo, Wei-Shinn Ku, Liang Tang, Xiao Qin, Haiquan Chen

doi:10.1109/tkde.2022.3151740

Abstract

Skyline queries have drawn great interest and are widely used in application domains including multi-criteria decision making, search pruning, and personalized recommendation systems. Given multiple criteria, a skyline query returns the objects that are not dominated by any other object. As an extension of traditional skyline queries, probabilistic skyline queries cope with uncertain datasets. This paper presents a novel MapReduce-based framework, ProbSky, that supports fast parallel evaluation of probabilistic skyline queries on large high-dimensional data. ProbSky efficiently evaluates exact p-skyline queries on large uncertain data without compromising the quality of query results. On the theoretical side, we formally prove two pruning lemmas, which ProbSky integrates to strengthen its early pruning capacity. ProbSky builds on three optimization techniques: dominant instance pruning, grid-based partitioning, and pivot point-based acceleration. Extensive experiments on both real and synthetic datasets show that, compared with the state of the art, ProbSky speeds up the evaluation of exact p-skyline queries on large high-dimensional data by at least one order of magnitude in most cases. Our experimental results also validate that, by balancing memory consumption and execution time among machines, ProbSky curbs the bottleneck effect that causes severe system performance deterioration.
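To make the terms in the abstract concrete, the following is a minimal brute-force sketch of dominance, a traditional skyline, and an exact p-skyline. It is not the ProbSky algorithm: it uses no pruning, partitioning, pivot points, or MapReduce. It assumes the commonly used discrete uncertainty model, in which each uncertain object is a set of equally likely instances and smaller values are preferred in every dimension. Under that assumption, the skyline probability of an instance is the product, over all other objects, of the fraction of that object's instances that do not dominate it, and an object's skyline probability is the average over its instances. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: a is no worse in every dimension and strictly
    better in at least one (assumption: smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Traditional skyline: points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_probabilities(objects: Dict[str, List[Point]]) -> Dict[str, float]:
    """Skyline probability of each uncertain object, assuming every
    instance of an object is equally likely (illustrative model)."""
    probs: Dict[str, float] = {}
    for name, instances in objects.items():
        total = 0.0
        for u in instances:
            p_u = 1.0
            for other, other_instances in objects.items():
                if other == name:
                    continue  # instances of the same object never compete
                dominating = sum(dominates(v, u) for v in other_instances)
                # Probability that `other` does not dominate instance u.
                p_u *= 1.0 - dominating / len(other_instances)
            total += p_u
        probs[name] = total / len(instances)
    return probs


def p_skyline(objects: Dict[str, List[Point]], p: float) -> List[str]:
    """Exact p-skyline: objects whose skyline probability is at least p."""
    probs = skyline_probabilities(objects)
    return sorted(name for name, pr in probs.items() if pr >= p)


if __name__ == "__main__":
    # Hypothetical data: three uncertain objects in two dimensions.
    data = {
        "A": [(1, 1), (3, 3)],
        "B": [(2, 2)],
        "C": [(4, 4), (5, 5)],
    }
    print(skyline_probabilities(data))
    print(p_skyline(data, 0.5))
```

This sketch compares every instance against every instance of every other object, so its cost grows quadratically with the total number of instances. That cost is what early pruning and balanced partitioning of the kind described in the abstract are meant to reduce.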

Full Text