Top-k ranking with uncertain data

Chonghai Wang

doi:10.7939/r3ks36

Abstract

The goal of top-k ranking is to rank individuals so that the best k of them can be determined. Depending on the application domain, an individual can be a person, a product, an event, or just a collection of data or information for which an ordering makes sense. The problem of top-k ranking has profound commercial and social implications. In the context of databases, top-k ranking has been studied in two distinct directions, depending on whether the stored information is certain or uncertain. In the former case, since there is little dispute about what top-k ranking means, past research has focused on efficient query processing. In the latter case, a number of semantics based on possible worlds have been proposed, and computational mechanisms investigated, for what are called uncertain databases or probabilistic databases, where a tuple is associated with a membership probability indicating the level of confidence in the stored information. In this thesis, we study top-k ranking with uncertain data in two general areas. The first is pruning for the computation of top-k tuples in a probabilistic database. We investigate the theoretical basis and practical means of pruning for the recently proposed unifying framework based on parameterized ranking functions. As such, our results are applicable to a wide range of ranking functions. We show experimentally that pruning can yield performance gains of orders of magnitude. In the second area of our investigation, we study the problem of top-k ranking for objects with multiple attributes whose values are modeled by probability distributions and constraints. We formulate a theory of top-k ranking for objects by characterizing what constitutes the strength of an object, and show that a number of previous proposals for top-k ranking are special cases of our theory. We carry out a limited study of the computation of top-k objects under our theory.
We reveal a close connection between top-k ranking in this context and the high-dimensional spaces studied in mathematics. In particular, the problem of computing the volume of a high-dimensional polyhedron expressed by linear inequalities is a special case of top-k ranking of objects; as such, the algorithms formulated for the former can be employed for the latter under the same conditions.
