Finding frequent items in probabilistic data

Qin Zhang, Ke Yi, Feifei Li

doi:10.1145/1376616.1376698

Abstract

Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identify the frequent items in a probabilistic data set is to simply compute the expected frequency of an item and decide if it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. Thus, we propose a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items based on confidence as well.

Finding likely frequent items in probabilistic data turns out to be much more difficult. We first propose exact algorithms for offline data with either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data to find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of our algorithms using both real and synthetic data sets with extensive experimental evaluations.
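The contrast between the two definitions in the abstract can be illustrated with a small brute-force sketch. It is not the paper's algorithm: it assumes a simple uncertainty model in which each tuple is an (item, probability) pair that appears independently, takes "frequent" to mean a count strictly above a fraction phi of the world's size, and enumerates every possible world, which is only feasible for a handful of tuples. The function names `expected_frequent` and `frequent_probability` are hypothetical.

```python
from itertools import product


def expected_frequent(tuples, phi):
    """Items whose expected count exceeds phi times the expected data set size."""
    expected_size = sum(p for _, p in tuples)
    expected_count = {}
    for item, p in tuples:
        expected_count[item] = expected_count.get(item, 0.0) + p
    return {item for item, c in expected_count.items() if c > phi * expected_size}


def frequent_probability(tuples, phi):
    """For each item, the probability that it is frequent in a random possible world.

    Assumption: tuples appear independently, so a world's probability is the
    product of p (tuple present) or 1 - p (tuple absent) over all tuples.
    """
    prob = {item: 0.0 for item, _ in tuples}
    # Enumerate all 2^n presence/absence patterns (exponential; toy inputs only).
    for world in product([False, True], repeat=len(tuples)):
        weight = 1.0
        counts = {}
        for present, (item, p) in zip(world, tuples):
            weight *= p if present else 1.0 - p
            if present:
                counts[item] = counts.get(item, 0) + 1
        size = sum(counts.values())
        for item, c in counts.items():
            if c > phi * size:
                prob[item] += weight
    return prob


# Toy input: one fairly certain "a" and three uncertain copies of "b".
data = [("a", 0.9), ("b", 0.4), ("b", 0.4), ("b", 0.4)]
print(expected_frequent(data, 0.5))
print(frequent_probability(data, 0.5))
```

On this toy input, "b" passes the expected-frequency test, yet under the stated assumptions it is frequent in worlds carrying less than half of the total probability, which is the kind of information the expectation-based definition hides. Thresholding the probabilities returned by `frequent_probability` gives one way to read "likely frequent", and sorting by them corresponds to ranking items by confidence.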

