Random sampling techniques for space efficient online computation of order statistics of large datasets

Gurmeet Singh Manku, Bruce G Lindsay, Sridhar Rajagopalan

doi:10.1145/304181.304204

Abstract

In a recent paper [MRL98], we described a general framework for single-pass approximate quantile-finding algorithms. This framework included several known algorithms as special cases. Within the framework, we identified a new algorithm that requires significantly less main memory than other known algorithms. In this paper, we address two issues left open in that earlier work. First, all known space-efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. We present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm that computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm that estimates extreme values using less space than the earlier, more general technique for computing all quantiles requires. Our principal observation here is that random sampling is quantifiably more effective for estimating extreme values than for estimating the median.
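The intuition behind the extreme-value observation can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, only one plausible reading of the idea: sample each element independently with some probability and retain just the k largest sampled values, so that memory grows with k and not with the sample size. The function name `estimate_extreme_quantile`, the parameters `n`, `phi`, `sample_prob`, and `seed`, and the choice of k are all assumptions made for illustration; in particular, the sketch assumes the stream length `n` is known.

```python
import heapq
import math
import random


def estimate_extreme_quantile(stream, n, phi, sample_prob, seed=None):
    """Estimate the phi-quantile (phi close to 1) of a stream of n values.

    Hypothetical illustration: uniform Bernoulli sampling plus a bounded
    min-heap. Returns None if no element happened to be sampled.
    """
    rng = random.Random(seed)
    # The expected sample size is sample_prob * n, so the phi-quantile of the
    # sample is roughly its k-th largest element for the k computed here.
    # round() guards against floating-point noise such as 10.000000000000009.
    k = max(1, math.ceil(round((1.0 - phi) * sample_prob * n, 9)))
    top_k = []  # min-heap holding the k largest sampled values seen so far
    for x in stream:
        if rng.random() >= sample_prob:
            continue  # element not selected for the sample
        if len(top_k) < k:
            heapq.heappush(top_k, x)
        elif x > top_k[0]:
            heapq.heapreplace(top_k, x)  # evict the smallest retained value
    # The heap root is the k-th largest sampled value (or the smallest
    # retained value if fewer than k elements were sampled).
    return top_k[0] if top_k else None


estimate = estimate_extreme_quantile(
    range(100_000), n=100_000, phi=0.99, sample_prob=0.1, seed=0
)
print(estimate)
```

In this sketch the retained state is only k values, roughly (1 - phi) times the expected sample size, which is why a quantile near the top needs far less memory than the median would under the same sampling rate.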

Journal: ACM SIGMOD Record
Publication Date: Jun 1, 1999
Citations: 116
