Optimum algorithms for two random sampling problems

Jeffrey Scott Vitter

doi:10.1109/sfcs.1983.43

Abstract

Several fast new algorithms are presented for sampling n records at random from a file containing N records. The first problem we solve deals with sampling when N is known, and the the second problem considers the case when N is unknown. The two main results in this paper are Algorithms D and Z. Algorithm D solves the first problem by doing the sampling with a small constant amount of space and in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations are performed during the sampling The sample is selected sequentially and online; it answers an open problem in [Knuth 81]. Algorithm Z solves the second problem by doing the sampling using O(n) space, roughly n ln(N/n) uniform random variates and O(n(1 + log(N/n))) time, on the average. Both algorithms are time- and space-optimum and are short and easy to implement.

Full Text