Abstract

In data stream mining, a stream is a dataset of unknown size whose elements arrive continuously. It is typically too large for the processing computer to hold in memory in its entirety, and each element can be read only once and only in order. Classical sampling methods such as simple random sampling (SRS), stratified sampling, and cluster sampling cannot be applied to stream data because the entire dataset is never available at once and data cannot be reread. Vitter’s (1985) Algorithm R is a reservoir sampling method that selects an SRS from a data stream. In this article, we propose Algorithm SR, which extends Algorithm R to a stratified reservoir sampling method with optimal allocation. We prove that the proposed method is asymptotically equivalent to classical stratified random sampling with optimal allocation. Implementation results show that the proposed method is efficient and can outperform Algorithm R.
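For readers unfamiliar with reservoir sampling, the following is a minimal Python sketch of the standard textbook form of Algorithm R, the baseline that the abstract builds on. It is not the proposed Algorithm SR, whose stratification and allocation steps are not described here. The function name `reservoir_sample` and its parameters are illustrative assumptions, not taken from the article.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw a simple random sample of size k from a stream in one pass.

    Hypothetical helper illustrating textbook Algorithm R; only the
    k-element reservoir is kept in memory, never the whole stream.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Element i (0-based) replaces a random slot with
            # probability k / (i + 1); randint is inclusive on both ends.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each element is read once and in order, so the sketch matches the stream constraints described above. If the stream has fewer than `k` elements, all of them are returned.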
