Density biased sampling

Christopher R. Palmer,Christos Faloutsos

doi:10.1145/335191.335384

Abstract

Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice and is also frequently criticized because it can miss small clusters. Many natural phenomena are known to follow Zipf's distribution, so the inability of uniform sampling to find small clusters is of practical concern. Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory-efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions, finding up to a factor of six improvement over uniform sampling.
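
The idea of under-sampling dense regions, over-sampling light regions, and weighting the result can be illustrated with a small sketch. Everything specific here is an assumption rather than something the abstract states: density is estimated by counting points per grid cell, a point in a cell holding n points is kept with probability proportional to n^(-e), and the names `density_biased_sample`, `cell_size`, `target_size`, and `e` are hypothetical. The sketch also makes two passes over the data for clarity, so it does not reproduce the single-scan, memory-efficient approximation the abstract mentions.

```python
import random
from collections import Counter


def density_biased_sample(points, cell_size, target_size, e=0.5, seed=0):
    """Return a list of (point, weight) pairs.

    Hypothetical helper: grid binning and the n**(-e) rule are
    illustrative assumptions, not taken from the abstract.
    """
    rng = random.Random(seed)

    def cell(p):
        # Assumption: a uniform grid approximates local density.
        return tuple(int(c // cell_size) for c in p)

    # Pass 1: count how many points fall in each grid cell.
    counts = Counter(cell(p) for p in points)

    # Scale factor so the expected sample size is about target_size
    # (slightly less if any probability is clipped to 1 below).
    alpha = target_size / sum(n ** (1.0 - e) for n in counts.values())

    # Pass 2: keep each point with a probability that shrinks as its
    # cell gets denser, and weight it by the inverse probability so
    # weighted counts still estimate the original densities.
    sample = []
    for p in points:
        n = counts[cell(p)]
        prob = min(1.0, alpha * n ** (-e))
        if rng.random() < prob:
            sample.append((p, 1.0 / prob))
    return sample
```

With `e=0` every point is kept with the same probability, which is plain uniform sampling; larger values of `e` shift more of the sample toward sparse cells, and the inverse-probability weights compensate for that shift when densities are estimated from the sample.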

Journal: ACM SIGMOD Record
Publication Date: May 16, 2000
Citations: 100

Similar Papers

Density Biased Sampling with Locality Sensitive Hashing for Outlier Detection
Xuyun Zhang ... Qiang He
01 Jan 2018

Efficient biased sampling for approximate clustering and outlier detection in large data sets
G Kollios ... N Koudas
IEEE Transactions on Knowledge and Data Engineering | VOL. 15
01 Sep 2003

An efficient approximation scheme for data mining tasks
G Kollios ... N Koudas
02 Apr 2001

An Efficient Density Biased Sampling Algorithm for Clustering Large High-Dimensional Datasets
... Qin Wu
International Journal of Pattern Recognition and Artificial Intelligence | VOL. 29
22 Nov 2015
