Efficient biased sampling for approximate clustering and outlier detection in large data sets

G Kollios, S Berchtold, D Gunopulos, N Koudas

doi:10.1109/tkde.2003.1232271

Abstract

We investigate the use of biased sampling according to the density of the data set to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
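
The idea that a point's inclusion probability depends on local density can be illustrated with a minimal sketch. The abstract does not specify how density is estimated or how the bias is parameterized, so everything below is an assumption made for illustration: local density is approximated by counting points in the same grid cell, and the bias is controlled by a hypothetical `exponent` parameter (0 gives uniform sampling, positive values favor dense regions, negative values favor sparse regions where outliers are more likely). The function name `density_biased_sample` and the parameters `expected_size` and `cell_width` are likewise invented here and are not taken from the paper.

```python
import random
from collections import Counter


def density_biased_sample(points, expected_size, exponent, cell_width=1.0, seed=0):
    """Keep each point with probability proportional to density ** exponent.

    Assumption of this sketch: local density is approximated by the number
    of points that fall in the same axis-aligned grid cell.
    """
    rng = random.Random(seed)

    def cell_of(point):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(int(x // cell_width) for x in point)

    cells = [cell_of(p) for p in points]
    counts = Counter(cells)

    # Unnormalized weight of each point: (cell count) ** exponent.
    weights = [counts[c] ** exponent for c in cells]
    total = sum(weights)

    sample = []
    for point, weight in zip(points, weights):
        # Scale weights so the expected sample size is roughly
        # expected_size, capping each probability at 1.
        prob = min(1.0, expected_size * weight / total)
        if rng.random() < prob:
            sample.append(point)
    return sample


if __name__ == "__main__":
    rng = random.Random(1)
    # One dense cluster in the unit square plus 20 isolated points.
    dense = [(rng.random(), rng.random()) for _ in range(1000)]
    sparse = [(10.0 + 5 * i, 10.0 + 5 * i) for i in range(20)]
    data = dense + sparse

    # A negative exponent biases the sample toward sparse regions.
    outlier_biased = density_biased_sample(data, expected_size=42, exponent=-1)
    kept_sparse = sum(1 for p in outlier_biased if p in sparse)
    print(len(outlier_biased), kept_sparse)
```

With a negative exponent, the isolated points dominate the sample even though they make up a small fraction of the data, which is the behavior one would want before running an outlier detector on the sample. A positive exponent would instead concentrate the sample in the dense cluster.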

Journal: IEEE Transactions on Knowledge and Data Engineering
Publication Date: Sep 1, 2003
Citations: 231

Similar Papers

An efficient approximation scheme for data mining tasks
G Kollios ... N Koudas
02 Apr 2001

Density biased sampling
Christopher R. Palmer ... Christos Faloutsos
ACM SIGMOD Record | VOL. 29
16 May 2000

A parallel point cloud clustering algorithm for subset segmentation and outlier detection
Christian Teutsch ... Mark R Shortis
09 Jun 2011

An Efficient Density Biased Sampling Algorithm for Clustering Large High-Dimensional Datasets
... Qin Wu
International Journal of Pattern Recognition and Artificial Intelligence | VOL. 29
22 Nov 2015
