Abstract

Data clustering is a fundamental problem arising in many practical applications. In this paper, we present new geometric approximation and exact algorithms for the density-based data clustering problem in d-dimensional space ℝ^d (for any constant integer d ≥ 2). Previously known algorithms for this problem are efficient only when the specified range around each input point, called the δ-neighborhood, contains on average a constant number of input points. The distribution of the input data points therefore has a significant impact on the efficiency of these algorithms. In the worst case, when the data points are highly clustered, these algorithms run in quadratic time, although such situations might not occur very frequently on real data. By using computational geometry techniques, we develop faster approximation and exact algorithms for the density-based data clustering problem in ℝ^d. In particular, our approximation algorithm based on the ∊-fuzzy distance function takes O(n log n) time for any given fixed value ∊ > 0, and our exact algorithms take sub-quadratic time. The running times and output quality of our algorithms do not depend on any particular data distribution. We believe that our fast approximation algorithm is of considerable practical importance, while our sub-quadratic exact algorithms are more of theoretical interest. We implemented our approximation algorithm, and the experimental results show that it is efficient on arbitrary input point sets.
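To make the quadratic worst case concrete, the following is a minimal sketch of a naive density-based clustering baseline in the style of DBSCAN. It is not the approximation or exact algorithm of this paper. The names `delta_neighborhood` and `density_cluster`, and the density threshold `min_pts`, are assumptions introduced only for illustration. Each δ-neighborhood query scans all n points, so the total cost is O(n²) regardless of how the points are distributed.

```python
from math import dist  # Euclidean distance, Python 3.8+


def delta_neighborhood(points, i, delta):
    """Indices of all points within distance delta of points[i] (brute force, O(n))."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= delta]


def density_cluster(points, delta, min_pts):
    """Naive DBSCAN-style labelling (illustrative assumption, not the paper's method).

    Returns one label per point: a cluster id 0, 1, ... or -1 for noise.
    """
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = delta_neighborhood(points, i, delta)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        labels[i] = cluster
        stack = [j for j in neighbors if j != i]
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            nb = delta_neighborhood(points, j, delta)
            if len(nb) >= min_pts:
                stack.extend(nb)  # j is dense enough to keep growing the cluster
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(density_cluster(pts, delta=1.5, min_pts=3))
```

Replacing the brute-force scan in `delta_neighborhood` with a spatial index speeds up sparse inputs, but when many points fall inside each δ-neighborhood the query output alone can be linear in n, which is the distribution-dependent behavior the abstract refers to.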
