Abstract

The concept of a proximity curve and a new algorithm are proposed for obtaining clusters in a finite set of data points in finite-dimensional Euclidean space. Each point is endowed with a potential constructed by means of a multivariate Cauchy density, contributing to an overall anisotropic potential function. Guided by the steepest descent algorithm, the data points are successively visited and removed one by one; at each stage the overall potential is updated and the magnitude of its local gradient is calculated. The result is a finite sequence of tuples, the proximity curve, whose pattern is analysed to give rise to a deterministic clustering. The finite set of all such proximity curves, in conjunction with a simulation study of their distribution, results in a probabilistic clustering represented by a distribution on the set of dendrograms. A two-dimensional synthetic data set is used to illustrate the proposed potential-based clustering idea. The results achieved are shown to be plausible, since both the ‘geographic distribution’ of the data points and the ‘topographic features’ imposed by the potential function are well reflected in the suggested clustering. For validation, experiments are conducted on the Iris data set, a standard classification and clustering benchmark. The results are consistent with the proposed theoretical framework and the properties of the data, and they open new avenues for processing data from different perspectives and for interpreting the contribution of data attributes to the discovered patterns.
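The pipeline described in the abstract can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it substitutes an isotropic Cauchy-type kernel for the paper's anisotropic multivariate Cauchy densities, and it visits the nearest remaining point at each step as a stand-in for the actual steepest-descent guidance; the function names (`potential`, `grad_norm`, `proximity_curve`) and the parameter `gamma` are illustrative choices, not taken from the paper.

```python
import numpy as np

def potential(x, points, gamma=1.0):
    # Overall potential at x: a sum of isotropic Cauchy-type kernels centred
    # at the remaining data points (an illustrative stand-in for the paper's
    # anisotropic multivariate Cauchy densities).
    d2 = np.sum((points - x) ** 2, axis=1)
    return float(np.sum(1.0 / (1.0 + d2 / gamma ** 2)))

def grad_norm(x, points, gamma=1.0, h=1e-5):
    # Magnitude of the local gradient of the potential, by central differences.
    g = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[k] = h
        g[k] = (potential(x + e, points, gamma)
                - potential(x - e, points, gamma)) / (2.0 * h)
    return float(np.linalg.norm(g))

def proximity_curve(points, gamma=1.0):
    # Visit and remove the points one by one, starting from the point of
    # highest potential; after each removal the potential is implicitly
    # updated (it is always computed over the remaining points) and a tuple
    # (visited index, local gradient magnitude) is recorded.  Here the next
    # point visited is simply the nearest remaining one.
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))
    current = max(remaining, key=lambda i: potential(points[i], points, gamma))
    curve = []
    while remaining:
        remaining.remove(current)
        if remaining:
            rest = points[remaining]
            curve.append((current, grad_norm(points[current], rest, gamma)))
            d2 = np.sum((rest - points[current]) ** 2, axis=1)
            current = remaining[int(np.argmin(d2))]
        else:
            curve.append((current, 0.0))
    return curve

# Example: two well-separated 2-D clusters.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.2, (5, 2)),
                 rng.normal(5.0, 0.2, (5, 2))])
curve = proximity_curve(pts)
```

On data of this kind the traversal exhausts one cluster before jumping to the other, and the pattern of recorded gradient magnitudes along the visit order is what the proximity curve is meant to expose.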

Highlights

  • Clustering algorithms based on, and motivated by, gravitational models have a history stretching back to the 1970s; a few relevant highlights are reviewed to place this work into context and point out its distinguishing features

  • The new concept of the proximity curve has been introduced and used for finding clusterings in finite sets of data points in the Euclidean space endowed with individual potentials derived from a multivariate Cauchy distribution

  • The evaluation of the algorithm on Iris data set case studies, reported in Section 4, shows that this new approach, which uses data features’ contributions to the global field potential to discover clusters, is promising and valuable


Introduction

The history of work on clustering algorithms based on, and motivated by, gravitational (and electrostatic) models stretches back to the 1970s. Such algorithms are too numerous to be reviewed exhaustively here; instead, a few relevant highlights are reviewed below, thereby placing this work into context and pointing out its distinguishing features. Starting with a set of data points where each point has a potential, Shi et al. (2002) present an agglomerative hierarchical clustering algorithm based on a comparison of distances between already defined clusters. The distance used in Shi et al. (2002) is calculated as a weighted function of the differences between potentials.

