Clustering with Proximity Graphs

Michail Kazimianec, Nikolaus Augsten

doi:10.4018/ijkbo.2013100105

Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected based on the proximity graph. However, computing the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate it using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques such as k-means, density-based, or hierarchical clustering. In this article, the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors experimentally show that their solutions are faster even if the sampling-based algorithms use very small sample sizes. The authors provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.
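To make the idea of detecting a cluster border from a proximity graph more concrete, here is a minimal Python sketch. It is a simplified illustration, not the GPC, PG-DS, or PG-SM algorithms from the article. It assumes that similarity is measured by the number of shared padded q-grams, that the proximity graph maps each overlap threshold k to the number of strings sharing at least k q-grams with a center string, and that the border lies on the longest plateau of that graph. The function names (qgrams, proximity_graph, longest_plateau) and the plateau rule are hypothetical choices for this example, and the article's own definitions may differ.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s, padded with '#' on both sides."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def proximity_graph(center, strings, q=2):
    """Map each threshold k to the number of strings that share
    at least k q-grams with the center string (assumed definition)."""
    center_grams = qgrams(center, q)
    overlaps = [len(center_grams & qgrams(s, q)) for s in strings]
    return {
        k: sum(1 for o in overlaps if o >= k)
        for k in range(len(center_grams), 0, -1)
    }


def longest_plateau(graph):
    """Return (k_high, k_low, size) for the longest run of thresholds
    with an unchanged neighborhood size (hypothetical border rule)."""
    best = None
    run_start = None
    prev_size = None
    for k in sorted(graph, reverse=True):
        size = graph[k]
        if size != prev_size:
            # The neighborhood size changed, so a new plateau starts here.
            run_start = k
            prev_size = size
        run_len = run_start - k + 1
        if best is None or run_len > best[0]:
            best = (run_len, run_start, k, size)
    return best[1:]


strings = ["clustering", "clustring", "cluster", "graph", "grap"]
graph = proximity_graph("clustering", strings)
k_high, k_low, size = longest_plateau(graph)
print(f"plateau from k={k_high} down to k={k_low}, cluster size {size}")
```

In this toy input, the neighborhood of "clustering" stops growing once "clustring" and "cluster" have been added, so the plateau separates these three strings from "graph" and "grap". Note that the sketch recomputes q-gram overlaps for every string, which is the kind of cost the exact algorithms in the article aim to reduce.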


Journal: International Journal of Knowledge-Based Organizations
Publication Date: Oct 1, 2013
Citations: 3

Similar Papers

Exact and Efficient Proximity Graph Computation
Michail Kazimianec ... Nikolaus Augsten
01 Jan 2009

PG-Skip: Proximity Graph Based Clustering of Long Strings
Michail Kazimianec ... Nikolaus Augsten
01 Jan 2010

Clustering of Short Strings in Large Databases
Michail Kazimianec ... Arturas Mazeika
01 Jan 2009

Fast and Exact Outlier Detection in Metric Spaces
Daichi Amagata ... Makoto Onizuka
09 Jun 2021
