dSimpleGraph: A Novel Distributed Clustering Algorithm for Exploring Very Large Scale Unknown Data Sets

Li Lu, Robert Grossman, Yunhong Gu

doi:10.1109/icdmw.2010.12

Abstract

Current clustering applications face three major challenges. First, some data sets are so large that they cannot be loaded into memory in their entirety for clustering. Second, data sets are often distributed across different locations for various reasons, which makes it impossible to process them centrally. Third, without prior knowledge of an unknown data set, it is difficult to choose appropriate parameters for existing clustering algorithms. A distributed clustering algorithm with few parameters is therefore appealing. Although several distributed clustering algorithms have been proposed, solving all of these problems at once remains a challenge for them. In this paper, we propose and implement a novel micro-cluster-based distributed clustering algorithm called dSimpleGraph. We define an equivalence relation on pairs of micro-clusters. Relying on this relation, dSimpleGraph clusters data efficiently on local machines and easily generates a deterministic global view from the local views. Only two scalar parameters are needed, and the generated clusters can have any shape. Its MapReduce-style structure allows it to be implemented easily on existing distributed computing platforms. Extensive experimental studies show that dSimpleGraph is very fast and well suited to exploring very large scale unknown data sets.
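
The abstract does not spell out the equivalence relation or name the two parameters, so the following Python sketch only illustrates the general idea and is not the authors' algorithm. It makes several assumptions: micro-clusters are taken to be grid cells, two micro-clusters are directly related when their cells are adjacent, and the two scalar parameters are the hypothetical `cell_size` and `min_points`. The map step summarises each machine's points; the reduce step merges the summaries and labels the equivalence classes (connected components) with a union-find structure.

```python
from collections import Counter
from itertools import product


def local_view(points, cell_size):
    """Map step (assumed design): count one machine's points per grid cell."""
    return Counter(tuple(int(c // cell_size) for c in p) for p in points)


def global_clusters(local_views, min_points):
    """Reduce step (assumed design): merge local views, then label each
    dense cell with the smallest cell of its equivalence class."""
    total = Counter()
    for view in local_views:
        total.update(view)  # Counter.update adds counts together

    # Hypothetical density rule: keep cells with at least min_points points.
    dense = sorted(cell for cell, n in total.items() if n >= min_points)
    parent = {cell: cell for cell in dense}

    def find(cell):
        # Union-find lookup with path halving.
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for cell in dense:
        # Adjacent dense cells are treated as related (an assumption);
        # the union-find closure makes the relation transitive.
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbour = tuple(a + b for a, b in zip(cell, offset))
            if neighbour in parent:
                root_a, root_b = find(cell), find(neighbour)
                if root_a != root_b:
                    # Always keep the smaller root so labels are stable.
                    parent[max(root_a, root_b)] = min(root_a, root_b)

    return {cell: find(cell) for cell in dense}


machine_a = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4)]
machine_b = [(0.5, 0.5), (1.5, 0.5), (5.1, 5.2), (5.3, 5.4), (5.5, 5.5)]
views = [local_view(machine_a, 1.0), local_view(machine_b, 1.0)]
labels = global_clusters(views, min_points=2)
```

In this sketch, cells `(0, 0)` and `(1, 0)` end up in one cluster and `(5, 5)` in another, and the labels do not depend on the order in which the local views are merged. Because clusters are unions of adjacent cells, they can take arbitrary shapes at the resolution of `cell_size`.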


