A categorical data clustering algorithm and its efficient parallel implementation

Xiangwu Ding, Jia Tan, Mei Wang

doi:10.1109/iccsnt.2016.8070153

Abstract

For clustering large-scale, high-dimensional, sparse categorical data, CLOPE offers a great improvement in clustering quality and running speed over traditional clustering algorithms. However, CLOPE has some defects: its clustering quality is unstable, it does not distinguish the clustering contribution of attributes across dimensions, and it requires the rejection factor r to be specified in advance. Therefore, this paper proposes a clustering algorithm for categorical data based on random sequence iteration and attribute weights (RW-CLOPE). RW-CLOPE uses a “shuffle” model to randomly reorder the raw data, eliminating the effect of the data input sequence on clustering quality. At the same time, a method for calculating attribute weights based on attribute entropy is proposed to distinguish the clustering contribution of each dimension, which greatly improves the quality of data clustering. Finally, RW-CLOPE is implemented on the efficient cluster platform Spark. Experiments on two different real databases show that RW-CLOPE achieves better clustering quality than p-CLOPE when the number of datasets is the same. For the mushrooms dataset, when CLOPE obtains its best results, RW-CLOPE achieves a 68% larger profit value than CLOPE and a 25% larger profit value than p-CLOPE. The execution time of RW-CLOPE is much shorter than that of p-CLOPE when dealing with massive data. When enough computing resources are available, the more shuffled copies of the data there are, the more obvious the improvement in execution time.
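The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use Spark. The profit function follows the standard CLOPE definition (cluster size S, width W, factor r). The entropy helper computes plain Shannon entropy per attribute; how RW-CLOPE maps entropy to an attribute weight is not given in the abstract, so no weight formula is shown. All function names (`attribute_entropy`, `clope_profit`, `shuffled_copies`) are hypothetical.

```python
import math
import random
from collections import Counter


def attribute_entropy(records, attr_index):
    """Shannon entropy (in bits) of one categorical attribute.

    Assumption: records is a non-empty list of equal-length tuples.
    """
    counts = Counter(rec[attr_index] for rec in records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def clope_profit(clusters, r):
    """Standard CLOPE profit of a partition.

    Each cluster is a list of transactions; each transaction is a set of items.
    S(C) is the total number of item occurrences, W(C) the number of
    distinct items, and |C| the number of transactions in the cluster.
    """
    numerator = 0.0
    n_total = 0
    for cluster in clusters:
        occurrences = Counter(item for txn in cluster for item in txn)
        n_total += len(cluster)
        if not occurrences:
            continue  # empty cluster or only empty transactions: adds nothing
        size = sum(occurrences.values())   # S(C)
        width = len(occurrences)           # W(C)
        numerator += size / (width ** r) * len(cluster)
    return numerator / n_total if n_total else 0.0


def shuffled_copies(records, n_copies, seed=0):
    """Return n_copies independently shuffled copies of the input order."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        copy = list(records)
        rng.shuffle(copy)
        copies.append(copy)
    return copies
```

In a setup like the one the abstract describes, each shuffled copy would be clustered independently and the partition with the largest profit kept. That selection step is an assumption here, inferred only from the phrase "random sequence iteration".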
