Abstract

For clustering large-scale, high-dimensional, sparse categorical data, CLOPE offers substantially better clustering quality and running speed than traditional clustering algorithms. However, CLOPE has several shortcomings: its clustering quality is unstable, it does not distinguish how much each attribute (dimension) contributes to the clustering, and it requires the rejection (repulsion) factor r to be specified in advance. This paper therefore proposes RW-CLOPE, a clustering algorithm for categorical data based on random sequence iteration and attribute weights. RW-CLOPE uses a "shuffle" model to randomly reorder the raw data, eliminating the effect of input order on clustering quality. In addition, a method for calculating attribute weights from attribute entropy is proposed to distinguish the clustering contribution of each dimension, which greatly improves clustering quality. Finally, RW-CLOPE is implemented on Spark, an efficient cluster computing platform. Experiments on two real datasets show that RW-CLOPE achieves better clustering quality than p-CLOPE for the same number of datasets. On the mushrooms dataset, at the setting where CLOPE obtains its best result, RW-CLOPE achieves a profit value 68% higher than CLOPE and 25% higher than p-CLOPE. When processing massive data, the execution time of RW-CLOPE is much shorter than that of p-CLOPE. Given sufficient computing resources, the more shuffled copies of the data are used, the more pronounced the improvement in execution time.
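
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `profit` function follows the standard CLOPE profit criterion (total item occurrences S, number of distinct items W, and transaction count N per cluster, with exponent r). The helper names `attribute_entropy` and `shuffled_copies` are hypothetical. The abstract does not say how entropy is converted into a weight or how results from shuffled copies are combined, so neither step is sketched here.

```python
import math
import random
from collections import Counter


def profit(clusters, r=2.0):
    """Standard CLOPE profit: sum(S_i / W_i**r * N_i) / sum(N_i).

    `clusters` is a list of clusters; each cluster is a list of
    transactions, and each transaction is a set of items.
    """
    numerator, total = 0.0, 0
    for transactions in clusters:
        if not transactions:
            continue  # empty clusters contribute nothing
        occurrences = Counter(item for t in transactions for item in t)
        size = sum(occurrences.values())  # S: total item occurrences
        width = len(occurrences)          # W: number of distinct items
        numerator += size / width ** r * len(transactions)
        total += len(transactions)
    return numerator / total if total else 0.0


def attribute_entropy(rows, j):
    """Shannon entropy (base 2) of the values in column j.

    Assumption: this is only the entropy itself; the paper's mapping
    from entropy to an attribute weight is not given in the abstract.
    """
    counts = Counter(row[j] for row in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def shuffled_copies(rows, copies, seed=0):
    """Return `copies` independently shuffled copies of the input rows.

    Hypothetical helper illustrating the "shuffle" idea: each copy
    presents the same records in a different random order.
    """
    rng = random.Random(seed)
    result = []
    for _ in range(copies):
        copy = list(rows)
        rng.shuffle(copy)
        result.append(copy)
    return result
```

For example, a single cluster holding the transactions {a, b} and {c, d} has S = 4, W = 4 and N = 2, so with r = 2 its profit is (4 / 16 × 2) / 2 = 0.25. A cluster holding {a, b} twice has S = 4, W = 2 and profit 1.0, which reflects CLOPE's preference for clusters whose transactions share items.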
