Abstract
For clustering large-scale, high-dimensional, sparse categorical data, CLOPE offers substantially better clustering quality and running speed than traditional clustering algorithms. However, CLOPE has several shortcomings: its clustering quality is unstable, it does not distinguish how much each attribute (dimension) contributes to clustering, and it requires the repulsion factor r to be specified in advance. This paper therefore proposes RW-CLOPE, a clustering algorithm for categorical data based on random sequence iteration and attribute weights. RW-CLOPE uses a “shuffle” model to reorder the raw data randomly, which eliminates the effect of the data input sequence on clustering quality. In addition, a method for calculating attribute weights from attribute entropy is proposed to distinguish the clustering contribution of each dimension, which greatly improves clustering quality. Finally, RW-CLOPE is implemented on Spark, an efficient cluster computing platform. Experiments on two different real-world datasets show that RW-CLOPE achieves better clustering quality than p-CLOPE when the number of datasets is the same. On the mushrooms dataset, when CLOPE obtains its best results, RW-CLOPE achieves a profit value 68% larger than that of CLOPE and 25% larger than that of p-CLOPE. When processing massive data, the execution time of RW-CLOPE is much shorter than that of p-CLOPE. When sufficient computing resources are available, the improvement in execution time becomes more pronounced as the number of shuffled copies of the data increases.
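The quantities named in the abstract can be illustrated with a minimal single-machine Python sketch. It is not the authors' Spark implementation. The `profit` function follows the standard CLOPE criterion (the sum over clusters of S/W^r times the cluster size, divided by the total number of transactions). The helpers `attribute_entropy` and `shuffled_copies` are hypothetical names introduced here for illustration. The abstract does not state how entropy is converted into attribute weights, so that step is deliberately left out.

```python
import math
import random
from collections import Counter


def profit(clusters, r):
    """Standard CLOPE profit for a list of clusters.

    Each cluster is a list of transactions; each transaction is a set of items.
    S = total item occurrences in the cluster, W = number of distinct items,
    N = number of transactions, r = repulsion factor.
    """
    numerator = 0.0
    total = 0
    for transactions in clusters:
        if not transactions:
            continue  # empty clusters contribute nothing
        occurrences = Counter(item for t in transactions for item in t)
        size = sum(occurrences.values())   # S(C)
        width = len(occurrences)           # W(C)
        numerator += size / width ** r * len(transactions)
        total += len(transactions)
    return numerator / total if total else 0.0


def attribute_entropy(rows, j):
    """Shannon entropy (in bits) of the values in column j.

    Assumption: RW-CLOPE derives weights from a quantity like this one;
    the exact weighting formula is not given in the abstract.
    """
    counts = Counter(row[j] for row in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def shuffled_copies(rows, k, seed=0):
    """Return k independently shuffled copies of rows (hypothetical helper)."""
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        copy = list(rows)
        rng.shuffle(copy)
        copies.append(copy)
    return copies
```

In a sketch like this, each shuffled copy would be clustered independently and the result with the highest profit kept, which is one plausible reading of "random sequence iteration"; the paper itself should be consulted for the actual procedure.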