Abstract

This paper proposes a random sample partition-based clustering ensemble (RSP-CE) algorithm for big data clustering. RSP-CE has three key components: generating base clustering results on RSP data blocks, harmonizing the base clustering results with the maximum mean discrepancy (MMD) criterion, and refining the RSP clustering results. Because RSP data blocks have sample distributions consistent with that of the whole data set, it becomes possible to use base clustering results obtained on different data subsets to approximate the clustering result on the whole data set. In experiments against five other well-known clustering ensemble algorithms on four big data sets, RSP-CE obtains better normalized mutual information (NMI) and Fowlkes-Mallows index (FMI) values with less training time, which demonstrates that RSP-CE is a viable approach to big data clustering.
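To make the MMD criterion more concrete, the following is a minimal sketch of how the distributional similarity of two data blocks could be measured. It is not the paper's implementation: the Gaussian kernel, the `gamma` parameter, the biased estimator, and the helper names `rbf_kernel`, `mmd_squared`, and `random_partition` are all assumptions made for illustration, and the shuffle-and-split partition is only a simple stand-in for RSP block generation.

```python
import math
import random


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between samples X and Y.

    Assumed form: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    k_xx = sum(rbf_kernel(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    k_yy = sum(rbf_kernel(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    k_xy = sum(rbf_kernel(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return k_xx + k_yy - 2.0 * k_xy


def random_partition(data, n_blocks, seed=0):
    """Shuffle the data and deal it into n_blocks disjoint blocks.

    Hypothetical stand-in for RSP block generation, not the paper's procedure.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_blocks] for i in range(n_blocks)]


# Illustrative data: 200 two-dimensional Gaussian points (synthetic, not from the paper).
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]

block_a, block_b = random_partition(data, n_blocks=2)
shifted = [(x + 3.0, y + 3.0) for x, y in block_b]

mmd_same = mmd_squared(block_a, block_b)   # blocks drawn from the same data
mmd_shift = mmd_squared(block_a, shifted)  # second block moved away
print(mmd_same, mmd_shift)
```

Under these assumptions, two blocks drawn from the same data set should yield a squared MMD close to zero, while a block whose distribution has been shifted should yield a clearly larger value. That is the property that makes MMD a plausible criterion for matching base clustering results across blocks.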
