Abstract
With the explosive growing of data, there are challenges to deal with the large scale complex data. Many clustering algorithms have been proposed. Such as Affinity Propagation (AP) clustering Algorithm, AP takes similarity between pairs of data point as input measures. AP is a fast and efficient clustering algorithm for large dataset compared with the existing clustering algorithm. As the scale of data grows more explosively, the time efficiency of AP algorithm cannot be satisfied. Therefore, AP clustering algorithm based on Spark platform (Spark-AP) is proposed in this paper. Firstly, a dataset is partitioned into several Resilient Distributed Datasets (RDD) on a strategy and select the exemplars of each RDD. Then exemplars are merged and are used to next AP clustering algorithm, which forms a set of high-quality exemplars after convergence. Experiments show that Spark-AP performs better both in processing scale and processing time.
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.