Abstract

Traditional K-means distributed clustering algorithm has many problems in clustering big data, such as unstable clustering results, poor clustering results and low execution efficiency. In this paper, a density based initial clustering center selection method is proposed to improve the K-means distributed clustering algorithm. The algorithm uses the sample density, the distance between clusters and the cluster compact density, defines the product of the three as the difference weight density, and finds the sample point with the maximum difference weight density as the initial cluster center, so as to solve the problem of randomness and low quality of initial cluster center selection. At the same time, this paper uses spark parallel computing framework to implement the improved algorithm to further improve the processing performance of the algorithm in big data clustering.The experimental results show that the improved k-means distributed clustering algorithm based on spark parallel computing framework has higher execution efficiency, accuracy and good stability in big data clustering analysis.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.