Abstract

To address the poor performance of the single-machine serial programming model on large-scale data clustering, we combine big data technology with text clustering techniques. Using clustering algorithms based on the MapReduce programming model, we implement distributed storage and computation for text data, parallel text vectorization, and parallel clustering. The traditional k-means algorithm is a typical algorithm for solving clustering problems. It scales well to large data sets, but its initial centers are chosen randomly, so its results are unstable from run to run. To solve these problems, we first select and optimize the initial cluster centers based on density segmentation and sampling. Second, we sample the data set in parallel and, following the max-min method applied to the samples, search for and merge data objects in parallel to find the best candidate cluster centers. Finally, the selected initial cluster centers replace the randomly chosen centers of the k-means algorithm, and the clustering algorithm is parallelized. Experiments show that the improved k-means algorithm can effectively reduce the number of iterations.
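
The abstract does not give implementation details, so the following is only a minimal single-process sketch of the general idea, not the authors' method. It assumes that the max-min method means the common farthest-first rule (each new center is the sampled point whose distance to its nearest already-chosen center is largest), it omits the density segmentation step, and it imitates the map and reduce phases with plain Python functions rather than a real MapReduce job. All names (max_min_centers, map_assign, reduce_mean, kmeans) are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_min_centers(points, k, sample_size, seed=0):
    """Pick k initial centers from a random sample (assumed max-min rule).

    Each new center is the sampled point that lies farthest from its
    nearest already-chosen center. Assumes k <= number of sampled points.
    """
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = [sample[0]]
    # nearest[i] = squared distance from sample[i] to its nearest center
    nearest = [sq_dist(p, centers[0]) for p in sample]
    while len(centers) < k:
        idx = max(range(len(sample)), key=nearest.__getitem__)
        centers.append(sample[idx])
        nearest = [min(d, sq_dist(p, sample[idx]))
                   for d, p in zip(nearest, sample)]
    return centers


def map_assign(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))
    return idx, point


def reduce_mean(members):
    """Reduce step: average all points that share one center index."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def kmeans(points, centers, max_iter=100):
    """Run k-means from the given centers; return (centers, iterations)."""
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        groups = defaultdict(list)
        for p in points:  # stand-in for the parallel map phase
            idx, q = map_assign(p, centers)
            groups[idx].append(q)
        # stand-in for the reduce phase; an empty cluster keeps its center
        new_centers = [reduce_mean(groups[j]) if groups[j] else centers[j]
                       for j in range(len(centers))]
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter
```

A typical call would be `kmeans(points, max_min_centers(points, k=3, sample_size=100))`, where `points` is a list of equal-length numeric tuples. Comparing the returned iteration count against a run started from `random.sample(points, 3)` is one informal way to look at the effect the abstract reports, although a toy comparison like this says nothing about the authors' experiments.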
