Abstract

Following with the growth of massive data, clustering research, one of the core problems of big dataisfaced with more and more problems such as high computing complexity and lack of resource. It has proposed an improved parallel K-means algorithm based on Hadoop. To overcomethe problem that the traditional K-means algorithm often has local optimal solution due to the randomness choice of initial center, we introduce Canopy algorithm to initialize clustering center andapply K-means algorithm on canopy. Meanwhile, clusters are merged among canopies. The result is stable and iteration number is less. In addition, the parallel implementation methods and strategies of the improved algorithm are presented, combining with the distributed computing model of MapReduce. And a new method of text clustering is introduced by improving the similarity of measurement. The experiment results indicate the validity and scalability of our method.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.