Abstract

With the growth of the Internet, traditional data mining algorithms can no longer cope with mining information from very large data volumes. This paper combines recent cloud computing technology with a traditional data mining algorithm and uses the Hadoop platform to improve the algorithm's parallel processing ability. The K-Means algorithm depends on the initial value of k and on the initial center points. Building on an analysis of this defect, this paper proposes an improved K-Means algorithm based on sampling and density. Before K-Means clustering begins, the Hadoop platform is used to sample the initial data, and neighborhood density is used to determine the initial center points; clustering is then performed. Because the initial k value and the center points are determined from the sample and its density, they no longer have to be specified by hand at the initial stage. The improved K-Means algorithm is implemented with MapReduce, and Hadoop's parallel data processing improves its scalability. In the experiments, the improved algorithm proved to be more scalable.
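The idea described in the abstract (sample the data, use neighborhood density to pick the initial centers and hence k, then run K-Means) can be illustrated with a minimal single-machine sketch. This is not the paper's implementation: the function names, the `radius` and `min_density` parameters, and the rule that centers must lie more than `2 * radius` apart are all assumptions made for illustration, and the map and reduce steps are ordinary Python functions rather than Hadoop jobs.

```python
import math
import random


def sample_points(points, fraction, seed=0):
    """Draw a random subset of the data (stand-in for the Hadoop sampling step)."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * fraction))
    return rng.sample(points, size)


def neighborhood_density(sample, radius):
    """For each sampled point, count the other sampled points within `radius`."""
    return [
        sum(1 for j, q in enumerate(sample) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(sample)
    ]


def pick_initial_centers(sample, radius, min_density):
    """Choose dense, mutually distant sample points; k is the number chosen.

    The separation rule (more than 2 * radius apart) is an assumed heuristic.
    """
    density = neighborhood_density(sample, radius)
    by_density = sorted(range(len(sample)), key=lambda i: density[i], reverse=True)
    centers = []
    for i in by_density:
        if density[i] < min_density:
            break  # remaining points are too sparse to seed a cluster
        if all(math.dist(sample[i], c) > 2 * radius for c in centers):
            centers.append(sample[i])
    return centers


def assign_map(point, centers):
    """Map step: emit (index of nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
    return nearest, point


def update_reduce(members):
    """Reduce step: average the points assigned to one center."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def kmeans(points, centers, iterations=10):
    """Run a fixed number of K-Means iterations from the given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = {}
        for point in points:
            key, value = assign_map(point, centers)
            groups.setdefault(key, []).append(value)
        # A center that attracted no points keeps its previous position.
        centers = [
            update_reduce(groups[j]) if j in groups else centers[j]
            for j in range(len(centers))
        ]
    return centers


# Usage on two small synthetic groups of points (toy data, not from the paper).
grid = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
data = grid + [(x + 10.0, y + 10.0) for x, y in grid]
seeds = pick_initial_centers(sample_points(data, 0.5), radius=1.0, min_density=2)
print(len(seeds), kmeans(data, seeds))
```

In a Hadoop setting, `assign_map` and `update_reduce` would correspond to the mapper and reducer of one K-Means iteration, with the sampled, density-based seeding run once beforehand.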
