Large-scale Data Mining Method based on Clustering Algorithm Combined with MAPREDUCE

Yulun Zhang, Lei Yang, Chenxu Zhang, Hongyang Li

doi:10.62051/8p9b3106

Abstract

As information technology continues to develop, both the diversity and the volume of data keep growing. Effectively mining these text data to extract valuable content has become an urgent task in data research. This study combines the MapReduce distributed computing framework with the K-means clustering algorithm to meet the challenges of large-scale data mining. The paper also uses a distributed caching mechanism so that multiple cooperating MapReduce jobs do not have to request resources repeatedly, which improves data mining efficiency. Combining MapReduce's distributed computing with the strengths of K-means clustering yields an efficient and scalable method for large-scale data mining. Experimental results on both internal and external evaluation indicators show that the combination fully exploits the distributed, parallel computing characteristics of MapReduce and gives users an efficient and scalable data mining tool. Through this research, the paper provides new methods and insights for large-scale data mining and improves its efficiency and accuracy.
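
To make the combination concrete, the following is a minimal single-process sketch of how one K-means iteration can be expressed as a map step and a reduce step. It is an illustration under stated assumptions, not the paper's implementation: the function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) are hypothetical, Euclidean distance is assumed, and the current centroids are passed as a plain argument, standing in for whatever shared or cached copy a real distributed job would read.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step: sum the points and counts of one cluster, return their mean."""
    total, count = None, 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key (the shuffle), reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [
        reduce_mean(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol, or max_iter is hit."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(new_centroids, centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids
```

Each pass over the data corresponds to one map/reduce round, so an iterative algorithm like K-means chains many rounds that all need the latest centroids. That repeated need is the kind of cost the distributed caching mechanism described above is meant to reduce.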

Full Text