Abstract
Rapid advances in information and network technology have produced vast amounts of data, yet the shortage of knowledge extracted from that data is becoming increasingly serious. Data mining has developed vigorously in this environment. Based on the Spark programming model, this paper designs a parallel extension of fuzzy c-means. To improve the performance of this parallel extension, we borrow the improvement strategy that k-means uses in its initialization phase and extend k-means// to fuzzy c-means to obtain better clustering performance. Combining this with Spark's programming model yields an extended parallel fuzzy c-means algorithm. Several experiments on data sets show that the proposed algorithm has good scalability and parallelism, effectively extends fuzzy c-means clustering to distributed applications, and greatly increases the scale of data the algorithm can process. This improves the robustness of the algorithm and its adaptability to the shape and structure of the data, so that the parallel, scalable clustering algorithm can perform cluster analysis on big data more effectively. Three algorithms were simulated on the MATLAB platform. Using simple data sets and complex two-dimensional data sets, we compare the proposed algorithm with the traditional fuzzy c-means algorithm and with a fuzzy c-means algorithm based on fuzzy entropy. The experiments show that the scalable parallel fuzzy c-means algorithm not only greatly improves noise resistance but also improves convergence speed, and that it can automatically determine the optimal number of clusters.
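The abstract does not give the update rules, so the following is only a minimal sketch of how a fuzzy c-means iteration can be split into per-partition work and a global combine step. It assumes the standard fuzzy c-means membership and center updates with fuzzifier `m`, and it uses a plain Python loop over NumPy arrays as a stand-in for Spark's map and reduce stages. The function names `partial_sums`, `fcm_step`, and `fcm` are hypothetical and do not come from the paper.

```python
import numpy as np

def partial_sums(chunk, centers, m=2.0, eps=1e-12):
    """Map step (assumed): weighted sums computed on one data partition."""
    # Squared distances between every point and every center: (n_points, n_clusters)
    d2 = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    # Standard FCM membership: u_ij is proportional to d_ij^(-2/(m-1))
    inv = d2 ** (-1.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)
    w = u ** m
    # Numerator and denominator of the center update for this partition only
    return w.T @ chunk, w.sum(axis=0)

def fcm_step(partitions, centers, m=2.0):
    """Reduce step (assumed): add the partition sums and form new centers."""
    num = np.zeros_like(centers)
    den = np.zeros(centers.shape[0])
    for chunk in partitions:
        n, d = partial_sums(chunk, centers, m)
        num += n
        den += d
    return num / den[:, None]

def fcm(partitions, centers, m=2.0, tol=1e-6, max_iter=100):
    """Iterate until the centers move less than tol or max_iter is reached."""
    for _ in range(max_iter):
        new = fcm_step(partitions, centers, m)
        if np.linalg.norm(new - centers) < tol:
            return new
        centers = new
    return centers
```

Because each partition contributes only a pair of sums, the combined centers do not depend on how the data is split, which is what makes this kind of update a natural fit for a map and reduce style of execution.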
Highlights
With the rapid development and increasing popularity of the Internet, modern society is generating data at unimaginable speeds
The main problems of data mining lie in the following areas: (1) Mining methods and user interaction issues: these concern the types of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, specific mining, and knowledge display
The large capacity of many databases, the widespread distribution of data, and the computational complexity of some data mining algorithms are factors that motivate the development of parallel and distributed data mining algorithms
Summary
INDEX TERMS: Artificial intelligence, data mining, cluster analysis, scalable parallel fuzzy c-means, cloud computing.
With the rapid development and increasing popularity of the Internet, modern society is generating data at unimaginable speeds. Adding a probability-based initialization step for the cluster center values significantly improves the speed of the k-means algorithm.
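The summary does not say which probability-based initialization is meant. As an illustration only, the sketch below assumes k-means++-style seeding, in which each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name `d2_seeding` is hypothetical, and the sketch assumes the points are not all identical (otherwise the probabilities are undefined).

```python
import numpy as np

def d2_seeding(points, k, rng):
    """Choose k initial centers by squared-distance-weighted sampling (assumed scheme)."""
    n = points.shape[0]
    # First center: uniform random choice among the points
    centers = [points[rng.integers(n)]]
    # Squared distance from each point to its nearest chosen center
    d2 = ((points - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Points far from all current centers are more likely to be picked
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        d2 = np.minimum(d2, ((points - points[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

Seeds chosen this way tend to be spread across the data, so the subsequent iterations usually start closer to a good solution than with uniform random seeds.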