Abstract

Data mining extracts hidden, previously unknown knowledge and rules with potential decision-making value from massive data, and cluster analysis is an important research topic in this field. How to extract unknown information that users care about, and that supports decision analysis, from massive data is an urgent problem. In this paper, the DPDGA algorithm first uses a genetic-algorithm-based method to obtain better initial cluster centers and then partitions the data set according to those centers. For each local data set produced by the partition, the parameter MinPts is calculated, each local data set is clustered with the DBSCAN algorithm, and the clustering results of the local data sets are finally merged. To address the shortcomings of the DBSCAN algorithm, this paper further proposes DPDPSO, a DBSCAN algorithm that uses particle swarm optimization to partition the data and the MapReduce model to perform the computation in parallel. The DPDPSO algorithm first uses particle swarm optimization to obtain the optimal initial cluster centers and then partitions the data set accordingly. After partitioning, DBSCAN's own k-dist graph is used to determine ε and MinPts for each partition; the partitions are then merged according to certain rules, and data points that might otherwise be mistaken for noise points are merged back into clusters. Experiments show that when the amount of data reaches about 7M, the clustering method used in this article may take up to 1 hour; with fewer iterations, less time is required.
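The abstract mentions that DBSCAN's k-dist graph is used to choose ε and MinPts per partition but does not spell the procedure out. A minimal sketch of reading ε off a partition's sorted k-dist curve might look as follows; the function names, the brute-force neighbour search, and the "steepest drop" elbow rule are illustrative assumptions, not the paper's actual implementation:

```python
import math

def k_dist_curve(points, k):
    """Distance of every point to its k-th nearest neighbour,
    sorted in descending order -- the curve the k-dist graph plots."""
    curve = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(d[k - 1])
    return sorted(curve, reverse=True)

def estimate_eps(points, k=4):
    """Pick eps just past the steepest drop of the k-dist curve
    (a crude stand-in for reading the 'elbow' off the graph)."""
    curve = k_dist_curve(points, k)
    drops = [curve[i] - curve[i + 1] for i in range(len(curve) - 1)]
    elbow = max(range(len(drops)), key=drops.__getitem__)
    return curve[elbow + 1]
```

With ε estimated this way per partition (and MinPts tied to the chosen k), each partition can be handed to a standard DBSCAN run before the merge step described above.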
