Abstract
The Apriori algorithm is one of the classical algorithms for mining frequent itemsets, but because it must scan the transaction set repeatedly, its computational efficiency is seriously reduced and it is difficult to parallelize. With the advent of the big data era, data scale keeps increasing, which makes this problem more severe. To address it, this paper proposes GC-Apriori, a parallel Apriori method based on the vertical data format and the Spark computing framework. The vertical data format reduces duplication between transactions and improves the efficiency of both data storage and frequent itemset mining. In addition, Spark's broadcast variable mechanism is used to improve overall computing efficiency. In a performance comparison with other distributed Apriori algorithms at the same data scale, GC-Apriori achieves higher computational efficiency. The results show that the algorithm effectively improves the efficiency of Apriori frequent itemset mining in a distributed environment.
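To illustrate why a vertical data format avoids repeated scans of the transaction set, the following is a minimal single-machine sketch in Python. It is not the GC-Apriori implementation, and the function names `to_vertical` and `mine_frequent` are hypothetical. Each item is mapped once to the set of transaction ids that contain it (its tidset), and the support of a larger candidate itemset is then obtained by intersecting tidsets rather than rescanning the transactions. The sketch assumes items are mutually comparable (for example, strings) and that `min_support` is an absolute count.

```python
from itertools import combinations


def to_vertical(transactions):
    """Map each item to the set of transaction ids (tidset) containing it."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def mine_frequent(transactions, min_support):
    """Return {itemset as sorted tuple: support count} via tidset intersection."""
    vertical = to_vertical(transactions)  # the only pass over the raw transactions
    # Frequent 1-itemsets, keyed by one-element tuples.
    level = {
        (item,): tids
        for item, tids in vertical.items()
        if len(tids) >= min_support
    }
    frequent = {}
    while level:
        frequent.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        # Apriori-style join: combine two k-itemsets that share their first k-1 items.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                tids = level[a] & level[b]  # support of the union, no rescan needed
                if len(tids) >= min_support:
                    next_level[a + (b[-1],)] = tids
        level = next_level
    return frequent
```

In a Spark setting, a table such as the one built by `to_vertical`, or the current level of frequent itemsets, is the kind of read-only data that could be shared with workers through a broadcast variable. How GC-Apriori actually partitions and broadcasts its data is not specified in this abstract.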