An Improved Apriori Algorithm Based on the Spark Platform

Congshuai Xia, Gang Fang, Wenqiang Gao, Qian Zhao

doi:10.18686/aitr.v2i3.4406


Abstract

The Apriori algorithm is one of the most classical algorithms for mining frequent itemsets. However, it must scan the transaction set repeatedly, which seriously reduces its computational efficiency and makes it difficult to parallelize. With the advent of the big data era, data scale keeps increasing, which makes this problem more acute. To address it, this paper proposes GC-Apriori, a parallel Apriori method based on the vertical data format and the Spark computing framework. The vertical data format reduces duplication between transactions and improves the efficiency of both data storage and frequent itemset mining. At the same time, Spark's broadcast variable mechanism is used to improve overall computing efficiency. In a performance comparison with other distributed Apriori algorithms at the same scale, GC-Apriori achieves higher computational efficiency. The results show that the algorithm effectively improves the efficiency of Apriori frequent itemset mining in a distributed environment.
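To illustrate the vertical data format mentioned in the abstract, the following is a minimal single-machine sketch in plain Python. It is not the GC-Apriori algorithm itself, whose details the abstract does not give. The function names (to_vertical, frequent_itemsets) and the absolute min_support threshold are assumptions made for illustration. Each item is mapped to the set of transaction IDs that contain it (its tidset), so the support of a candidate itemset is obtained by intersecting tidsets rather than rescanning the transaction set. In a Spark setting, one plausible arrangement would be to share the tidset map with executors through a broadcast variable, but that is likewise an assumption here.

```python
from itertools import combinations


def to_vertical(transactions):
    """Convert horizontal transactions into a vertical item -> tidset map."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support.

    min_support is an absolute transaction count (an assumption of this sketch).
    """
    vertical = to_vertical(transactions)

    # Level 1: frequent single items, each stored with its tidset.
    current = {
        frozenset([item]): tids
        for item, tids in vertical.items()
        if len(tids) >= min_support
    }

    result = {}
    k = 1
    while current:
        result.update({itemset: len(tids) for itemset, tids in current.items()})
        k += 1
        candidates = {}
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            union = a | b
            if len(union) != k or union in candidates:
                continue
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if any(frozenset(sub) not in current
                   for sub in combinations(union, k - 1)):
                continue
            # Support counting by tidset intersection, with no data rescan.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                candidates[union] = tids
        current = candidates
    return result
```

Calling frequent_itemsets on a list of item sets with an absolute threshold returns a dictionary keyed by frozenset. Because the tidsets shrink at each level, the intersections become cheaper as the itemsets grow, which is the usual motivation for the vertical layout.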
