Abstract

The Apriori algorithm is a classical data mining algorithm that is widely used for mining association rules, but it has several disadvantages. In this paper, we address the three main steps of the Apriori algorithm (connection, pruning, and counting) and propose a novel method that combines a prefixed-itemset storage structure with database optimization to improve the algorithm on a Hadoop cluster. First, we use the prefixed-itemset storage structure to improve how the connection step and the pruning step of the traditional Apriori algorithm are implemented, which increases execution efficiency. Second, we change the storage schema of the database by converting the original transaction database into a transaction-state matrix, which makes the counting step more efficient. Third, we use properties of frequent itemsets to improve the iterative termination condition, thereby reducing the running time. Finally, we parallelize the optimized algorithm with MapReduce on the Hadoop distributed architecture. The experimental results show that, compared with the traditional Apriori algorithm, the improved Apriori algorithm on the Hadoop cluster greatly improves execution efficiency.
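
The three optimizations described above can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the function names (`build_state_matrix`, `support`, `join_by_prefix`, `apriori`) are hypothetical, the transaction-state matrix is assumed to be stored as one bitmask per item, and the early-termination rule shown (stop when fewer than k+1 frequent k-itemsets remain) is one well-known property of frequent itemsets that may differ from the condition the paper uses. The MapReduce parallelization is omitted.

```python
from itertools import combinations

def build_state_matrix(transactions):
    """Assumed layout: one integer bitmask per item; bit r is set if transaction r contains the item."""
    items = sorted({i for t in transactions for i in t})
    columns = {item: 0 for item in items}
    for row, t in enumerate(transactions):
        for item in t:
            columns[item] |= 1 << row
    return items, columns

def support(itemset, columns):
    """Counting step: AND the item columns and count the surviving rows."""
    it = iter(itemset)
    mask = columns[next(it)]
    for item in it:
        mask &= columns[item]
    return bin(mask).count("1")

def join_by_prefix(frequent_k):
    """Connection and pruning: only itemsets sharing a (k-1)-prefix are joined."""
    groups = {}
    for itemset in sorted(frequent_k):
        groups.setdefault(itemset[:-1], []).append(itemset[-1])
    known = set(frequent_k)
    candidates = []
    for prefix, tails in groups.items():
        for a, b in combinations(tails, 2):
            cand = prefix + (a, b)
            # Prune: every k-subset of a candidate must itself be frequent.
            if all(sub in known for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (sorted tuple) to its support count."""
    items, columns = build_state_matrix(transactions)
    level = [(i,) for i in items if support((i,), columns) >= min_support]
    result = {}
    while level:
        for s in level:
            result[s] = support(s, columns)
        k = len(level[0])
        # Assumed termination rule: a frequent (k+1)-itemset needs k+1 frequent k-subsets.
        if len(level) < k + 1:
            break
        level = [c for c in join_by_prefix(level) if support(c, columns) >= min_support]
    return result
```

Calling `apriori(transactions, min_support)` with a list of item sets and an absolute support threshold returns the frequent itemsets with their counts. Because each item's column is a single integer, the counting step never rescans the original transaction list.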
