Abstract

The discovery of frequent itemsets is one of the most important topics in data mining. Frequent itemset discovery techniques help generate qualitative knowledge that gives business insight and supports decision makers. In the Big Data era, a customizable algorithm that can process big datasets in reasonable time has become a necessity. In this paper we propose a new algorithm for frequent itemset discovery that can work in a distributed manner on big datasets. Our approach is based on the original Buddy Prima algorithm and on the Greatest Common Divisor (GCD) calculation between itemsets that exist in the transaction database. The proposed algorithm introduces a new method for parallelizing frequent itemset mining that does not need to generate candidate itemsets and avoids any communication overhead between the participating nodes. When run on a single node, it exploits the parallelism available in the hardware. The proposed approach could be implemented using the map-reduce technique or Spark. It was successfully applied to transaction databases of different sizes and compared with two well-known algorithms, FP-Growth and Parallel Apriori, at different support levels. The experiments showed that the proposed algorithm achieves a major improvement in running time over both algorithms, especially on datasets with a very large number of items.
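The role of the GCD can be illustrated with a minimal sketch. It assumes the prime-number representation commonly associated with Buddy Prima, in which each item is mapped to a distinct prime and a transaction is stored as the product of its items' primes. The item names, the `ITEM_PRIMES` mapping, and the helper functions `encode` and `decode` are hypothetical and chosen only for illustration; this is not the proposed algorithm itself, only the encoding idea it builds on.

```python
from math import gcd, prod

# Hypothetical mapping: each distinct item gets its own prime number.
ITEM_PRIMES = {"bread": 2, "cheese": 3, "chips": 5, "milk": 7}


def encode(transaction):
    """Represent a transaction (a set of items) as the product of its primes."""
    return prod(ITEM_PRIMES[item] for item in transaction)


def decode(code):
    """Recover the items whose primes divide the given code."""
    return {item for item, prime in ITEM_PRIMES.items() if code % prime == 0}


t1 = encode({"cheese", "chips", "milk"})   # 3 * 5 * 7 = 105
t2 = encode({"bread", "cheese", "chips"})  # 2 * 3 * 5 = 30

# The GCD of two encoded transactions encodes exactly their shared items.
common = gcd(t1, t2)                       # 15 = 3 * 5
shared_items = decode(common)              # {"cheese", "chips"}
```

Because every prime appears at most once in each product, the GCD of two codes is the product of the primes they share, so the common itemset is obtained without enumerating candidate itemsets.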

Highlights

  • Frequent itemsets discovery “is one of the most important techniques in data mining” (Zhengui Li, 2012)

  • In this paper we propose a parallelizable algorithm for frequent itemset mining (FIM) that can handle big datasets by exploiting the multicore capability of the hardware

  • We used the Retail dataset to demonstrate this capability of the proposed algorithm, POBPA


Summary

Introduction

Frequent itemsets discovery “is one of the most important techniques in data mining” (Zhengui Li, 2012). It can uncover association relationships among events or data objects that are hidden in the data, even when the associated events or objects do not seem related at all. The literature contains many approaches to the frequent itemset mining (FIM) problem, such as Apriori, FP-Growth, multi-level frequent itemsets, DHP (Direct Hashing and Pruning), maximal association rule mining, primitive association rules, soft-matching rules and Buddy Prima. An association rule such as cheese ⇒ chips (80%) states that four out of five customers who bought cheese also bought chips. Such rules can be useful for decisions concerning product pricing, promotions, store layout and many others.
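The 80% figure in the cheese and chips rule is the rule's confidence: the fraction of transactions containing cheese that also contain chips. A minimal sketch of that calculation follows; the six-transaction database and the function names `support_count` and `confidence` are made up for illustration and are not taken from the paper.

```python
def support_count(transactions, itemset):
    """Count the transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    antecedent_count = support_count(transactions, antecedent)
    if antecedent_count == 0:
        return 0.0  # rule is undefined on this data; report 0 by convention
    both_count = support_count(transactions, antecedent | consequent)
    return both_count / antecedent_count


# Toy database: cheese occurs in five transactions, four of which include chips.
transactions = [
    frozenset({"cheese", "chips"}),
    frozenset({"cheese", "chips", "milk"}),
    frozenset({"cheese", "chips", "bread"}),
    frozenset({"cheese", "chips"}),
    frozenset({"cheese", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_confidence = confidence(
    transactions, frozenset({"cheese"}), frozenset({"chips"})
)  # 4 / 5 = 0.8
```

An itemset is called frequent when its support count reaches a user-chosen minimum support; rules such as the one above are then derived only from frequent itemsets.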

Literature Review
Prime Numbers Representation Algorithms
Data Preparation
Frequent Itemsets Deduction using GCD
Experimental Results
Conclusion and future work