Abstract

Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns in transactional databases, with applications in prediction, association rule mining, and classification. Apriori is an elementary, iterative FIM algorithm that finds frequent itemsets by scanning the dataset multiple times, generating frequent itemsets of different cardinalities. Apriori's performance degrades as data grow larger because of these repeated dataset scans. Eclat is a scalable variant of Apriori that uses a vertical layout. The vertical layout has several advantages: it avoids multiple dataset scans, and it carries the information needed to compute each itemset's support. In a vertical layout, the support of an itemset can be obtained by intersecting sets of transaction ids (tidsets/tids) and pruning irrelevant itemsets. However, when tidsets grow too large for memory, algorithm efficiency suffers. In this paper, we introduce SHFIM (Spark-based hybrid frequent itemset mining), a three-phase algorithm that utilizes both horizontal and vertical layouts and uses diffsets instead of tidsets, keeping track of the differences between transaction ids rather than their intersections. Moreover, several improvements are developed to decrease the number of candidate itemsets. SHFIM is implemented and tested on the Spark framework, whose resilient distributed datasets (RDDs) and in-memory processing address the shortcomings of the MapReduce framework. We compared the performance of SHFIM with Spark-based Eclat and dEclat algorithms on four benchmark datasets. Experimental results show that SHFIM outperforms the Spark-based Eclat and dEclat algorithms on both dense and sparse datasets in terms of execution time.
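To make the tidset/diffset distinction concrete, the following is a minimal sketch in plain Python (illustrative only, not the paper's Spark implementation; the toy transactions are assumptions for the example): support obtained by tidset intersection as in Eclat, versus support derived from a diffset as in dEclat.

    # Horizontal layout: transaction id -> set of items.
    transactions = {
        1: {"a", "b", "c"},
        2: {"a", "c"},
        3: {"a", "d"},
        4: {"b", "c"},
    }

    # Vertical layout: item -> tidset (ids of the transactions containing it).
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    # Eclat-style: support of {a, c} is the size of the tidset intersection.
    support_ac = len(tidsets["a"] & tidsets["c"])        # 2 (transactions 1, 2)

    # dEclat-style: diffset(ac) = tidset(a) - tidset(c); the support follows
    # by subtracting the diffset size from the prefix item's support.
    diff_ac = tidsets["a"] - tidsets["c"]                # {3}
    support_ac_diff = len(tidsets["a"]) - len(diff_ac)   # 3 - 1 = 2

Both routes yield a support of 2 for {a, c}; the appeal of diffsets is that they tend to stay small on dense datasets, where tidsets overlap heavily and their intersections remain large.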

Published: 14 January 2022

Highlights

  • We are currently living in the big data age

  • Eclat is a Spark-based vertical-layout algorithm that depends on calculating intersections between itemsets; dEclat, on the other hand, depends on calculating differences between itemsets (see the Spark sketch after this list)

  • We discovered that SHFIM is well suited to datasets containing thousands, if not millions, of variable-length transactions: thanks to the enhancements made, it adapts to such datasets at both high and low minimum support thresholds without memory leaks or spilling
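As a rough companion sketch of the Spark side (hedged PySpark, not SHFIM itself; the input data, application name, and min_sup threshold are assumptions for illustration), the horizontal layout can be flipped into the vertical item-to-tidset layout in a single distributed pass, after which pruning infrequent single items is a filter on tidset size:

    from pyspark import SparkContext

    sc = SparkContext(appName="vertical-layout-sketch")

    # Horizontal input: (tid, list of items), e.g. parsed from a text file.
    horizontal = sc.parallelize([
        (1, ["a", "b", "c"]),
        (2, ["a", "c"]),
        (3, ["a", "d"]),
        (4, ["b", "c"]),
    ])

    # Flip to the vertical layout (item -> tidset) in one distributed pass,
    # so later support counts do not rescan the whole dataset.
    vertical = (horizontal
                .flatMap(lambda kv: [(item, kv[0]) for item in kv[1]])
                .groupByKey()
                .mapValues(set))

    # Prune infrequent single items: support(item) = |tidset(item)|.
    min_sup = 2
    frequent_1 = vertical.filter(lambda kv: len(kv[1]) >= min_sup)
    print(sorted((item, sorted(tids)) for item, tids in frequent_1.collect()))

    sc.stop()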

Introduction

We are currently living in the big data age. Companies and individuals gather and store the data they generate in order to analyze it for insight, knowledge, and decision making. We have been swamped with big data not only because we already hold large amounts of data that need to be processed, but also because the amount of data grows rapidly every moment. The concept of big data is characterized by a set of properties collectively known as the "3Vs" model. Volume refers to the amount of data: enormous quantities are generated and gathered. Velocity refers to the high rate at which data are created, gathered, and processed (stream, batch, near-real-time, and real-time). Variety indicates the different types of data: audio, images/video, and text; conventional structured data; and mixed data. Two more features, veracity and value, extend the "3Vs" into what is known as the "5Vs" model.
