New Spark solutions for distributed frequent itemset and association rule mining algorithms

Carlos Fernandez-Basso, M Dolores Ruiz, Maria J Martin-Bautista

doi:10.1007/s10586-023-04014-w

Open Access

Abstract

The large amount of data generated every day makes it necessary to re-implement existing methods so that they can handle massive data efficiently. This is the case for association rules, an unsupervised data mining tool capable of extracting information in the form of IF-THEN patterns. Although several methods have been proposed for extracting frequent itemsets (the phase that precedes association rule mining) from very large databases, high computational cost and insufficient memory remain major problems when processing large data. The aim of this paper is therefore threefold: (1) to review existing algorithms for frequent itemset and association rule mining, (2) to develop new, efficient Big Data algorithms for frequent itemset mining using distributed computation, as well as a new association rule mining algorithm in Spark, and (3) to compare the proposed algorithms with existing proposals while varying the number of transactions and the number of items. For this purpose, we have used the Spark platform, which has been shown to outperform existing distributed algorithmic implementations.
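
To make the two phases named in the abstract concrete, the following is a minimal single-machine sketch in plain Python. It is not Spark code and not any of the algorithms proposed in the paper; the function names `frequent_itemsets` and `association_rules`, the toy transactions, and the thresholds are all assumptions made for illustration. It first finds the frequent itemsets by brute-force level-wise counting, then derives IF-THEN rules from them.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Phase 1 (illustrative): map each frequent itemset to its support.

    Candidates are enumerated level by level (size 1, 2, ...). The loop
    stops at the first level with no frequent itemset, which is safe
    because every superset of an infrequent itemset is also infrequent.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    size = 1
    while True:
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            # Support = fraction of transactions containing the candidate.
            support = sum(1 for t in transactions if cset <= t) / n
            if support >= min_support:
                frequent[cset] = support
                found = True
        if not found:
            break
        size += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Phase 2 (illustrative): derive IF antecedent THEN consequent rules.

    confidence(A -> B) = support(A union B) / support(A).
    """
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent,
                # so its support is already in the dictionary.
                confidence = support / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, support, confidence))
    return rules


# Hypothetical toy data: four market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)

for antecedent, consequent, support, confidence in rules:
    print(f"IF {sorted(antecedent)} THEN {sorted(consequent)} "
          f"(support={support:.2f}, confidence={confidence:.2f})")
```

On this toy data the only rule that passes both thresholds is IF butter THEN bread. The counting loop rescans every transaction for every candidate, which is the cost that motivates distributing the computation on large databases.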

Full Text