Using hashing and lexicographic order for Frequent Itemsets Mining on data streams

Lázaro Bustio-Martínez, Martín Letras-Luna, René Cumplido, Raudel Hernández-León, Claudia Feregrino-Uribe, José M Bande-Serrano

doi:10.1016/j.jpdc.2018.11.002

Abstract

Frequent Itemsets Mining is a Data Mining technique that has been employed to extract useful knowledge from datasets and, more recently, from data streams. Data streams are unbounded flows of data that arrive at high rates and cannot be stored for off-line processing; therefore, algorithms designed for Frequent Itemsets Mining on static datasets cannot be applied directly to data streams. Frequent Itemsets Mining is a compute-intensive task, so developing custom hardware-based architectures to speed it up is an active research topic. This paper introduces an algorithm for hardware-based Frequent Itemsets Mining on data streams that uses the detection of the top-k frequent 1-itemsets as a preprocessing step. Received transactions are handled using hash functions, and the lexicographic order of items is used to obtain the frequent itemsets. The proposed algorithm focuses on discovering frequent itemsets in data streams composed of short transactions over large alphabets. Experimental results demonstrate that the proposed algorithm outperforms, in processing time, the state-of-the-art algorithms used as baselines.
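
The three ingredients named in the abstract (top-k frequent 1-itemsets as preprocessing, hashing of transactions, and lexicographic ordering of items) can be illustrated with a minimal software sketch. This is not the paper's hardware algorithm: it assumes a finite list of transactions rather than an unbounded stream, makes two passes over it, and uses a Python dictionary as the hash table. The names `top_k_items`, `mine_frequent_itemsets`, `k`, and `min_support` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def top_k_items(transactions, k):
    """Return the k most frequent single items (hypothetical helper)."""
    counts = Counter(item for t in transactions for item in set(t))
    # Rank by descending count; break ties lexicographically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {item for item, _ in ranked[:k]}


def mine_frequent_itemsets(transactions, k, min_support):
    """Count itemsets built only from the top-k items.

    Assumption: transactions are short, so enumerating every subset of
    a filtered transaction is affordable.
    """
    keep = top_k_items(transactions, k)
    table = Counter()  # hash table keyed by the itemset tuple
    for t in transactions:
        # Drop items outside the top-k, then sort lexicographically so
        # each itemset has exactly one canonical key.
        items = sorted(set(t) & keep)
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                table[itemset] += 1
    return {s: c for s, c in table.items() if c >= min_support}


stream = [["a", "b", "c"], ["a", "b"], ["b", "c", "d"], ["a", "b", "d"]]
result = mine_frequent_itemsets(stream, k=3, min_support=2)
```

Sorting each filtered transaction before enumerating its subsets means that a given set of items always maps to the same tuple, so a single hash lookup is enough to update its count regardless of the order in which the items arrived.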

Full Text