Parallel Mining of Top-K Frequent Itemsets in Very Large Text Database

Yongheng Wang, Yan Jia, Shuqiang Yang

doi:10.1007/11563952_68

Abstract

Frequent itemset mining is a common and useful task in data mining, but most current mining algorithms cannot be applied to very large text databases. In this paper, we propose parTFI, a novel and efficient parallel algorithm that finds the top-k frequent itemsets of a specified minimum length in very large text databases. Based on a simple data structure, H-struct, parTFI uses a novel logical vertical data partitioning technique to mine top-k frequent itemsets on each mining server in parallel. Our performance study shows that, on very large sparse text databases, parTFI outperforms Apriori and FP-growth, two efficient frequent itemset mining algorithms, even when both run with a well-tuned min_support. Furthermore, by creating the H-struct dynamically, parTFI can handle huge datasets that most other algorithms cannot process.
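To make the problem statement concrete, the following is a minimal brute-force sketch of what "top-k frequent itemsets with a specified minimum length" means. It is not parTFI: it uses no H-struct, no data partitioning, and no parallelism, and it enumerates every subset of each transaction, so it is only practical for tiny inputs. The function name `top_k_frequent_itemsets`, its parameters, and the sample documents are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def top_k_frequent_itemsets(transactions, k, min_len):
    """Return the k itemsets of length >= min_len with the highest support.

    Brute-force illustration of the problem definition only (hypothetical
    helper, not the paper's algorithm). Support is the number of
    transactions that contain the itemset.
    """
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        for size in range(min_len, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    # Rank by support (descending); break ties lexicographically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Each "document" is treated as a transaction of terms (made-up sample data).
docs = [
    ["data", "mining", "text"],
    ["data", "mining"],
    ["data", "text"],
    ["mining", "text", "parallel"],
]
result = top_k_frequent_itemsets(docs, k=3, min_len=2)
for itemset, support in result:
    print(itemset, support)
```

Note that the caller supplies k and a minimum length rather than a min_support threshold, which is the practical difference from Apriori-style interfaces that the abstract alludes to.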

Full Text