Author Identification Using Imbalanced and Limited Training Texts

Efstathios Stamatatos

doi:10.1109/dexa.2007.5

Abstract

Discovering association rules that identify relationships among sets of items is an important problem in data mining. Finding frequent itemsets is computationally the most expensive step in association rule discovery, and it has therefore attracted significant research attention [1]. Discovery of frequently occurring subsets of items, called itemsets, is the core of many data mining methods. Most previous studies adopt Apriori-like algorithms, which iteratively generate candidate itemsets and check their occurrence frequencies in the database. These approaches suffer from the serious cost of repeated passes over the analyzed database. In this paper, we propose a new data structure based on Binary Decision Diagrams (BDDs), called TreeSupBDD. TreeSupBDD extends the idea behind the FP-Tree [9] and ITL-Tree [5] structures, aiming to improve storage compression and to allow frequent pattern mining without an explicit candidate itemset generation step. To address the cost of repeated database passes, we propose a novel method, called TreeSupBDD-Mine, that reduces the database activity of frequent itemset discovery algorithms. The idea of TreeSupBDD-Mine is to use a Binary Decision Diagram and a tree to represent both the database and the frequent itemsets. The proposed method requires a single scan of the source database, which is used to build the associated tree and BDD and to check the supports of discovered itemsets. The originality of our work lies in the fact that the proposed algorithm extracts the frequent itemsets directly from the TreeSupBDD. A series of experiments gave very encouraging results and demonstrates the method's performance improvements. We extend the binary decision diagram structure to store transaction groups and propose a new method to discover frequent itemsets.
To study the trade-offs of the new representation of transactions in a binary decision diagram, we compare the performance of our algorithm with the fastest Apriori [2] implementation and the latest extension of FP-Growth [15]. We tested all the algorithms on different benchmark datasets. The performance study shows that the new algorithm significantly reduces the processing time for mining frequent itemsets from dense datasets that contain relatively long patterns, and at low support thresholds. We discuss the performance results in detail, along with the strengths and limitations of our algorithm.
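
For readers unfamiliar with the baseline, the repeated-pass cost attributed above to Apriori-like algorithms can be illustrated with a minimal sketch. This is a generic level-wise miner written for illustration only; it is not the TreeSupBDD-Mine method, and the function name `apriori`, its parameters, and the `passes` counter are assumptions introduced here, not taken from the paper.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Generic level-wise mining sketch (illustrative, not the paper's method).

    Returns (frequent, passes): a dict mapping each frequent itemset
    (a frozenset) to its support count, and the number of database passes.
    """
    transactions = [frozenset(t) for t in transactions]
    # Level-1 candidates: every distinct item as a singleton itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    passes = 0
    k = 1
    while candidates:
        # One full pass over the database per level: the repeated cost.
        passes += 1
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, passes


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    itemsets, n_passes = apriori(db, min_support=3)
    print(n_passes, sorted("".join(sorted(s)) for s in itemsets))
```

In this sketch the database is scanned once per itemset length, so longer patterns mean more passes. That per-level scan is the cost a single-scan structure is meant to avoid.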

Full Text