Abstract

Feature selection is often used in email classification to reduce the dimensionality of the feature space. In this study, a new document frequency and term frequency combined feature selection method (DTFS) is proposed to improve the performance of email classification. Firstly, an existing optimal document frequency based feature selection method (ODFFS) and a predetermined threshold are applied to select the most discriminative features. Secondly, an existing optimal term frequency based feature selection (OTFFS) method and another predetermined threshold are applied to select more discriminative features. Finally, ODFFS and OTFFS are combined to select the remaining features. In order to improve the convergence rate of parameter optimization, a metaheuristic method, namely global best harmony oriented harmony search (GBHS), is proposed to search these optimal predetermined thresholds. Experiments with fuzzy Support Vector Machine (FSVM) and Naïve Bayesian (NB) classifiers are applied on six corpuses: PU2, CSDMC2010, PU3, Lingspam, Enron-spam and Trec2007. Experimental results show that, DTFS outperforms other methods: such as Chi-squre, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, two-step based hybrid feature selection method and improved term frequency inverse document frequency method on six corpuses.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call