Abstract

Feature selection, which can reduce the dimensionality of the vector space without sacrificing classifier performance, is commonly used in spam filtering. Because many classifiers cannot handle high-dimensional feature spaces, noisy, irrelevant and redundant features should be removed. In this paper, a two-step hybrid feature selection method, called TFSM, is proposed. First, we select the most discriminative features with an existing document frequency based feature selection method (called ODFFS). Second, we select the remaining features by combining ODFFS with a newly proposed term frequency based feature selection method (called NTFFS). Moreover, we propose a new meta-heuristic optimization method, called GOPSO, to improve the convergence rate of standard particle swarm optimization. In the experiments, Support Vector Machine (SVM) and Naive Bayesian (NB) classifiers are used on four corpora: PU2, PU3, Enron-spam and Trec2007. The experimental results show that TFSM is significantly superior to information gain, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain and the improved term frequency inverse document frequency method on all four corpora, with both the SVM and the NB classifier.
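
The two-step idea can be illustrated with a minimal sketch. The abstract does not define the ODFFS or NTFFS scoring formulas, so the two scoring functions below are simple stand-ins (differences in per-class document frequency and term frequency), and the names `two_step_select`, `first_step_fraction` and `weight` are hypothetical. The sketch shows only the control flow: a first pass driven by a document frequency score, then a second pass over the remaining terms driven by a combined score.

```python
def doc_freq_score(term, spam_docs, ham_docs):
    """Stand-in document frequency score (NOT the paper's ODFFS):
    absolute difference in the fraction of documents containing the term."""
    spam_df = sum(term in doc for doc in spam_docs) / len(spam_docs)
    ham_df = sum(term in doc for doc in ham_docs) / len(ham_docs)
    return abs(spam_df - ham_df)


def term_freq_score(term, spam_docs, ham_docs):
    """Stand-in term frequency score (NOT the paper's NTFFS):
    absolute difference in the term's relative frequency per class."""
    spam_tf = sum(doc.count(term) for doc in spam_docs) / sum(len(d) for d in spam_docs)
    ham_tf = sum(doc.count(term) for doc in ham_docs) / sum(len(d) for d in ham_docs)
    return abs(spam_tf - ham_tf)


def two_step_select(spam_docs, ham_docs, k, first_step_fraction=0.5, weight=0.5):
    """Select k terms in two steps. Documents are lists of tokens.

    first_step_fraction and weight are assumed parameters for illustration.
    """
    vocab = sorted({t for doc in spam_docs + ham_docs for t in doc})
    df = {t: doc_freq_score(t, spam_docs, ham_docs) for t in vocab}
    tf = {t: term_freq_score(t, spam_docs, ham_docs) for t in vocab}

    # Step 1: take the top terms by the document frequency score alone.
    n_first = int(k * first_step_fraction)
    step1 = sorted(vocab, key=lambda t: (-df[t], t))[:n_first]

    # Step 2: rank the remaining terms by a weighted combination of both scores.
    chosen = set(step1)
    remaining = [t for t in vocab if t not in chosen]
    combined = {t: weight * df[t] + (1 - weight) * tf[t] for t in remaining}
    step2 = sorted(remaining, key=lambda t: (-combined[t], t))[: k - n_first]

    return step1 + step2


if __name__ == "__main__":
    spam = [["free", "money", "free"], ["free", "offer"]]
    ham = [["meeting", "today"], ["meeting", "money"]]
    print(two_step_select(spam, ham, k=2))
```

Ties are broken alphabetically so that the selection is deterministic. In practice the selected terms would be used to build the feature vectors passed to a classifier such as SVM or NB.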
