Abstract

Feature selection plays an important role in spam filtering. Automatic feature selection methods such as document frequency thresholding (DF) and information gain (IG) are commonly applied in spam filtering, which can also be seen as a special two-class text categorization (TC) problem. Many existing experiments show that IG is one of the most effective methods for text categorization. However, which method is most effective for spam filtering? To our knowledge, there has been no systematic study of these feature selection methods on spam filtering. This paper is a comparative study of feature selection methods in spam filtering, with a focus on aggressive dimensionality reduction. We explore two classifiers (Naïve Bayes and SVM) and run our experiments on a Chinese spam collection. Six methods were evaluated: term selection based on document frequency (DF), information gain (IG), the χ² statistic, expected cross entropy (ECE), the weight of evidence for text (WET), and the odds ratio (ODD). We found ODD and WET most effective in our experiments. In contrast, IG and χ² had relatively poor performance due to their bias toward favoring rare terms and their sensitivity to probability estimation errors.
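As a minimal illustration (not taken from the paper), two of the term-goodness criteria named above, information gain and the odds ratio, can be computed for a candidate term from its 2×2 document-count contingency table against the spam/ham classes. The counts and smoothing constant below are illustrative assumptions:

```python
import math

def information_gain(n11, n10, n01, n00):
    """Information gain of term t for the spam class, given:
    n11 = spam docs containing t,  n10 = ham docs containing t,
    n01 = spam docs without t,     n00 = ham docs without t."""
    n = n11 + n10 + n01 + n00

    def h(*counts):
        # entropy (in bits) of a distribution given by raw counts
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total)
                    for c in counts if c > 0)

    # class-prior entropy minus expected entropy after observing t
    prior = h(n11 + n01, n10 + n00)
    p_t = (n11 + n10) / n
    cond = p_t * h(n11, n10) + (1 - p_t) * h(n01, n00)
    return prior - cond

def odds_ratio(n11, n10, n01, n00, eps=0.5):
    """Odds ratio of term t for the spam class, with add-eps
    smoothing so zero counts do not cause division by zero."""
    p_pos = (n11 + eps) / (n11 + n01 + 2 * eps)   # P(t | spam)
    p_neg = (n10 + eps) / (n10 + n00 + 2 * eps)   # P(t | ham)
    return (p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg)
```

For aggressive dimensionality reduction, one would score every vocabulary term with such a criterion and keep only the top-k terms; a term appearing in 90 of 100 spam documents but only 5 of 100 ham documents scores high under both criteria, while a term split evenly between the classes has information gain near zero.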
