Abstract

Feature selection plays an important role in spam filtering. Automatic feature selection methods such as document frequency thresholding (DF) and information gain (IG) are commonly applied in spam filtering, which can also be seen as a special two-class text categorization (TC) problem. Many existing experiments show that IG is one of the most effective methods for text categorization. However, which method is most effective for spam filtering? To our knowledge, there has been no systematic study of these feature selection methods on spam filtering. This paper is a comparative study of feature selection methods in spam filtering, with a focus on aggressive dimensionality reduction. We explore two classifiers (Naïve Bayes and SVM) and run our experiments on a Chinese spam collection. Six methods were evaluated: term selection based on document frequency (DF), information gain (IG), the χ² statistic, expected cross entropy (ECE), the weight of evidence for text (WET), and the odds ratio (ODD). We found ODD and WET most effective in our experiments. In contrast, IG and χ² had relatively poor performance due to their bias toward favoring rare terms and their sensitivity to probability estimation errors.
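As a minimal illustration (not taken from the paper), two of the term-goodness criteria named above, information gain and the odds ratio, can be computed for a candidate term from its 2×2 document-count contingency table against the spam/ham classes. The counts and smoothing constant below are illustrative assumptions:

```python
import math

def information_gain(n11, n10, n01, n00):
    """Information gain of term t for the spam class, given:
    n11 = spam docs containing t,  n10 = ham docs containing t,
    n01 = spam docs without t,     n00 = ham docs without t."""
    n = n11 + n10 + n01 + n00

    def h(*counts):
        # entropy (in bits) of a distribution given by raw counts
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total)
                    for c in counts if c > 0)

    # class-prior entropy minus expected entropy after observing t
    prior = h(n11 + n01, n10 + n00)
    p_t = (n11 + n10) / n
    cond = p_t * h(n11, n10) + (1 - p_t) * h(n01, n00)
    return prior - cond

def odds_ratio(n11, n10, n01, n00, eps=0.5):
    """Odds ratio of term t for the spam class, with add-eps
    smoothing so zero counts do not cause division by zero."""
    p_pos = (n11 + eps) / (n11 + n01 + 2 * eps)   # P(t | spam)
    p_neg = (n10 + eps) / (n10 + n00 + 2 * eps)   # P(t | ham)
    return (p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg)
```

For aggressive dimensionality reduction, one would score every vocabulary term with such a criterion and keep only the top-k terms; a term appearing in 90 of 100 spam documents but only 5 of 100 ham documents scores high under both criteria, while a term split evenly between the classes has information gain near zero.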
