Improved spam e-mail filtering based on committee machines and information theoretic feature extraction

V. Zorkadis, M. Panayotou, D. A. Karras

doi:10.1109/ijcnn.2005.1555826

Abstract

This paper considers a novel approach to spam e-mail filtering based on committee machine neural network models and information-theoretic feature extraction. An extensive experimental study, the most extensive so far in the literature, is conducted on widely accepted benchmark e-mail data sets. It compares the proposed methodology with the naive Bayes spam filter, the boosting tree methodology, classification based on linear models (classification via regression), and classification based on nonlinear models using simple neural networks, including multilayer perceptrons. Several feature extraction approaches based on information theory are also evaluated. The mail categorization performance of committee machines is shown to compare very favorably with that of the rival methods, including the Bayes spam filter, which is the most widely used approach in the e-mail services market. The proposed information-theoretic Boolean features are also found to achieve remarkably high spam categorization performance compared with their analog counterparts.
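
The two ideas named in the abstract, information-theoretic Boolean features and a committee of classifiers, can be illustrated with a minimal Python sketch. The abstract does not state which information-theoretic criterion or which combination rule the authors use, so the choice of mutual information for term scoring and of simple majority voting for the committee are assumptions made here for illustration only. The function names (`mutual_information`, `select_terms`, `committee_vote`) and the toy messages are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information in bits between a Boolean feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, c), n_fc in joint.items():
        # p(f, c) / (p(f) * p(c)) expressed with raw counts
        ratio = n_fc * n / (feature_counts[f] * label_counts[c])
        mi += (n_fc / n) * math.log2(ratio)
    return mi


def select_terms(messages, labels, k):
    """Rank word-presence (Boolean) features by mutual information, keep the top k."""
    tokenized = [set(m.split()) for m in messages]
    vocabulary = sorted(set().union(*tokenized))
    scored = []
    for word in vocabulary:
        feature = [word in tokens for tokens in tokenized]
        scored.append((mutual_information(feature, labels), word))
    # Highest score first; ties broken alphabetically for a stable result
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]


def committee_vote(members, x):
    """Majority vote: True (spam) if more than half of the members say spam."""
    votes = sum(1 for member in members if member(x))
    return 2 * votes > len(members)


# Toy data, invented for this sketch (True = spam)
messages = ["win cash now", "cheap cash offer", "meeting at noon", "lunch at noon"]
labels = [True, True, False, False]

terms = select_terms(messages, labels, 3)

# Three hypothetical committee members, each a trivial rule on the message text
members = [
    lambda text: "cash" in text.split(),
    lambda text: "noon" not in text.split(),
    lambda text: "at" not in text.split(),
]
```

In this sketch each committee member is a hand-written rule; in a setting like the one the abstract describes, the members would instead be trained classifiers (for example neural networks) that take the selected Boolean features as input.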

Full Text