Abstract

This paper presents a novel approach to spam e-mail filtering based on information-theoretic extraction of higher-order features and committee machine neural network models. An extensive experimental study, the most extensive so far in the literature, is conducted on widely accepted benchmark e-mail data sets. It compares the proposed methodology with the Naïve Bayes spam filter, the Boosting tree methodology, linear model-based classification (classification via regression), and nonlinear model-based classification using simple neural network models, including Multilayer Perceptrons. Several information-theoretic feature extraction approaches are also evaluated, chiefly by comparing the proposed higher-order feature extraction with information-theoretic extraction of single features. The higher-order features outperform the single features, and the proposed information-theoretic Boolean features achieve remarkably high spam categorization performance compared with their analog counterparts. Finally, the committee machines' mail categorization performance compares very favorably with that of the rival methods, including the Bayes spam filter, which is the most widely used approach in the e-mail services market.
