Efficient information theoretic extraction of higher order features for improving neural network-based spam e-mail categorization

V Zorkadis, D A Karras

doi:10.1080/09528130600975873

Abstract

A novel approach to spam e-mail filtering is considered, based on information theoretic extraction of higher order features and committee machine neural network models. An extensive experimental study, the most extensive so far in the literature, is carried out on widely accepted benchmark e-mail data sets. It compares the proposed methodology with the Naïve Bayes spam filter, the Boosting tree methodology, linear model-based classification (classification via regression) and nonlinear model-based classification using simple neural network models, including Multilayer Perceptrons. Several feature extraction approaches based on information theory are also evaluated, chiefly comparing the proposed higher order feature extraction methodology with information theoretic extraction of single features. The former is shown to outperform the latter, and the proposed information theoretic Boolean features are shown to achieve remarkably high spam categorization performance compared with their analog counterparts. Finally, the mail categorization performance of the committee machines is shown to compare very favorably with that of the rival methods, including the Bayes spam filter, which is the most widely used approach in the e-mail services market.
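The abstract does not spell out the extraction procedure, so the following is only an illustrative sketch of the contrast it draws between single features and higher order features. It assumes mutual information with the spam label as the information theoretic criterion, Boolean word-presence features, and a higher order feature formed as the conjunction of two words; the toy e-mails and the helper names (`mutual_information`, `single_feature`, `pair_feature`) are hypothetical and are not taken from the paper.

```python
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    # Sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over observed value pairs.
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

# Hypothetical toy corpus: each e-mail is a set of words, label 1 = spam.
emails = [
    {"free", "click"}, {"free", "click"}, {"free"}, {"click"},
    {"meeting"}, {"free", "meeting"}, {"click", "meeting"},
    {"free", "click", "meeting"},
]
labels = [1, 1, 0, 0, 0, 0, 0, 1]

def single_feature(word):
    """Boolean feature: 1 if the word occurs in the e-mail, else 0."""
    return [int(word in mail) for mail in emails]

def pair_feature(word_a, word_b):
    """Assumed higher order Boolean feature: 1 only if both words occur."""
    return [int(word_a in mail and word_b in mail) for mail in emails]

mi_free = mutual_information(single_feature("free"), labels)
mi_click = mutual_information(single_feature("click"), labels)
mi_pair = mutual_information(pair_feature("free", "click"), labels)

print(f"MI(free; spam)           = {mi_free:.3f} bits")
print(f"MI(click; spam)          = {mi_click:.3f} bits")
print(f"MI(free AND click; spam) = {mi_pair:.3f} bits")
```

In this toy corpus the conjunction matches the label exactly, so it carries more information about the class than either word alone. That is the kind of gain a higher order feature can offer over single features, although the paper's actual criterion and feature construction may differ.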

Full Text