Occam’s razor-based spam filter

Tiago A. Almeida, Akebo Yamakami

doi:10.1007/s13174-012-0067-x

Abstract

E-mail spam is no longer a novelty, but it remains an important and growing problem with a large economic impact on society. Spammers manage to circumvent current spam filters and harm the communication system by consuming resources, damaging the reliability of e-mail as a communication instrument, and tricking recipients into reacting to spam messages. Consequently, spam filtering poses a special problem in text categorization, whose defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle. Furthermore, we have conducted an empirical experiment on six public, real, non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updateable, and clearly outperforms state-of-the-art spam filters.

Highlights

E-mail is one of the most popular, fastest and cheapest means of communication
We present a spam filtering approach based on the minimum description length (MDL) principle and compare its performance with seven different Naïve Bayes classifiers and support vector machines (SVMs)
We have presented a new spam filtering approach based on the MDL principle that has proved to be very fast to construct and incrementally updateable

Summary

Introduction

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate.

Similar to the work of Frank et al. [27], Teahan and Harper [47] performed extensive experiments to evaluate the performance of different approaches for text categorization on the standard Reuters-21578 collection. They compared compression-based algorithms, such as prediction by partial matching (PPM), with Naïve Bayes classifiers and SVMs. Bratko et al. [18] investigated the performance achieved by data compression models in the spam filtering task. They evaluated the filtering performance of two different compression algorithms: dynamic Markov compression (DMC) and prediction by partial matching (PPM).

We present a spam filtering approach based on the MDL principle and compare its performance with seven different Naïve Bayes classifiers and SVMs. Here, we carry out an evaluation with the practical purpose of filtering e-mail spam in order to compare the currently top-performing spam filters.
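To make the MDL idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple per-class token-frequency model with additive smoothing, and it labels a message with the class whose model encodes the message in the fewest bits. The class name `MDLSketchFilter`, the `tokenize` helper, and the smoothing constant are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


class MDLSketchFilter:
    """Toy two-class filter: pick the class whose model needs fewer bits."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing  # additive smoothing constant (assumption)
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, tokens, label):
        # Incremental update: only counters change, nothing is rebuilt.
        self.counts[label].update(tokens)
        self.totals[label] += len(tokens)
        self.vocab.update(tokens)

    def description_length(self, tokens, label):
        # Bits needed to encode the tokens under the class model:
        # the sum of -log2 P(token | class), with additive smoothing.
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen tokens
        denom = self.totals[label] + self.smoothing * vocab_size
        bits = 0.0
        for tok in tokens:
            p = (self.counts[label][tok] + self.smoothing) / denom
            bits -= math.log2(p)
        return bits

    def classify(self, tokens):
        spam_bits = self.description_length(tokens, "spam")
        ham_bits = self.description_length(tokens, "ham")
        return "spam" if spam_bits < ham_bits else "ham"


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


if __name__ == "__main__":
    f = MDLSketchFilter()
    f.train(tokenize("cheap pills buy now"), "spam")
    f.train(tokenize("meeting agenda for monday"), "ham")
    print(f.classify(tokenize("buy cheap pills")))
```

Because training only increments counters, a sketch like this can absorb each newly labeled message without rebuilding the model, which is one way to read the "incrementally updateable" property claimed in the abstract.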

Basic concepts

Spam filtering based on minimum description length principle

Preprocessing and tokenization

Training method

Experimental results

Conclusions and further work