A Bayesian Topic Model for Spam Filtering

Zhiying Zhang

doi:10.12733/jics20102279

Abstract

Spam is one of the major problems of today’s Internet because it brings financial damage to companies and annoys individual users. Among those approaches developed to detect spam, the content-based machine learning algorithms are important and popular. However, these algorithms are trained using statistical representations of the terms that usually appear in the e-mails. Additionally, these methods are unable to account for the underlying semantics of terms within the messages. In this paper, we present a Bayesian topic model to address these limitations. We explore the use of semantics in spam filtering by representing e-mails as vectors of topics with a topic model: the Latent Dirichlet Allocation (LDA). Based upon this representation, the relationship between the topics and spam can be discovered by using a Bayesian method. We test this model on the Enron-Spam datasets and results show that the proposed model performs better than the baseline and can detect the internal semantics of spam messages.

Full Text