Spam Filtering Using Statistical Data Compression Models

Andrej Bratko, Gordon V Cormack, Bogdan Filipič, Thomas R Lynam, Blaž Zupan

doi:10.5555/1248547.1248644

Abstract

Spam filtering poses a special problem in text categorization: its defining characteristic is that filters face an active adversary who constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel approach to spam filtering based on adaptive statistical data compression models. The nature of these models allows them to be employed as probabilistic text classifiers based on character-level or binary sequences. Because messages are modeled as sequences, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updatable. We evaluate the filtering performance of two different compression algorithms: dynamic Markov compression and prediction by partial matching. The results of our empirical evaluation indicate that compression models outperform currently established spam filters, as well as a number of methods proposed in previous studies.
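
The idea of using a character-level sequence model as a classifier can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a fixed-order character model with add-one smoothing in place of dynamic Markov compression or prediction by partial matching, and the names `CharModel`, `update`, `bits` and `classify`, the order of 2, and the alphabet size of 256 are all assumptions made for illustration. One model is trained per class, and a message is assigned to the class whose model would encode it in fewer bits.

```python
import math
from collections import defaultdict


class CharModel:
    """Fixed-order character model with add-one smoothing.

    Hypothetical stand-in for an adaptive compression model such as
    PPM or DMC; it is NOT either of those algorithms.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order                  # context length (assumed value)
        self.alphabet_size = alphabet_size  # smoothing denominator (assumed value)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _contexts(self, text):
        # Pad so that the first characters also have a full-length context.
        padded = "\0" * self.order + text
        for i in range(self.order, len(padded)):
            yield padded[i - self.order:i], padded[i]

    def update(self, text):
        """Incrementally add one message to the model's statistics."""
        for ctx, ch in self._contexts(text):
            self.counts[ctx][ch] += 1
            self.totals[ctx] += 1

    def bits(self, text):
        """Estimated code length of `text` in bits under this model."""
        total = 0.0
        for ctx, ch in self._contexts(text):
            seen = self.counts.get(ctx, {}).get(ch, 0)
            n = self.totals.get(ctx, 0)
            p = (seen + 1) / (n + self.alphabet_size)
            total -= math.log2(p)
        return total


def classify(message, spam_model, ham_model):
    """Pick the class whose model encodes the message in fewer bits."""
    if spam_model.bits(message) < ham_model.bits(message):
        return "spam"
    return "ham"


# Toy training data, invented purely for this example.
spam_model = CharModel()
ham_model = CharModel()
spam_model.update("buy cheap pills now, cheap pills online")
ham_model.update("meeting agenda for the project review tomorrow")

print(classify("cheap pills", spam_model, ham_model))
print(classify("project review", spam_model, ham_model))
```

Because `update` only increments counts, new labeled messages can be folded in one at a time, which mirrors the incremental updating the abstract describes. No tokenizer is involved: the model sees the raw character sequence.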

Journal: Journal of Machine Learning Research
Publication Date: Dec 1, 2006
Citations: 177

Similar Papers

An Improved Text Categorization Methodology Based on Second and Third Order Probabilistic Feature Extraction and Neural Network Classifiers
D A Karras
01 Jan 2006

TV Commercial Classification by using Multi-Modal Textual Information
Yantao Zheng ... Qi Tian
01 Jul 2006

A Robust Meaning Extraction Methodology Using Supervised Neural Networks
D.A Karras ... B.G Mertzios
01 Jan 2002

Study on Feature Selection and Weighting Based on Synonym Merge in Text Categorization
Zhenyu Lu ... Shuang Zhao
01 Jan 2009