Abstract

AbstractWe address the problem of unsupervised and semi-supervised SMS (Short Message Service) text message SPAM detection. We develop a content-based Bayesian classification approach which is a modest extension of the technique discussed by Resnik and Hardisty in 2010. The approach assumes that the bodies of the SMS messages arise from a probabilistic generative model and estimates the model parameters by Gibbs sampling using an unlabeled, or partially labeled, SMS training message corpus. The approach classifies new SMS messages as SPAM or HAM (non-SPAM) by zero-thresholding their logit estimates. We tested the approach on a publicly available SMS corpora collected from the UK. Used in semi-supervised fashion, the approach clearly outperformed a competing algorithm, Semi-Boost. Used in unsupervised fashion, the approach outperformed a fully supervised classifier, an SVM (Support Vector Machine), when the number of training messages used by the SVM was small and performed comparably otherwise. We believe the approach works well and is a useful tool for SMS SPAM detection.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call