A case for unsupervised-learning-based spam filtering

Feng Qian, Abhinav Pathak, Yinglian Xie, Zhuoqing Morley Mao, Yu Charlie Hu

doi:10.1145/1811039.1811090

Abstract

Traditional content-based spam filtering systems rely on supervised machine learning techniques. In the training phase, labeled email instances are used to build a learning model (e.g., a Naive Bayes classifier or a support vector machine), which is then applied to future incoming emails in the detection phase. However, this critical reliance on training data is one of the major limitations of supervised spam filters. Preparing labeled training data is often labor-intensive and can delay the learning-detection cycle. Furthermore, any mislabeling of the training corpus (e.g., due to spammers’ obfuscations) can severely affect detection accuracy. Supervised learning schemes share one common mechanism regardless of their algorithmic details: learning is performed on an individual email basis. This is the fundamental reason supervised spam filters require training data. In other words, in the learning phase these classifiers can never tell on their own whether an email is spam or ham, because they examine one email instance at a time.

We investigate the feasibility of a completely unsupervised-learning-based spam filtering scheme that requires no training data. Our study is motivated by three key observations about spam in today’s Internet. (1) The vast majority of emails are spam. (2) A spam email should always belong to some campaign [2, 3]. (3) Spam from the same campaign is generated from templates that obfuscate some parts of the message, e.g., sensitive terms, leaving the other parts unmodified [3]. These observations suggest that, in principle, we can achieve unsupervised spam detection by examining emails at the campaign level. In particular, we need robust spam identification algorithms to find common terms shared by spam belonging to the same campaign. These common terms form signatures that can be used to detect future spam of the same campaign.

This paper presents SpamCampaignAssassin (SCA), an online unsupervised spam learning and detection scheme. SCA performs accurate spam campaign identification, campaign signature generation, and spam detection using campaign signatures. To our knowledge, SCA is the first unsupervised spam filtering scheme that achieves accuracy comparable to the de facto supervised spam filters by explicitly exploiting online campaign identification. The full paper describing SCA is available as a technical report [4].
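
The idea of campaign signatures built from common terms can be illustrated with a toy sketch. This is not SCA's algorithm, which the abstract does not specify. It assumes the emails have already been grouped into one campaign (the identification step is not shown), uses plain whitespace tokenization, and introduces hypothetical names (`campaign_signature`, `matches`, `min_fraction`, `min_overlap`) and made-up sample messages purely for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; returns the set of distinct terms."""
    return set(text.lower().split())


def campaign_signature(emails, min_fraction=1.0):
    """Hypothetical helper: terms present in at least `min_fraction` of the emails.

    Assumes `emails` all belong to the same campaign (grouping is not shown).
    """
    counts = Counter()
    for email in emails:
        counts.update(tokenize(email))  # each term counted once per email
    threshold = min_fraction * len(emails)
    return frozenset(term for term, n in counts.items() if n >= threshold)


def matches(signature, email, min_overlap=1.0):
    """Hypothetical helper: does the email contain enough of the signature terms?"""
    if not signature:
        return False
    tokens = tokenize(email)
    return len(signature & tokens) >= min_overlap * len(signature)


# Made-up campaign: one template, with only the sensitive term obfuscated.
campaign = [
    "cheap v1agra pills order now at our online store",
    "cheap vi@gra pills order now at our online store",
    "cheap viagr4 pills order now at our online store",
]
signature = campaign_signature(campaign)

new_spam = "cheap v!agra pills order now at our online store"
ham = "meeting moved to 3pm see you at the office"
print(sorted(signature))
print(matches(signature, new_spam), matches(signature, ham))
```

In this sketch the obfuscated term differs across messages and drops out of the signature, while the unmodified template terms remain and match a new variant from the same template. A realistic system would need campaign identification and much more robust term selection than exact set intersection.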
