Algorithm Comparison on Email Spam Filtering Task

Yixuan Li

doi:10.54097/hset.v34i.5436

Abstract

Email has long been a major form of communication among organizations and individual users. In recent years, with the rise of internet use, email spam has become increasingly common. Spam raises security concerns because it exposes users to potential losses through fake advertisements, invalid information, undetected viruses, and other harmful content. Various techniques have been developed for spam filtering, using classification algorithms to sort emails into categories. This article investigates how machine-learning-based algorithms are used in email spam filtering by reviewing previous research that has been shown to be successful. The algorithms range from supervised learning, including the Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Naïve Bayes (NB), to unsupervised learning such as artificial neural networks (ANN) and partitional clustering. This article also presents an experiment that compares implementations of three of these algorithms: SVM, NB, and KNN. The results show that NB gave the highest accuracy. A second run of the same experiment was conducted with an improved data cleaning procedure and larger testing sets. The data from the second run again show that the NB implementation gave the highest accuracy in detecting spam emails.

Full Text