Highly Accurate Spam Detection with the Help of Feature Selection and Data Transformation

Hidayet Takcı, Fatema Nusrat

doi:10.34028/iajit/20/1/4

Abstract

The volume of spam is growing rapidly as email becomes more popular, which has created a need to filter spam emails. To date, many knowledge-based, learning-based, and clustering-based methods have been developed for this purpose. This study targets machine-learning-based spam detection using the C4.5, ID3, RndTree, C-Support Vector Classification (C-SVC), and Naïve Bayes algorithms. In addition, feature selection and data transformation methods were applied to increase spam detection success. Experiments were performed on the UC Irvine Machine Learning Repository (UCI) spambase dataset, and the results were compared in terms of accuracy, Receiver Operating Characteristic (ROC) analysis, and classification speed. In the accuracy comparison, the C-SVC algorithm achieved the highest accuracy, 93.13%, followed by the RndTree algorithm. In the ROC analysis, the RndTree algorithm gave the best Area Under the Curve (AUC) value, 0.999, and the C4.5 algorithm gave the second-best result. The fastest classifiers were the Naïve Bayes and RndTree algorithms. The experiments showed that feature selection and data transformation increased spam detection success: the data transformation that increased classification success the most was the binary transformation, and the most effective feature selection method was forward selection.
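To make the two preprocessing steps named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the binary transformation maps each feature to 1 when its value exceeds a threshold (0 here) and to 0 otherwise, and that forward selection greedily adds the feature that most improves a score. The tiny Bernoulli Naïve Bayes scorer, the leave-one-out accuracy measure, the function names, and the six-row toy dataset are all hypothetical stand-ins chosen to keep the example self-contained; they are not taken from the paper or from the spambase data.

```python
import math


def binarize(rows, threshold=0.0):
    """Binary transformation (assumed form): 1 if value > threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in rows]


def nb_predict(train_X, train_y, x, features):
    """Predict one sample with a Laplace-smoothed Bernoulli Naive Bayes."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        score = math.log(len(rows) / len(train_X))  # log prior
        for f in features:
            p1 = (sum(r[f] for r in rows) + 1) / (len(rows) + 2)
            score += math.log(p1 if x[f] == 1 else 1 - p1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy using only the given feature indices."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nb_predict(train_X, train_y, X[i], features) == y[i]
    return hits / len(X)


def forward_selection(X, y):
    """Greedy forward selection: add the best feature until the score stalls."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in remaining]
        # Ties are broken in favour of the lowest feature index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Hypothetical toy data: three frequency-style features per email.
raw = [
    [0.32, 0.00, 1.2],
    [1.10, 0.00, 0.0],
    [0.45, 0.00, 0.7],
    [0.00, 0.50, 0.9],
    [0.00, 0.21, 0.0],
    [0.00, 0.80, 0.3],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

X = binarize(raw)
selected, score = forward_selection(X, labels)
print("selected feature indices:", selected, "leave-one-out accuracy:", score)
```

On real data, the scorer would be replaced by whichever classifier is being evaluated, and the selection score would normally come from a held-out split or cross-validation rather than this small leave-one-out loop.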
