Abstract

Spam is no longer mere garbage but a security risk: it now often carries virus attachments and spyware agents that can ruin recipients' systems, so there is a growing need for spam detection. Many spam detection techniques based on machine learning algorithms have been proposed. Because bulk mailing tools have increased the volume of spam tremendously, spam detection techniques must be able to cope with that volume. Parameter optimization and feature selection have been proposed to reduce processing overhead while maintaining high detection rates. However, previous approaches have not taken into account variable importance or the optimal number of features, and none has yet used both together. In this paper, we propose an optimal spam detection model based on Random Forests (RF) that supports both parameter optimization and feature selection. We optimize two parameters of RF to maximize the detection rate. We provide the variable importance of each feature so that irrelevant features can be eliminated easily. Furthermore, we determine the optimal number of selected features using two methods: (i) a single parameter optimization for the entire feature selection process, and (ii) parameter optimization in every feature elimination phase. We carry out experiments on the Spambase dataset and show the feasibility of our approach.
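
The two feature-count selection methods described in the abstract can be sketched as a single backward-elimination loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `backward_elimination`, `tune`, `fit_score`, and `retune_each_step` are hypothetical names, the elimination of exactly one feature per phase is an assumption, and the two callables stand in for whatever Random Forest library performs the parameter search and reports detection rate and variable importance.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: the abstract does not name the two RF parameters,
# so parameters are treated as an opaque name -> value mapping.
Params = Dict[str, int]
TuneFn = Callable[[Sequence[str]], Params]
FitFn = Callable[[Sequence[str], Params], Tuple[float, Dict[str, float]]]


def backward_elimination(
    features: Sequence[str],
    tune: TuneFn,          # assumed: returns optimized parameters for a feature set
    fit_score: FitFn,      # assumed: returns (detection rate, importance per feature)
    retune_each_step: bool,
    min_features: int = 1,
) -> Tuple[float, List[str], Params]:
    """Drop the least important feature one at a time and keep the best subset.

    retune_each_step=False mirrors method (i): parameters are optimized once.
    retune_each_step=True mirrors method (ii): parameters are re-optimized
    in every elimination phase.
    """
    current = list(features)
    params = tune(current)  # optimized once on the full feature set
    best = None
    first_pass = True
    while True:
        if retune_each_step and not first_pass:
            params = tune(current)
        first_pass = False
        score, importance = fit_score(current, params)
        # ">=" prefers the smaller subset when detection rates tie.
        if best is None or score >= best[0]:
            best = (score, list(current), dict(params))
        if len(current) <= min_features:
            break
        # Eliminate the feature with the lowest variable importance.
        current.remove(min(current, key=lambda name: importance[name]))
    return best
```

The returned tuple holds the best detection rate seen, the feature subset that produced it, and the parameters in effect at that point, so the length of the second element is the selected number of features. Method (ii) calls `tune` once per elimination phase and is therefore more expensive than method (i).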
