Abstract

Classifying spam e-mail documents is a very challenging task for e-mail users, especially non-IT users. Billions of people use the internet and face the problem of spam e-mails. The automatic identification and classification of spam e-mails helps reduce the burden on e-mail users of managing a large volume of e-mails. This work aims to make a significant contribution by building a robust model for the classification of spam e-mail documents using data mining techniques. In this paper, we use the Enron1 data set, which consists of spam and ham documents collected from the Kaggle repository. We propose Ensemble Model-1, an ensemble of Multilayer Perceptron (MLP), Naïve Bayes (NB) and Random Forest (RF), to obtain better accuracy for the classification of spam and ham e-mail documents. Experimental results reveal that the proposed Ensemble Model-1 outperforms other existing classifiers as well as the other proposed ensemble models in terms of classification accuracy. The proposed Ensemble Model-1 achieves a high accuracy of 97.25% for the classification of spam e-mail documents.

Highlights

  • Much of the previous research work on data mining has focused on structured data

  • We propose four ensemble models namely Ensemble Model-1, Ensemble Model-2, Ensemble Model-3 and Ensemble Model-4, where Ensemble Model-1 is a combination of Multilayer Perceptron (MLP), Naïve Bayes (NB) and Random Forest (RF), Ensemble Model-2 is a combination of MLP, NB and Support Vector Machine (SVM), Ensemble Model-3 is a combination of SVM, NB and RF and Ensemble Model-4 is a combination of MLP, NB, RF and SVM

  • We propose an ensemble model and check the robustness and efficiency of the model
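The highlights list which classifiers make up each ensemble but do not say how the member predictions are combined. The sketch below assumes simple majority (hard) voting, which is only one common option and may not be the authors' method. The names `ENSEMBLES`, `majority_vote` and `ensemble_predict`, the tie-breaking rule, and the example predictions are all hypothetical and chosen for illustration.

```python
from collections import Counter

# Member classifiers of each ensemble, as listed in the highlights.
ENSEMBLES = {
    "Ensemble Model-1": ("MLP", "NB", "RF"),
    "Ensemble Model-2": ("MLP", "NB", "SVM"),
    "Ensemble Model-3": ("SVM", "NB", "RF"),
    "Ensemble Model-4": ("MLP", "NB", "RF", "SVM"),
}


def majority_vote(member_labels, tie_label="ham"):
    """Return the most common label among the members.

    Assumption: when the top two labels have equal counts (possible with
    four members), fall back to tie_label. The source gives no tie rule.
    """
    counts = Counter(member_labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


def ensemble_predict(ensemble_name, predictions):
    """Combine per-classifier labels for one e-mail.

    predictions maps a classifier name (e.g. "MLP") to its predicted label.
    """
    members = ENSEMBLES[ensemble_name]
    return majority_vote([predictions[m] for m in members])


# Hypothetical per-classifier outputs for a single e-mail.
example = {"MLP": "spam", "NB": "spam", "RF": "ham", "SVM": "ham"}
print(ensemble_predict("Ensemble Model-1", example))  # spam (2 of 3 votes)
print(ensemble_predict("Ensemble Model-4", example))  # ham (2-2 tie, assumed fallback)
```

With an odd number of members and two classes, as in Ensemble Model-1 to Model-3, a tie cannot occur. The fallback only matters for the four-member Ensemble Model-4.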


Summary

Introduction

Much of the previous research work on data mining has focused on structured data. Text databases, however, store a valuable share of the available information. A text database is a collection of a huge number of documents gathered from various sources such as news stories, books, digital libraries, research papers, e-mails, web pages and various social media sites. These days, a vast majority of the data in government, industry, business and other organizations is stored electronically as text databases [23]. The main reason why spam e-mails keep increasing in mailboxes is a lack of awareness among Internet users. Because of this problem, the classification of spam e-mail text (documents) is a significant research topic.

