Abstract

E-mail is now widely used for communication over the Internet, and e-mail data accounts for a large share of Internet traffic. Many companies and organizations use e-mail services to promote their products and services. Filtering out spam messages is therefore very important for saving users' time. Machine learning methods play a vital role in spam detection, but they face the problem of high-dimensional feature vectors, so feature reduction methods are very important for obtaining better results from machine learning approaches. In this paper, Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Information Gain (IG) are used for feature reduction. E-mail messages are then classified as spam or ham using seven different classifiers, namely Naive Bayesian, AdaBoost, Random Forest, Support Vector Machine, J48, Bagging, and JRip. A comparative study of these techniques is carried out on the TREC 2007 Spam E-mail Corpus with different feature sizes.
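
As an illustration of the feature reduction step, the following minimal sketch ranks terms by Information Gain and keeps the top k. It is not the implementation used in this paper: the function names (`entropy`, `information_gain`, `top_k_terms`), the binary "term occurs in the message" feature, and the four toy messages are assumptions made for the example, and the messages are not taken from the TREC 2007 corpus.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the message' w.r.t. the label."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy that remains after splitting on the term, weighted by split size.
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy messages, each tokenized into a set of words.
docs = [
    {"free", "offer", "click"},
    {"free", "prize", "click"},
    {"meeting", "agenda", "offer"},
    {"meeting", "notes", "lunch"},
]
labels = ["spam", "spam", "ham", "ham"]

selected = top_k_terms(docs, labels, 3)
```

In this toy data, a term such as "offer" appears equally often in spam and ham, so it carries no information about the class and is ranked below terms that occur in only one class. The reduced feature set in `selected` would then be passed to a classifier.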
