Comparative Study of Feature Reduction and Machine Learning Methods for Spam Detection

Basant Agarwal, Namita Mittal

doi:10.1007/978-81-322-1602-5_81

Abstract

E-mail is now widely used for communication over the Internet, and e-mail data accounts for a large share of Internet traffic. Many companies and organizations use e-mail services to promote their products and services, so it is important to filter out spam messages to save users’ time. Machine learning methods play a vital role in spam detection, but they face the problem of high-dimensional feature vectors. Feature reduction methods are therefore important for obtaining better results from machine learning approaches. In this paper, Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Information Gain (IG) are used for feature reduction. E-mail messages are then classified as spam or ham using seven different classifiers, namely Naive Bayesian, AdaBoost, Random Forest, Support Vector Machine, J48, Bagging, and JRip. A comparative study of these techniques is carried out on the TREC 2007 Spam e-mail Corpus with different feature set sizes.
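As a rough illustration of the Information Gain step named in the abstract, the sketch below ranks terms by how much knowing a term's presence reduces uncertainty about the spam/ham label, then keeps the top k. It is a minimal sketch under stated assumptions, not the authors' implementation: the four-message corpus is made up, and the helper names (`entropy`, `information_gain`, `top_k_terms`) and the binary present/absent term representation are hypothetical choices made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term present / absent' w.r.t. the class."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Class entropy remaining after splitting on the term, weighted by split size.
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy, made-up messages (assumption: NOT taken from the TREC 2007 corpus).
raw = [
    ("win free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
docs = [set(text.split()) for text, _ in raw]
labels = [label for _, label in raw]

selected = top_k_terms(docs, labels, 3)
print(selected)
```

In this toy setting, terms that appear only in spam or only in ham separate the classes perfectly and receive the highest scores. The reduced vocabulary would then define the feature vectors passed to a classifier; PCA and SVD instead project the full term matrix onto a smaller number of derived dimensions.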
