Optimization for overfitting problems in spam email classification based on parameter adjusting

Yusi Wei, Douhao Ma, Juncheng Dong, Mahmoud Alshawabkeh

doi:10.1117/12.2626447

Abstract

Spam email classification helps filter out unwanted messages. Many machine learning methods have been proposed for this task. However, how to tune the key parameters of these methods for spam email classification has not been sufficiently explored. In this paper, we therefore tune the key parameters of three classical machine learning methods, SVM, Random Forest, and Logistic Regression, on a specific spam email classification task, and run extensive tests to evaluate the effect of tuning on each classifier. The results show that for Logistic Regression, a larger C value yields higher accuracy, but if C is too large the model overfits. For Random Forest, both the number and the depth of trees influence accuracy: increasing the number of trees raises accuracy up to a limit, whereas if the trees are too deep the model overfits even though its accuracy is high. For SVM, the linear kernel function performs best on the spam email classification task. Our results show how classifier parameters can be tuned for a specific classification task.
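As a rough illustration of the C-value trade-off described in the abstract, the sketch below sweeps C for an L2-regularized logistic regression and reports training and test accuracy side by side. It is not the authors' code: the data are synthetic (one informative feature plus noise, standing in for email features), the model is a small pure-Python gradient-descent implementation, and all names (make_data, fit_logreg, sweep_C) and the chosen C values are assumptions made for the example. C is treated as the inverse of the regularization strength, as is conventional.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n, n_features, rng):
    # Synthetic stand-in for email features: only feature 0 carries signal,
    # the remaining features are pure noise that a weakly regularized model can fit.
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 1.0 if label == 1 else -1.0
        X.append(row)
        y.append(label)
    return X, y


def fit_logreg(X, y, C, epochs=200, lr=0.1):
    # Batch gradient descent on: mean log-loss + ||w||^2 / (2 * C * n).
    # Larger C means a weaker L2 penalty on the weights.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * (gwj / n + wj / (C * n)) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, X, y):
    # Fraction of examples whose predicted class (score >= 0) matches the label.
    hits = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hits += int((score >= 0.0) == (yi == 1))
    return hits / len(y)


def sweep_C(C_values, seed=0):
    # Train once per C value on a small training set; evaluate on a larger test set.
    rng = random.Random(seed)
    X_train, y_train = make_data(60, 20, rng)
    X_test, y_test = make_data(200, 20, rng)
    results = []
    for C in C_values:
        w, b = fit_logreg(X_train, y_train, C)
        results.append(
            (C, accuracy(w, b, X_train, y_train), accuracy(w, b, X_test, y_test))
        )
    return results


results = sweep_C([0.01, 1.0, 100.0])
for C, train_acc, test_acc in results:
    gap = train_acc - test_acc
    print(f"C={C:<6} train={train_acc:.3f} test={test_acc:.3f} gap={gap:+.3f}")
```

When reading the output, the quantity to watch is the gap between training and test accuracy: a training accuracy that keeps rising with C while test accuracy stalls or drops is the usual sign of overfitting. The same train-versus-test comparison could be applied to the number and depth of trees in a Random Forest.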

Full Text