Abstract

The Naive Bayes (NB) algorithm is popular for spam email classification because of its fast training, simple techniques, and high accuracy. One of the many studies improving the NB algorithm proposed the AWF-NB algorithm, which we refer to by that name throughout this paper for brevity. The AWF-NB algorithm addresses the NB assumption that a word is equally important in every class, which is not always the case. To solve this problem, the AWF-NB algorithm sharply reduces the importance of a word in the class where its importance is lower. However, this reduction lowers the classification accuracy when the importance of a word differs only slightly among the classes. Therefore, the goal of this research is to improve the AWF-NB algorithm by reducing the importance of words based on their entropy: we compute the entropy of each word to decide whether its importance should be reduced. Experimental results on ten spam email datasets from the Kaggle website indicate that the proposed RIWE-NB algorithm remarkably increases the classification accuracy over both the NB algorithm and the AWF-NB algorithm on the majority of datasets while preserving the execution time.
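The entropy-based criterion described above can be sketched as follows. This is a minimal illustration, not the paper's actual RIWE-NB formulation: the threshold value, the down-weighting factor, and the function names are assumptions chosen for clarity. The idea is that a word spread almost evenly across classes has near-maximal entropy and therefore carries little class information, so its importance can be reduced.

```python
import math

def word_entropy(class_counts):
    """Shannon entropy (in bits) of a word's distribution over classes.

    class_counts: occurrences of the word in each class, e.g. [spam, ham].
    """
    total = sum(class_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def importance_weight(class_counts, threshold=0.9, reduced=0.5):
    """Illustrative decision rule (hypothetical parameters): down-weight a
    word whose entropy is close to the maximum possible for the number of
    classes, i.e. a word that barely discriminates between classes.
    """
    h = word_entropy(class_counts)
    h_max = math.log2(len(class_counts))
    return reduced if h >= threshold * h_max else 1.0
```

For example, a word appearing 5 times in spam and 5 times in ham has maximal entropy (1 bit for two classes) and would be down-weighted, while a word appearing 10 times in spam and once in ham keeps full weight.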
