Quick online spam classification method based on active and incremental learning

Lizhou Feng, Wanli Zuo, Youwei Wang

doi:10.3233/ifs-151707

Abstract

To improve classification speed without seriously sacrificing email classification accuracy, a novel online spam classification method is proposed. First, the concept of term frequency based interest sets is introduced, and emails are classified by combining these interest sets with a Naïve Bayes classifier. Second, drawing on active learning theory, a novel boundary density based method for evaluating email classification certainty is proposed; combined with the user's interests, it selects emails and recommends them to the user for labeling. Finally, following incremental learning theory, the emails labeled by the user and the emails classified with the highest probabilities are used for retraining. In the experiments, Support Vector Machine (SVM), Naïve Bayesian (NB) and K-Nearest Neighbors (KNN) classifiers are used on two corpora: Trec2007 and Enron-spam. Compared with six typical incremental learning methods based on active learning, the proposed method greatly reduces the time consumed by email classification while maintaining accuracy. Moreover, the proposed method places only a very small sample labeling burden on users, which makes it well suited to online applications.
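The classify, query and retrain loop outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the term frequency based interest sets are omitted, and a plain posterior threshold stands in for the boundary density based certainty measure. The names `IncrementalNB`, `process`, `ask_user`, `low` and `high` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


class IncrementalNB:
    """Multinomial Naive Bayes whose counts are updated one email at a time."""

    def __init__(self):
        self.class_docs = Counter()  # number of emails seen per label
        self.term_counts = {"spam": Counter(), "ham": Counter()}
        self.vocab = set()

    def update(self, tokens, label):
        # Incremental learning step: only the counts change,
        # so no full retraining over past emails is needed.
        self.class_docs[label] += 1
        self.term_counts[label].update(tokens)
        self.vocab.update(tokens)

    def spam_probability(self, tokens):
        total_docs = sum(self.class_docs.values())
        log_scores = {}
        for label in ("spam", "ham"):
            counts = self.term_counts[label]
            # Laplace smoothing; the extra +1 reserves mass for unseen terms.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log((self.class_docs[label] + 1) / (total_docs + 2))
            for term in tokens:
                score += math.log((counts[term] + 1) / denom)
            log_scores[label] = score
        # Normalize the two log scores into a posterior for "spam".
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


def process(model, tokens, ask_user, low=0.2, high=0.8):
    """Classify one email, query the user only when the model is uncertain."""
    p = model.spam_probability(tokens)
    if low < p < high:
        # Assumption: a simple posterior band replaces the paper's
        # boundary density based certainty evaluation.
        label, source = ask_user(tokens), "user"
    else:
        label, source = ("spam" if p >= high else "ham"), "model"
    # Both user-labeled and confidently classified emails feed retraining.
    model.update(tokens, label)
    return label, source
```

In this sketch, narrowing the band between `low` and `high` reduces how often the user is asked for a label, at the cost of retraining on more machine-assigned labels.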
