Abstract

Short message service (SMS) is one of the most widely used communication services in daily life. However, it is being misused by spammers. Rule-based systems (RBS) and content-based filtering (CBF) techniques have been developed to filter out spam messages. New rules can easily be added to an RBS, but throughput usually falls as the number of rules grows. CBF techniques based on the bag-of-words (BoW) assumption use machine learning methods to extract features from the SMS message body according to word frequency and distribution, but they ignore word order. Striving to improve performance, researchers have developed hybrid models that make the algorithms ever more complex. In addition, because such models require frequent, time-consuming training and deployment, the anti-spam industry still relies mainly on rule-based systems, whose throughput issue remains unsolved. In our previous study we proposed a discrete Hidden Markov Model (HMM) to address these issues, and the HMM method achieved performance comparable to that of deep learning methods. To further improve the HMM method, we propose a new approach that weights and labels the words in an SMS message to form the observation sequence for the HMM. The weighted-feature-enhanced HMM achieves higher accuracy and much faster training and filtering, which meets the requirements of the anti-spam industry. Its performance is compared with that of other machine learning methods on the same open repository data set maintained by the University of California, Irvine (UCI). Experimental results show that the weighted-feature-enhanced HMM outperforms the long short-term memory (LSTM) model and comes close to the convolutional neural network (CNN) in classification accuracy. In addition, a Chinese SMS data set is used to further validate filtering accuracy and filtering speed.
