With the advancement of technology and the widespread use of mobile phones and wireless communication, SMS has become the most popular texting method because of its high response rate, its affordability, and its independence from an internet connection. A survey found that 3.5 billion users, or 80% of active users worldwide, use SMS for communication. SMS, however, has also attracted spammers, resulting in an explosion of spam messages, especially in Asia. Spam messages, which serve purposes such as advertising, adult content, smishing, and fraud, annoy users and cost them time and money. They are a problem for both users and providers, which calls for a mechanism to identify and filter them out. We propose a novel approach, based on complex network theory and supervised machine learning, to classify messages as spam or ham. The proposed approach integrates complex-network-based features with statistical TF-IDF features and grammatical-rule features. In addition, an under-sampling method is employed to cope with the imbalanced data issue. We evaluated the performance of several supervised learners in terms of accuracy, precision, recall, F1-score, and AUC. In our experiments, Random Forest classified spam messages more accurately than statistical methods that extract only TF-IDF features.
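As a rough illustration of two ingredients named above, the following Python sketch shows random under-sampling of the majority class and a few simple statistics of a word co-occurrence network built from a single message. The abstract does not state which network features or which under-sampling scheme are used, so the function names `undersample` and `cooccurrence_features`, the adjacency-based graph construction, and the chosen statistics are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def undersample(messages, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the rarest class.

    Hypothetical helper: plain random under-sampling, assumed for illustration.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for msg, lab in zip(messages, labels):
        by_class[lab].append(msg)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for lab, items in by_class.items():
        for msg in rng.sample(items, target):
            balanced.append((msg, lab))
    rng.shuffle(balanced)
    return balanced


def cooccurrence_features(message):
    """Link adjacent words into an undirected graph and return basic statistics.

    Hypothetical feature set: node count, edge count, and average degree.
    """
    words = message.lower().split()
    edges = set()
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from repeated words
            edges.add(frozenset((a, b)))
    nodes = set(words)
    n = len(nodes)
    avg_degree = 2 * len(edges) / n if n else 0.0
    return {"nodes": n, "edges": len(edges), "avg_degree": avg_degree}


if __name__ == "__main__":
    msgs = ["win cash now", "free prize win now", "see you at lunch",
            "call me later", "are you home", "ok thanks"]
    labs = ["spam", "spam", "ham", "ham", "ham", "ham"]
    for msg, lab in undersample(msgs, labs):
        print(lab, cooccurrence_features(msg))
```

In a full pipeline, feature dictionaries like these would be combined with TF-IDF and grammatical features before being passed to a supervised learner such as Random Forest; that step is omitted here.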