Abstract

This paper investigates the impact of several feature extraction and feature selection approaches on the filtering of short message service (SMS) spam messages in two languages, namely Turkish and English. The feature set of the filtering framework consists of features originating from the bag-of-words (BoW) model together with an ensemble of structural features (SF) specific to the spam problem. The distinctive BoW features are identified using information theoretic feature selection methods. Various combinations of the BoW features and SFs are then fed into widely used pattern classification algorithms to classify SMS messages. The filtering framework is evaluated on both Turkish and English SMS message datasets. For this purpose, the first publicly available Turkish SMS message collection was also compiled as part of the study. Comprehensive experimental analysis on the respective datasets revealed that combinations of BoW features and SFs, rather than BoW features alone, provide better classification performance on both datasets. The effectiveness of the utilized feature selection methods, however, differs slightly between the two languages. DOI: http://dx.doi.org/10.5755/j01.eee.19.5.1829

Highlights

  • In recent years, Short Message Service (SMS) has become one of the most common communication methods due to the rapid increase in the number of mobile phone users worldwide

  • Selection of BoW features was carried out using the chi-square (CHI2) and Gini index (GI) methods, where the number of selected features ranged from 1% to 100% of the entire BoW feature set

  • In the case of Turkish messages, the highest Micro-F1 score was approximately 0.98. This score was obtained using SF2 together with 50% of the BoW features selected by CHI2, applied to a support vector machine (SVM) classifier
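As an illustration of the CHI2-based selection mentioned in the highlights, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenisation, document-level term presence, and a two-class setting; the function names `chi2_scores` and `select_top`, as well as the toy messages, are hypothetical. The Gini index variant is not shown.

```python
from collections import Counter


def chi2_scores(docs, labels, positive="spam"):
    """Chi-square score of every term with respect to the positive class.

    Uses the standard 2x2 contingency table of term presence vs. class.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        terms = set(text.lower().split())  # assumed tokenisation
        (df_pos if y == positive else df_neg).update(terms)

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # positive messages containing the term
        b = df_neg[term]   # negative messages containing the term
        c = n_pos - a      # positive messages without the term
        d = n_neg - b      # negative messages without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top(scores, fraction):
    """Keep the given fraction of terms, highest score first (ties by name)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy data, invented for illustration only.
toy_docs = [
    "win free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me at home now",
]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_scores = chi2_scores(toy_docs, toy_labels)
toy_selected = select_top(toy_scores, 0.25)
```

Sweeping `fraction` from 0.01 to 1.0 would mimic the 1% to 100% range described above. Note that a term concentrated in the legitimate class scores as highly as one concentrated in the spam class, because the statistic measures dependence rather than direction.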


Summary

INTRODUCTION

Short Message Service (SMS) has become one of the most common communication methods due to the rapid increase in the number of mobile phone users worldwide. A framework utilizing content-based filtering and challenge-response was introduced in [6]. Another SMS anti-spam system, combining behavior-based social network analysis and temporal analysis, was presented in [7]. With regard to the abovementioned studies, this paper extensively analyses the combined effects of several feature extraction and feature selection methods on filtering SMS spam messages in two languages, namely Turkish and English. The selected features are combined with the structural features and fed into two distinct pattern classification algorithms, namely k-nearest neighbor and support vector machine, to classify SMS messages as either spam or legitimate.
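The combination of BoW and structural features described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the three structural features (scaled length, digit ratio, uppercase ratio) are hypothetical stand-ins because the actual SF set is not listed in this summary, the vocabulary is assumed to be the output of a prior feature selection step, and only a plain k-nearest neighbor classifier with Euclidean distance is shown, with the SVM omitted.

```python
import math
from collections import Counter


def featurize(text, vocabulary):
    """Concatenate BoW term counts with a few structural features."""
    tokens = Counter(text.lower().split())          # assumed tokenisation
    bow = [float(tokens[term]) for term in vocabulary]
    n = max(len(text), 1)
    # Hypothetical structural features, chosen for illustration only.
    sf = [
        len(text) / 160.0,                          # length relative to one SMS
        sum(ch.isdigit() for ch in text) / n,       # share of digit characters
        sum(ch.isupper() for ch in text) / n,       # share of uppercase characters
    ]
    return bow + sf


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    neighbours = sorted(
        (math.dist(v, x), y) for v, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy training set and vocabulary, invented for illustration only.
vocabulary = ["free", "win", "call"]
train = [
    ("WIN a FREE prize, text 80082", "spam"),
    ("FREE entry, call 09061701461 now", "spam"),
    ("are you coming to lunch", "legitimate"),
    ("call me when you get home", "legitimate"),
]
train_x = [featurize(text, vocabulary) for text, _ in train]
train_y = [label for _, label in train]

pred_spam = knn_predict(train_x, train_y, featurize("FREE cash, WIN now 12345", vocabulary))
pred_legit = knn_predict(train_x, train_y, featurize("see you at home", vocabulary))
```

The structural features are scaled to roughly the same range as the term counts so that no single dimension dominates the distance; a different scaling would be an equally valid design choice.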

DATASETS
FEATURE EXTRACTION
FEATURE SELECTION
CLASSIFICATION
EXPERIMENTAL WORK
Part B
Findings
CONCLUSIONS
