Development of content‐based SMS classification application by using Word2Vec‐based feature extraction

Serkan Ballı, Onur Karasoy

doi:10.1049/iet-sen.2018.5046

Abstract

Although mobile instant messaging applications such as WhatsApp, Messenger and Viber offer phone users benefits such as low cost, ease of use, stability, and group and direct communication, SMS (short message service) is still considered a more reliable, privacy-preserving technology for mobile communication. For this reason, institutions that want to promote products or inform customers through advertising and similar campaigns turn to SMS. However, spam messages sent from unknown sources constitute a serious problem for SMS recipients. In this study, a content-based classification model that uses machine learning to filter out unwanted messages is proposed. The model used in the classification is built from the selected dataset with the Word2Vec word embedding tool. From this model, two new features are derived that measure the distance of a message to spam words and to ham words. The performance of the classification algorithms is compared using these two new features. The random forest method achieved a classification accuracy of 99.64%. This is a higher correct classification rate than that achieved by other studies using the same dataset.
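The abstract does not state how the two distance features are computed, so the following is only a minimal sketch of one plausible reading. It assumes that a message is represented by the mean of its word vectors and that each feature is the cosine distance from that mean to the centroid of a set of spam words or ham words. The tiny `TOY_VECTORS` table, the `SPAM_WORDS` and `HAM_WORDS` lists, and the function names are all hypothetical stand-ins for a trained Word2Vec model and the word lists used in the study.

```python
import math

# Hypothetical 3-dimensional vectors standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "free": [0.90, 0.10, 0.00],
    "win": [0.80, 0.20, 0.10],
    "prize": [0.85, 0.15, 0.05],
    "meeting": [0.10, 0.90, 0.20],
    "lunch": [0.00, 0.80, 0.30],
    "tomorrow": [0.10, 0.70, 0.40],
}
# Assumed word lists; the study derives its own from the dataset.
SPAM_WORDS = ["free", "win", "prize"]
HAM_WORDS = ["meeting", "lunch", "tomorrow"]


def mean_vector(words, vectors):
    """Average the vectors of the known words; return None if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def distance_features(message, vectors=TOY_VECTORS):
    """Return (distance to spam centroid, distance to ham centroid)."""
    msg_vec = mean_vector(message.lower().split(), vectors)
    if msg_vec is None:
        # Assumption: a message with no known words gets a neutral distance.
        return (1.0, 1.0)
    spam_centroid = mean_vector(SPAM_WORDS, vectors)
    ham_centroid = mean_vector(HAM_WORDS, vectors)
    return (
        cosine_distance(msg_vec, spam_centroid),
        cosine_distance(msg_vec, ham_centroid),
    )
```

In a pipeline of this kind, the two values returned by `distance_features` would be added to each message's feature set before training a classifier such as a random forest. The cosine distance and the mean-vector representation are illustrative choices here, not details confirmed by the abstract.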
