Deep learning‐based smishing message identification using regular expression feature generation

Siddhartha Shankar Paul, Vrihas Pathak, Aakanksha Sharaff

doi:10.1111/exsy.13153

Abstract

The increase in undesired SMS messages, termed smishing messages, together with the data imbalance problem, has generated great demand for more reliable anti‐spam filters. State‐of‐the‐art machine learning approaches are being employed to recognize and separate spam messages. Most recent studies classify messages using numerous properties and features of the words, but fail to consider contextual features such as long‐range dependencies between words, which are extremely important in identifying smishing messages. The aim is to develop an intelligent model that distinguishes between smishing messages and ham messages by combining regular expression (Regex), machine learning (ML) and deep learning (DL) models. Regex rules are generated from the dataset's spam messages and used to refine the dataset. The machine learning models are support vector machine (SVM), Multinomial Naive Bayes and Random Forest; the deep learning models are long short‐term memory (LSTM), bidirectional long short‐term memory (Bi‐LSTM), stacked LSTM and stacked Bi‐LSTM. The machine learning and deep learning models are compared on four performance evaluation parameters: accuracy, precision, recall and F1 score. It is observed that deep learning models perform better than machine learning models, and that introducing regular expressions to the dataset increases the efficiency of both the deep learning and the machine learning models.
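As a rough illustration of the two ideas named in the abstract, regex-derived features and the four evaluation parameters, the sketch below flags a message against a handful of regular expressions and computes accuracy, precision, recall and F1 from label lists. The patterns, the names `PATTERNS`, `regex_features` and `evaluate`, and the sample messages are assumptions made for this example; they are not the rules or code used in the paper, which derives its Regex rules from the spam messages in its own dataset.

```python
import re

# Hypothetical patterns for illustration only; the paper generates its own
# rules from the spam messages in its dataset.
PATTERNS = {
    "has_url": re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE),
    "has_phone": re.compile(r"\b\d{10,}\b"),
    "has_currency": re.compile(r"[$£€]\s?\d+"),
    "has_lure_word": re.compile(
        r"\b(?:win|won|free|prize|claim|urgent)\b", re.IGNORECASE
    ),
}


def regex_features(message):
    """Return one binary flag per pattern: 1 if it matches, else 0."""
    return {name: int(bool(p.search(message))) for name, p in PATTERNS.items()}


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    smish = "URGENT! You won a £1000 prize, claim at http://example.com"
    ham = "See you at lunch tomorrow"
    print(regex_features(smish))
    print(regex_features(ham))
    print(evaluate([1, 1, 0, 0], [1, 0, 0, 1]))
```

In a fuller pipeline, flags like these could be appended to the text features fed to a classifier; how the paper actually uses its rules to refine the dataset is described in the full text, not in this sketch.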

Full Text