Malicious Text Identification: Deep Learning from Public Comments and Emails

Asma Baccouche, Daniel Sierra-Sosa, Adel Elmaghraby, Sadaf Ahmed

doi:10.3390/info11060312


Open Access

https://doi.org/10.3390/info11060312


Journal: Information
Publication Date: Jun 10, 2020
Citations: 28
License type: CC BY 4.0

Affiliation: University of Louisville

Abstract

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, because these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying text once it has been preprocessed. In particular, Long Short-Term Memory (LSTM) networks perform well on binary and multi-label text classification problems. This paper presents an approach that merges two different data sources, one intended for Spam classification in social media posts and the other for Fraud classification in emails. We designed a multi-label LSTM model and trained it on the joint dataset, which includes the texts containing bigrams common to both independent datasets. The experimental results show that the proposed model is capable of identifying malicious text regardless of its source. The LSTM model trained on the merged dataset outperforms the models trained independently on each dataset.
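The merging step described in the abstract can be illustrated with a minimal sketch. It assumes that "common bigrams" means word bigrams occurring in both datasets, and that a text is kept when it contains at least one of them; the abstract does not spell out either detail. The function names, the tokenizer, the label encoding, and the toy data below are hypothetical and are not taken from the paper.

```python
import re
from typing import List, Set, Tuple

Bigram = Tuple[str, str]
Example = Tuple[str, int]  # (text, label); label 1 = malicious, 0 = legitimate (assumed encoding)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bigrams(text: str) -> Set[Bigram]:
    """Return the set of adjacent word pairs in a text."""
    tokens = tokenize(text)
    return set(zip(tokens, tokens[1:]))


def common_bigrams(corpus_a: List[Example], corpus_b: List[Example]) -> Set[Bigram]:
    """Bigrams that occur in at least one text of each corpus."""
    grams_a = set().union(*(bigrams(text) for text, _ in corpus_a))
    grams_b = set().union(*(bigrams(text) for text, _ in corpus_b))
    return grams_a & grams_b


def merge_on_common_bigrams(
    corpus_a: List[Example], corpus_b: List[Example]
) -> List[Tuple[str, int, str]]:
    """Keep texts containing a shared bigram and tag each with its source."""
    shared = common_bigrams(corpus_a, corpus_b)
    merged = []
    for source, corpus in (("comments", corpus_a), ("emails", corpus_b)):
        for text, label in corpus:
            if bigrams(text) & shared:
                merged.append((text, label, source))
    return merged


# Toy data, invented for illustration only.
comments = [
    ("Check out my channel for free money", 1),
    ("Great song, love it", 0),
]
emails = [
    ("I need your help to transfer free money abroad", 1),
    ("Meeting moved to Friday", 0),
]

shared = common_bigrams(comments, emails)
merged = merge_on_common_bigrams(comments, emails)
```

In this toy case the only shared bigram is ("free", "money"), so the merged set keeps one comment and one email. A classifier trained on such a merged set sees vocabulary that both sources have in common, which is the intuition behind training one model for both domains.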

Highlights

Spam is a growing internet problem that affects social networks and websites [1,2]. Replying with out-of-context comments on social media is, in general, a sign of an attempt to induce users to open malicious links or to disturb the reader with marketing.
We have focused on identifying YouTube spam comments and Nigerian fraudulent emails by designing a binary text classification model based on the Long Short-Term Memory (LSTM) architecture with pre-trained Word2vec word embeddings.
We present a two-part system based on LSTM neural network models for text classification.

Summary

Introduction

Spam is a growing internet problem that affects social networks and websites [1,2]. Replying with out-of-context comments on social media is, in general, a sign of an attempt to induce users to open malicious links or to disturb the reader with marketing. Phishing for information was initially used for marketing, but it has degenerated into harmful interactions that expose users to serious security threats through emails, comments, blogs, and messages [3]. Detecting spam serves several purposes, including improving security and creating a better user experience on communication platforms [4]. Phishing is common in spam and fraudulent communications, which appear in emails, social media, and video streaming services, among others. Filtering these malicious messages can be framed as a binary text classification task that determines whether a text is harmful or legitimate. We highlight advances in information security, text classification, and neural networks, and their applications in malicious text filtering and multi-domain learning. Several mechanisms have been introduced to analyze the relevance of information before it is considered for important decisions in various domains [21].

Methods

Results

Conclusion