Cost-Based Heterogeneous Learning Framework for Real-Time Spam Detection in Social Networks With Expert Decisions

Jaeun Choi, Chunmi Jeon

doi:10.1109/access.2021.3098799

Abstract

With the widespread use of social networks, spam messages targeting them have become a major issue. Spam detection methods can be broadly divided into expert-based and machine learning-based methods. When experts participate in spam detection, the detection accuracy is fairly high, but the process is time-consuming and expensive. Conversely, machine learning methods have the advantage of automation, but their accuracy is relatively low. This paper proposes a spam-detection framework that combines the advantages of both methods. To reduce the workload of the experts, all messages are first analyzed by a primary machine learning filter; those determined to be normal messages are allowed through, whereas suspicious messages are flagged. The flagged messages are subsequently analyzed by an expert to enhance the overall system accuracy. In the filtering process, cost-based machine learning is used to prevent the fatal error of misidentifying a spam message as a normal message. In addition, to keep pace with continuously evolving spam trends, the framework incorporates a module that periodically adds the expert-diagnosis results to the training dataset. Experiments conducted on an imbalanced dataset, in which the ratio of spam tweets to normal tweets is similar to that observed in real life, indicate that the proposed framework has a spam-detection rate of approximately 92.8%, which is higher than that of the conventional machine learning technique. Furthermore, the proposed framework delivered stable, high performance even in an environment where social network messages changed continuously, unlike the conventional technique, which exhibited large performance deviations.
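The flow described in the abstract (cost-based filtering, expert review of flagged messages, and feeding expert decisions back into the training data) can be illustrated with a minimal sketch. This is not the authors' implementation: the cost values, the `HybridFilter` class, and the `toy_score` and `toy_expert` stand-ins are all hypothetical, and the decision rule shown is the standard expected-cost threshold rather than the specific cost-based learner used in the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical costs (not from the paper): letting spam through (false negative)
# is assumed to cost far more than sending a normal message to an expert.
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0


def flag_threshold(cost_fn: float, cost_fp: float) -> float:
    """Spam probability above which flagging has lower expected cost than passing.

    Passing costs p * cost_fn on average; flagging costs (1 - p) * cost_fp.
    Flagging is cheaper when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


@dataclass
class HybridFilter:
    score: Callable[[str], float]   # stand-in for the ML filter: estimated P(spam)
    expert: Callable[[str], bool]   # stand-in for the expert: True means spam
    threshold: float = flag_threshold(COST_FALSE_NEGATIVE, COST_FALSE_POSITIVE)
    # Expert-labeled messages collected for the next periodic retraining.
    new_training_rows: List[Tuple[str, bool]] = field(default_factory=list)

    def process(self, message: str) -> str:
        # Stage 1: messages the filter considers safe are let through unreviewed.
        if self.score(message) <= self.threshold:
            return "passed"
        # Stage 2: flagged messages get an expert decision, which is recorded.
        is_spam = self.expert(message)
        self.new_training_rows.append((message, is_spam))
        return "spam" if is_spam else "normal"


SUSPICIOUS_WORDS = {"free", "win", "click"}  # toy vocabulary, illustration only


def toy_score(message: str) -> float:
    """Toy scorer: fraction of words that appear in SUSPICIOUS_WORDS."""
    words = message.lower().split()
    if not words:
        return 0.0
    return sum(word in SUSPICIOUS_WORDS for word in words) / len(words)


def toy_expert(message: str) -> bool:
    """Toy expert: treats any message containing 'click' as spam."""
    return "click" in message.lower()


if __name__ == "__main__":
    hybrid = HybridFilter(score=toy_score, expert=toy_expert)
    for text in ["see you at lunch", "win a free phone click here",
                 "free seats at lunch today"]:
        print(text, "->", hybrid.process(text))
```

With a false-negative cost ten times the false-positive cost, the threshold falls to 1/11, so the filter flags even mildly suspicious messages. That trades extra expert workload for fewer missed spam messages, which is the trade-off the abstract describes.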

Highlights

The number of Internet users globally is estimated to be approximately 4.9 billion, which is approximately 63% of the global population of 7.7 billion [1].
The framework was verified on a real dataset, imbalanced so that the ratio of spam tweets to normal tweets is similar to the ratio observed in practice.
This paper proposed a sophisticated framework in which experts and machine learning algorithms collaborate to detect spam tweets effectively.

Summary

Introduction

The number of Internet users globally is estimated to be approximately 4.9 billion, which is approximately 63% of the global population of 7.7 billion [1]. Social networks, which allow users to communicate anytime and anywhere, are becoming a part of everyday life for many of the 3.5 billion smartphone users worldwide [2]. According to Visual Capitalist, a market research firm in the US, Facebook, the social network with the largest number of users in 2020, has as many as 2.6 billion monthly active users. Instagram and Twitter have approximately 1 billion and 0.3 billion monthly active users, respectively [3]. It has been found that one out of every 21 tweets on Twitter can be categorized as spam, and automated bots account for approximately 15% of Twitter users [7].

Methods

Results

Conclusion