Scarcity-aware spam detection technique for big data ecosystem

Chinmay Chakraborty, Woo Hyun Park, Dong Ryeol Shin, Nawab Muhammad Faseeh Qureshi, Isma Farah Siddiqui

doi:10.1016/j.patrec.2022.03.021

Abstract

To expand their business, companies use the big data ecosystem to handle enormous amounts of information. To do so, they must analyze text data while ensuring data security, using spam filters to retain authenticated and valuable data. Several methods are available, such as Word2Vec, bag-of-words, BERT, and term frequency-inverse document frequency (TF-IDF). However, none of these resolves the data scarcity issue, which may leave collected documents with incomplete information. Solving this problem effectively requires a technique that groups documents by subject and approximates the missing information using statistical methods. This study proposes a natural language processing (NLP)-based technique for spam detection that alters topics using a least-squares model and uses gradient-descent and altering-least-squares (AMALS) models to estimate missing data through TF-IDF and a uniform distribution. A performance evaluation demonstrates that the proposed technique outperforms the existing industrial TF-IDF model in predicting spam in big data ecosystems, with a reported figure of 98%.
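The general idea of estimating missing TF-IDF entries can be illustrated with a minimal sketch. This is not the authors' AMALS implementation. It assumes a plain low-rank factorization fitted by stochastic gradient descent on the observed cells only, factors initialized from a uniform distribution, and the common smoothed IDF variant. The function names `tf_idf` and `factorize`, the toy documents, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import math
import random


def tf_idf(docs):
    """Return (vocab, matrix); matrix[d][t] is the TF-IDF weight of term t in doc d."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    matrix = []
    for toks in tokenized:
        row = []
        for t in vocab:
            tf = toks.count(t) / len(toks)
            idf = math.log((1 + n_docs) / (1 + df[t])) + 1  # smoothed IDF (assumption)
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix


def factorize(matrix, observed, rank=2, lr=0.1, reg=0.01, epochs=2000, seed=0):
    """Fit matrix ~ P Q^T on the observed (row, col) cells; return the full reconstruction."""
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    # Factors start from a uniform distribution (illustrative choice).
    P = [[rng.uniform(0, 0.1) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0, 0.1) for _ in range(rank)] for _ in range(m)]
    for _ in range(epochs):
        for i, j in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = matrix[i][j] - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on the regularized squared error of this cell.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return [[sum(P[i][k] * Q[j][k] for k in range(rank)) for j in range(m)]
            for i in range(n)]


# Usage: hide one cell to mimic scarce data, then estimate it from the rest.
docs = [
    "win free prize now",
    "free prize claim now",
    "meeting agenda for monday",
    "monday meeting notes",
]
vocab, X = tf_idf(docs)
hidden = (0, vocab.index("free"))
observed = [(i, j) for i in range(len(X)) for j in range(len(vocab)) if (i, j) != hidden]
X_hat = factorize(X, observed)
print("true:", X[hidden[0]][hidden[1]], "estimated:", X_hat[hidden[0]][hidden[1]])
```

An alternating-least-squares variant would solve for the rows of `P` in closed form with `Q` held fixed, and then the reverse, instead of taking gradient steps. The sketch keeps to gradient descent so that it stays short and free of dependencies.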

Full Text