Abstract

To expand their business, companies rely on the big data ecosystem to handle enormous amounts of information. This requires analyzing text data while ensuring data security, and using spam filters to retain authenticated, valuable data. Several text-representation methods are available, such as Word2Vec, bag-of-words, BERT, and term frequency-inverse document frequency (TF-IDF). However, none of these resolves the data scarcity issue, which can leave collected documents with incomplete information. Solving this problem effectively requires a technique that groups documents by subject and approximates the missing values using statistical methods. This study proposes a natural language processing-based technique for spam detection that alters topics using a least-squares model and uses gradient-descent and alternating-least-squares (AMALS) models to estimate missing data through TF-IDF and a uniform distribution. A performance evaluation demonstrates that the proposed technique outperforms the existing industrial TF-IDF model in predicting spam in big data ecosystems, with a reported figure of 98%.
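
As a rough illustration of the idea of estimating missing TF-IDF values with a least-squares factorization, the sketch below builds a small TF-IDF matrix, hides one entry, and fills it in with a standard regularized alternating-least-squares loop. It is not the paper's AMALS model: the toy corpus, the rank, the regularization weight, the smoothed IDF formula, the use of a uniform distribution only for initializing the factors, and the function names `tfidf`, `als_factorize`, and `objective` are all assumptions made for this example. The gradient-descent variant mentioned in the abstract is not shown.

```python
import numpy as np

def tfidf(docs):
    """Build a dense TF-IDF matrix (documents x terms) from tokenized docs."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d:
            tf[i, index[w]] += 1
        tf[i] /= max(len(d), 1)                    # term frequency per document
    df = (tf > 0).sum(axis=0)                      # document frequency per term
    idf = np.log((1 + len(docs)) / (1 + df)) + 1   # smoothed IDF (assumed variant)
    return tf * idf, vocab

def als_factorize(X, observed, rank=2, reg=0.1, iters=20, seed=0):
    """Fit X ~ U @ V.T on observed entries only, alternating ridge solves."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.uniform(0.0, 1.0, (n, rank))           # uniform initialization (assumption)
    V = rng.uniform(0.0, 1.0, (m, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        for i in range(n):                         # update document factors, V fixed
            Vo = V[observed[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ X[i, observed[i]])
        for j in range(m):                         # update term factors, U fixed
            Uo = U[observed[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ X[observed[:, j], j])
    return U, V

def objective(X, observed, U, V, reg):
    """Regularized squared error over the observed entries."""
    resid = (X - U @ V.T)[observed]
    return (resid ** 2).sum() + reg * ((U ** 2).sum() + (V ** 2).sum())

# Toy corpus (hypothetical): two spam-like and two ordinary messages.
docs = [
    ["win", "free", "prize", "now"],
    ["free", "prize", "claim", "now"],
    ["meeting", "agenda", "for", "monday"],
    ["monday", "meeting", "notes"],
]
X, vocab = tfidf(docs)

# Pretend one TF-IDF value was lost during collection.
observed = np.ones(X.shape, dtype=bool)
observed[0, vocab.index("prize")] = False

U, V = als_factorize(X, observed)
X_hat = U @ V.T                                    # includes an estimate for the hidden cell
```

The completed matrix `X_hat` could then be passed to any downstream spam classifier in place of the incomplete TF-IDF features. Each inner update is a closed-form ridge regression, which is why the regularized objective cannot increase from one sweep to the next.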
