Enhancing Big Social Media Data Quality for Use in Short-Text Topic Modeling

Belal Abdullah Hezam Murshed, Mufeed Ahmed Naji Saif, Sumaia Mohammed Al-Ghuribi, Suresha Mallappa, Jemal Abawajy, Fahd A Ghanem

doi:10.1109/access.2022.3211396

Open Access

https://doi.org/10.1109/access.2022.3211396

Abstract

With the emergence of microblogging platforms and social media applications, large amounts of user-generated data in the form of comments, reviews, and brief text messages are produced every day. Microblog data is typically of poor quality; hence, improving its quality is a significant scientific and practical challenge. Despite the relevance of the problem, little work has been done so far, especially on microblog data quality for Short-Text Topic Modelling (STTM). This paper addresses this problem and proposes an approach called the social media data cleansing model (SMDCM) to improve data quality for STTM. We evaluate SMDCM using six topic modelling methods, namely Latent Dirichlet Allocation (LDA), Word-Network Topic Model (WNTM), Pseudo-document-based Topic Modelling (PTM), Biterm Topic Model (BTM), Global and Local word embedding-based Topic Modeling (GLTM), and Fuzzy Topic Modelling (FTM). We used the Real-world Cyberbullying Twitter (RW-CB-Twitter) and Cyberbullying Mendeley (CB-MNDLY) datasets in the evaluation. The results show that, when the SMDCM techniques are applied, GLTM and WNTM outperform the other STTM models, achieving optimum topic coherence and high accuracy values on the RW-CB-Twitter and CB-MNDLY datasets.
