Abstract

Misinformation can profoundly damage the reputation of an entity, and curbing its spread has become a critical concern across many applications. Social media, often a primary source of information, can significantly influence individuals’ perspectives through content from less credible sources. Machine-learning (ML) algorithms enable automated, large-scale analysis of textual content, allowing extensive datasets to be processed rapidly and efficiently for informed decision-making. Because the performance of ML models depends heavily on the size of the training data, many studies have proposed approaches to the problem of limited dataset size. Data augmentation (DA) is one such strategy: it aims to improve ML model performance by increasing the amount of training data, generating new instances through transformations of the original ones. While many DA techniques have been investigated for languages such as English, where they improve classification performance on the augmented dataset compared with the original one, studies on Arabic remain scarce owing to the language’s unique characteristics. This paper introduces a novel two-stage framework for the automated identification of misinformation in Arabic textual content. The first stage identifies the best feature representation before feeding it to the ML model; diverse representations of tweet content are explored, including N-grams, content-based features, and source-based features. The second stage investigates the effect of DA through back-translation applied to the original training data. Back-translation translates sentences from the target language (in this case, Arabic) into another language and then back into Arabic; the round trip introduces textual variation that yields new training examples. The study uses support vector machine (SVM), naive Bayes, logistic regression (LR), and random forest (RF) classifiers as baselines. In addition, the pre-trained AraBERT transformer language model is used to relate each instance’s label to the feature representation of its input. Experimental results show that misinformation detection combined with data augmentation improves accuracy by a noteworthy margin of 5 to 12% over the baseline machine-learning algorithms and pre-trained models. Notably, the N-grams representation outperforms traditional state-of-the-art feature representations in terms of accuracy, recall, precision, and F-measure, suggesting a promising avenue for improving misinformation detection in Arabic text analysis.
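
To make the two stages concrete, the sketch below illustrates the general idea in Python. It is a minimal, hypothetical example, not the paper's implementation: it assumes the Helsinki-NLP MarianMT checkpoints (opus-mt-ar-en and opus-mt-en-ar) as the translation system, English as the pivot language, and TF-IDF word uni/bi-grams with a linear SVM as the baseline classifier; none of these specific choices are stated in the abstract.

```python
# Hypothetical sketch of the two-stage idea described in the abstract.
# Assumptions (not from the paper): MarianMT Arabic<->English checkpoints
# serve as the translation system, English is the pivot language, and
# TF-IDF word n-grams feed a linear SVM baseline.
from transformers import MarianMTModel, MarianTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def load_translator(model_name):
    """Load a MarianMT checkpoint and its tokenizer."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model


def translate(texts, tokenizer, model):
    """Translate a list of sentences with a MarianMT model."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def back_translate(arabic_texts):
    """Arabic -> English -> Arabic; the round trip yields paraphrased copies."""
    ar_en_tok, ar_en = load_translator("Helsinki-NLP/opus-mt-ar-en")
    en_ar_tok, en_ar = load_translator("Helsinki-NLP/opus-mt-en-ar")
    english = translate(arabic_texts, ar_en_tok, ar_en)
    return translate(english, en_ar_tok, en_ar)


def train_ngram_svm(train_texts, train_labels):
    """Stage 1 baseline: word uni/bi-gram TF-IDF features + linear SVM."""
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("svm", LinearSVC()),
    ])
    clf.fit(train_texts, train_labels)
    return clf


# Stage 2: augment the training set with back-translated copies
# (each copy inherits the label of its original tweet), then retrain:
#   augmented_texts = train_texts + back_translate(train_texts)
#   augmented_labels = train_labels + train_labels
#   model = train_ngram_svm(augmented_texts, augmented_labels)
```

In this sketch the augmented examples simply double the training set while keeping the original labels, which is the usual way back-translation is applied for text classification.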
