Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

Ijazul Haq, Peng Tang, Jie Guo, Weidong Qiu

doi:10.7717/peerj-cs.1617

Open Access

https://doi.org/10.7717/peerj-cs.1617

Journal: PeerJ Computer Science
Publication Date: Oct 18, 2023
Citations: 3
License type: CC BY 4.0

Affiliation: Shanghai Jiao Tong University

Abstract

Social media platforms have become inundated with offensive language. This issue must be addressed to support the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, it remains an open area of research for the low-resource Pashto language. This study aims to develop an AI model for the automatic detection of offensive textual content in Pashto. To achieve this goal, we developed a benchmark dataset called the Pashto Offensive Language Dataset (POLD), which comprises tweets collected from Twitter and manually classified into two categories: “offensive” and “not offensive”. To discriminate between these two categories, we investigated classic neural network classifiers, including CNNs and RNNs, using static word embeddings (Word2Vec, fastText, and GloVe) as features. Furthermore, we examined two transfer learning approaches. In the first approach, we fine-tuned the pre-trained multilingual language model XLM-R on POLD. In the second approach, we trained a monolingual BERT model for Pashto from scratch on a custom-developed text corpus, and then fine-tuned this Pashto BERT in the same way as XLM-R. The performance of all the deep learning and transfer learning models was evaluated on POLD. The experimental results demonstrate that our pre-trained Pashto BERT model outperforms the other models, achieving an F1-score of 94.34% and an accuracy of 94.77%.
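The two reported metrics can be illustrated with a minimal sketch for a binary "offensive" / "not offensive" task. Everything in it is an assumption made for illustration: the function names, the 1/0 label encoding, and the toy labels are hypothetical and are not taken from POLD or from the authors' code. The abstract does not state which F1 averaging was used, so the sketch computes both the per-class F1 and the macro average.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of examples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical toy data: 1 = "offensive", 0 = "not offensive".
Y_TRUE = [1, 1, 1, 0, 0, 0, 0, 1]
Y_PRED = [1, 1, 0, 0, 0, 1, 0, 1]

if __name__ == "__main__":
    print("accuracy:", accuracy(Y_TRUE, Y_PRED))
    print("F1 (offensive):", f1_for_class(Y_TRUE, Y_PRED, 1))
    print("macro F1:", macro_f1(Y_TRUE, Y_PRED))
```

On a balanced dataset accuracy and F1 tend to be close, as they are in the reported results; on a skewed one, F1 is usually the more informative of the two.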

Similar Papers

Natural language based analysis of SQuAD: An analytical approach for BERT
Zekeriya Anil Guven ... Murat Osman Unalir
Expert Systems with Applications | VOL. 195
31 Jan 2022

Pretrained domain-specific language model for natural language processing tasks in the AEC domain
Zhe Zheng ... Jia-Rui Lin
Computers in Industry | VOL. 142
21 Jun 2022

Fine-tuning Pre-trained Language Models to Detect In-game Trash Talks
Daniel Fesalbon ... Arvin De La Cruz
International Journal For Multidisciplinary Research | VOL. 6
13 Mar 2024

GreenPLM: Cross-Lingual Transfer of Monolingual Pre-Trained Language Models at Almost No Cost
Qingcheng Zeng ... Jie Yang
01 Aug 2023
