USAD: An Intelligent System for Slang and Abusive Text Detection in PERSO-Arabic-Scripted Urdu

Nauman Ul Haq, Ahmad Almogren, Bushra Shafi, Bashir Hayat, Arshad Ahmad, Rafiullah Khan, Mohib Ullah

doi:10.1155/2020/6684995

Abstract

The use of slang, abusive, and offensive language has become common practice on social media. Although social media companies have censorship policies for slang, abusive, vulgar, and offensive language, the practice persists because resources and research on automatic abusive language detection for languages other than English remain limited. This study proposes USAD (Urdu Slang and Abusive words Detection), a lexicon-based intelligent framework to detect abusive and slang words in Perso-Arabic-scripted Urdu Tweets. Furthermore, because no standard dataset is available, we also design and annotate a dataset of abusive, offensive, and slang words in Perso-Arabic-scripted Urdu as our second significant contribution for future research. The results show that the proposed USAD model correctly classifies 72.6% of Tweets as abusive or nonabusive. Additionally, we identify some key factors that can help researchers improve their abusive language detection models.

Highlights

The dataset is supplied to the testing module for the automatic classification of the data using an abusive lexicon. This section discusses the effectiveness of the proposed Urdu Slang and Abusive Word Detection (USAD) model in the automatic detection of abusive Urdu Tweets.
We proposed the USAD model for the automatic detection of abusive Tweets posted in Perso-Arabic-scripted Urdu.
We used a lexicon of 1250 abusive Urdu words and a testing dataset of 1200 manually annotated Tweets (365 abusive and 835 nonabusive). The results show that the proposed USAD model correctly classifies 72.6% of Tweets as abusive or nonabusive, with a precision of 55.21%.
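
The highlights describe a lexicon lookup followed by an accuracy and precision evaluation. A minimal sketch of that idea in Python follows. It is not the authors' implementation: the whitespace tokenizer, the function names `is_abusive` and `evaluate`, and the placeholder lexicon and Tweets are all assumptions made for illustration.

```python
from typing import Iterable


def is_abusive(tweet: str, lexicon: set[str]) -> bool:
    """Flag a tweet if any whitespace-separated token appears in the lexicon.

    Assumption: plain whitespace tokenization; the source does not describe
    the tokenizer used by USAD.
    """
    return any(token in lexicon for token in tweet.split())


def evaluate(predictions: Iterable[bool], labels: Iterable[bool]) -> dict[str, float]:
    """Compute accuracy and precision, treating 'abusive' as the positive class."""
    tp = fp = tn = fn = 0
    for pred, gold in zip(predictions, labels):
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif not pred and gold:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical placeholder data, not taken from the USAD lexicon or dataset.
lexicon = {"BADWORD1", "BADWORD2"}
tweets = [
    "hello BADWORD1 there",
    "a friendly message",
    "BADWORD2",
    "nothing to see",
]
labels = [True, False, False, False]  # manual annotations (True = abusive)

predictions = [is_abusive(t, lexicon) for t in tweets]
metrics = evaluate(predictions, labels)
print(metrics)
```

In this toy run the third Tweet contains a lexicon word but is annotated as nonabusive, so it counts as a false positive. That is the kind of error that lowers precision for a pure lexicon match.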

Summary

Urdu and Perso-Arabic Script

Urdu is one of the popular languages of the South Asian region and Pakistan's national and official language [15]. Hindi, which is mutually intelligible with Urdu, is written in Devanagari script [19]. Due to complex morphological and grammatical structures, diacritics [21], and limited linguistic resources, the Urdu language is mostly neglected by the research community. In this regard, the first ever 8-bit encoding standard for Urdu, "Urdu Zabta Takhti (UZT) 1.01," was developed and accepted by the Government of Pakistan in 2000 [22]. Only a single study is available for offensive language detection in Roman Urdu [5]. Therefore, the detection of abusive and offensive language in Perso-Arabic-scripted Urdu is still an open issue.
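
Diacritics are named above as one source of difficulty: the same word may be written with or without vowel marks, which would defeat an exact lexicon match. The sketch below shows one possible normalization step that removes the common Arabic-script vowel marks (U+064B to U+0652) before lookup. This summary does not say how USAD handles diacritics, so the function name `strip_diacritics` and the choice of code point range are assumptions.

```python
# Assumption: only the common short-vowel and related marks, U+064B (fathatan)
# through U+0652 (sukun), are removed. A real pipeline may need a different set.
HARAKAT = {code_point: None for code_point in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return text with the marks listed in HARAKAT deleted."""
    return text.translate(HARAKAT)


# The word "Urdu" written with two damma marks (U+064F), given as escapes.
with_marks = "\u0627\u064F\u0631\u062F\u064F\u0648"
without_marks = strip_diacritics(with_marks)
print(without_marks == "\u0627\u0631\u062F\u0648")
```

The explicit range is used instead of stripping every combining mark so that base letters are never altered by the normalization.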

Related Work

Experimentation Preliminary

Conclusions

Conflicts of Interest


Journal: Complexity	Publication Date: Nov 30, 2020
Citations: 12	License type: CC BY 4.0


