Abstract

The use of slang, abusive, and offensive language has become common practice on social media. Although social media companies have censorship policies for slang, abusive, vulgar, and offensive language, this condemnable practice persists because resources and research on automatic abusive language detection for languages other than English are limited. This study proposes USAD (Urdu Slang and Abusive words Detection), a lexicon-based intelligent framework to detect abusive and slang words in Perso-Arabic-scripted Urdu Tweets. Furthermore, because no standard dataset is available, we also design and annotate a dataset of abusive, offensive, and slang words in Perso-Arabic-scripted Urdu as our second significant contribution for future research. The results show that our proposed USAD model correctly classifies 72.6% of Tweets as abusive or nonabusive. Additionally, we identify some key factors that can help researchers improve their abusive language detection models.
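
To make the idea of lexicon-based detection concrete, the following is a minimal sketch and not the authors' USAD implementation. It assumes the simplest possible rule: a Tweet is flagged as abusive if any of its tokens appears in the lexicon. The names `ABUSIVE_LEXICON`, `tokenize`, and `is_abusive` are hypothetical, and the two lexicon entries are mild placeholder words standing in for the real lexicon.

```python
import re

# Hypothetical stand-in lexicon: two mild placeholder words ("bad", "dirty").
# This is NOT the lexicon described in the paper.
ABUSIVE_LEXICON = {"برا", "گندا"}

# \w+ matches runs of Unicode letters and digits, so Urdu punctuation such as
# the full stop "۔" or the comma "،" is dropped. Diacritic marks are not
# matched by \w, so diacritized text would need normalization first
# (not handled in this sketch).
TOKEN_RE = re.compile(r"\w+")


def tokenize(tweet):
    """Split a Tweet into word tokens, discarding punctuation and spaces."""
    return TOKEN_RE.findall(tweet)


def is_abusive(tweet, lexicon=ABUSIVE_LEXICON):
    """Assumed rule: flag the Tweet if at least one token is in the lexicon."""
    return any(token in lexicon for token in tokenize(tweet))


flagged = is_abusive("یہ بہت برا ہے۔")      # contains a placeholder lexicon word
not_flagged = is_abusive("آج موسم اچھا ہے")  # contains no lexicon word
```

An exact-match rule like this is easy to audit, but it misses spelling variants and inflected forms, which is one reason a lexicon-based detector can fall short on recall.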

Highlights

  • The dataset is supplied to the testing module for automatic classification using an abusive lexicon. This section discusses the effectiveness of the proposed Urdu Slang and Abusive Word Detection (USAD) model in the automatic detection of abusive Urdu Tweets

  • We proposed the USAD model for automatic detection of abusive Tweets posted in Perso-Arabic-scripted Urdu

  • We used a lexicon of abusive Urdu words composed of 1250 words and a testing dataset consisting of 1200 manually annotated Tweets (365 abusive and 835 nonabusive). The results show that the proposed USAD model correctly classifies 72.6% of Tweets as abusive or nonabusive, with a precision of 55.21%
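
As a reminder of how the two reported metrics are defined, here is a minimal sketch, assuming binary labels where 1 means abusive and 0 means nonabusive. The function names and the five toy labels are illustrative assumptions; they are not the paper's data and do not reproduce its figures.

```python
def accuracy(y_true, y_pred):
    """Fraction of Tweets whose predicted label equals the annotated label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Fraction of Tweets predicted as `positive` that are truly `positive`."""
    truths_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not truths_of_predicted_pos:
        return 0.0  # convention chosen here when nothing is predicted positive
    hits = sum(t == positive for t in truths_of_predicted_pos)
    return hits / len(truths_of_predicted_pos)


# Toy labels for illustration only (1 = abusive, 0 = nonabusive).
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0]

toy_accuracy = accuracy(y_true, y_pred)    # 3 of 5 labels match
toy_precision = precision(y_true, y_pred)  # 1 of 2 predicted-abusive is correct

# Share of nonabusive Tweets in the test set described above (835 of 1200).
majority_share = 835 / 1200
```

Because 835 of the 1200 test Tweets are nonabusive, a classifier that labels every Tweet nonabusive would already reach about 69.6% accuracy, which is why precision on the abusive class is a useful companion to the accuracy figure.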

Summary

Urdu and Perso-Arabic Script

Urdu is one of the popular languages of the South Asian region and the national and official language of Pakistan [15]. Hindi, which is mutually intelligible with Urdu, is written in Devanagari script [19]. Because of its complex morphological and grammatical structures, diacritics [21], and limited linguistic resources, Urdu is mostly neglected by the research community. In this regard, the first 8-bit encoding standard for Urdu, “Urdu Zabta Takhti (UZT) 1.01,” was developed and accepted by the Government of Pakistan in 2000 [22]. Only a single study is available on offensive language detection in Roman Urdu [5]. Therefore, the detection of abusive and offensive language in Perso-Arabic-scripted Urdu is still an open issue

Related Work
Experimentation Preliminary
Conclusions
Conflicts of Interest