Automatic Cyberbullying detection has remained very challenging task since social media content and conversations are usually posted in unstructured free-text form leaving behind the language norms. The major concern and gap in formulating cyberbullying detection strategies is scarcity of available linguistic resources typically for newly evolved languages. Roman Urdu has recently emerged and hence is a resource poor language. Urdu has been widely known as the national language of Pakistan. However, because of socio-cultural and multilingual aspects, Roman Urdu is used widely on the Internet by Asians and more specifically Pakistanis. To fulfil the above stated gap, this research work presents guidelines for data annotation process and developed two linguistic resources: (i) Annotated corpus in Roman Urdu Language for cyberaggression and offensive language detection. The process of data annotation involved bilingual annotators instead of crowdsourcing. It has the benefit of correctly annotating instances that constitute clear cases of cyberbullying without compromising data quality. The developed corpus is highly balanced (with almost negligible skew) unlike most of the existing corpuses even in mature languages. (ii) Processing textual information for NLP tasks involves Stop-word elimination as a sub phase. Stop words carry least semantic information and increase feature space as compared to the other tokens and index terms in corpora. We have developed domain specific stop words for Roman Urdu Language considering all the lexical variants and typically in the context of aggression detection and collected data. The work has been carried out using python programming language and Pycharm IDE.
Read full abstract