Abstract
Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation. Two are count-based, where the text is represented using either character $n$-gram counts or word $n$-gram counts as feature vectors; the third is based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier using a combination of word $n$-gram features outperformed the other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
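As a rough illustration of the count-based representations and the two-step approach described in the abstract, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `to_vector`, `two_step`) are hypothetical, whitespace tokenization is an assumption, and the two classifiers are passed in as placeholder callables rather than trained models.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (assumes whitespace tokenization)."""
    tokens = text.split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def to_vector(counts, vocabulary):
    """Turn n-gram counts into a fixed-length feature vector.

    `vocabulary` is an ordered list of n-grams; n-grams that do not
    occur in `counts` get a count of 0.
    """
    return [counts.get(term, 0) for term in vocabulary]


def two_step(tweet, is_threatening, target_of):
    """Apply the two-step scheme with two placeholder classifiers.

    `is_threatening(tweet)` returns True or False (step i);
    `target_of(tweet)` returns "individual" or "group" (step ii) and is
    only consulted for tweets flagged as threatening.
    """
    if not is_threatening(tweet):
        return "non-threatening"
    return target_of(tweet)


# Placeholder text, not taken from the dataset.
bigrams = word_ngrams("to be or not to be", 2)
print(to_vector(bigrams, ["to be", "be or", "or to"]))
```

In practice the feature vectors would be fed to a trained classifier such as an MLP or SVM; the callables here only mark where those models would plug in.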
Highlights
The emergence of the Internet and communication technology has enabled online social networks to become a significant part of our daily lives, as the number of social media users is growing exponentially.
Some users manipulate the Twitter platform to threaten other people and to promote violence by posting threatening content. This has led to a growing body of research investigating the spread of threatening content in social media, among other things by examining threatening language and by attempting to detect this type of content [8,9,10]. Given the distress this can cause online users, furthering research in automatic threatening language identification is of utmost importance for tackling this problem at the scale of a large social media platform like Twitter.
Precision, Recall, and F1 scores are presented for all models: Logistic Regression (LR), Multilayer Perceptron (MLP), AdaBoost, Random Forest (RF), Support Vector Machine
Summary
The emergence of the Internet and communication technology has enabled online social networks to become a significant part of our daily lives, as the number of social media users is growing exponentially. Some users manipulate the Twitter platform to threaten other people and to promote violence by posting threatening content (i.e., content expressing an intent to cause harm to others). This has led to a growing body of research investigating the spread of threatening content in social media, among other things by examining threatening language and by attempting to detect this type of content [8,9,10]. Given the distress this can cause online users, furthering research in automatic threatening language identification is of utmost importance for tackling this problem at the scale of a large social media platform like Twitter.