Abstract
Offensive language identification is the task of detecting hurtful tweets, derogatory comments, and swear words on social media. With the rapid growth of social media communication, offensive language detection has received increasing attention in recent years; we address the task for English, Danish, and Greek. We investigate which is more effective: pre-trained BERT (Bidirectional Encoder Representations from Transformers) models or machine learning approaches. Our investigation shows how performance differs across the three languages and identifies which classification algorithms perform best. In the SemEval-2020 shared task, our team SSN_NLP_MLRG submitted runs for three languages: Subtasks A, B, and C in English, Subtask A in Danish, and Subtask A in Greek. We obtained F1 scores of 0.90, 0.61, and 0.52 for English Subtasks A, B, and C, 0.56 for Danish Subtask A, and 0.67 for Greek Subtask A.
Highlights
Offensive language detection is the process of identifying user-generated content or comments that are insulting, hurtful, profane, or racist and that target an individual or group on social media (Marcos Zampieri et al., 2019)
We investigated the results for the three languages and compared the machine learning approaches with the pre-trained BERT models
Offensive Language: Hamdy Mubarak et al. (2020) used lexical features and static and deep contextualized embeddings with Support Vector Machine classifiers to detect Arabic offensive language and to determine the topics, dialects, and genders associated with offensive tweets. Çöltekin (2020) presents a corpus of offensive language in Turkish for the SemEval 2020 shared task. Zampieri et al. (2019) and Thenmozhi D et al. (2019) report on SemEval 2019 Task 6 on the identification and categorization of offensive language in social media. Hui-Po Su et al. (2017) presented rephrasing rules that revise an input sentence by using non-offensive words, evaluated on real-world social websites. Ritesh Kumar et al. (2018) report on the Workshop on Trolling, Aggression and Cyberbullying (TRAC) and present aggression identification in social media
Summary
Offensive language detection is the process of identifying user-generated content or comments that are insulting, hurtful, profane, or racist and that target an individual or group on social media (Marcos Zampieri et al., 2019). The OffensEval@SemEval2019 shared task (Zampieri et al., 2019b) focuses on the identification and categorization of offensive content in social media. The OffensEval@SemEval2020 shared task (Zampieri et al., 2020) extends this to multilingual identification and categorization of offensive content in five languages. Recent research has addressed the identification of offensive content (Zeses Pitenis et al., 2020; Gudbjartur Ingi Sigurbergsson and Leon Derczynski, 2019) and abusive comments (Kenneth Steimel et al., 2019) in different languages. Gudbjartur Ingi Sigurbergsson and Leon Derczynski (2019) used logistic regression and BiLSTM models with different forms of attention to detect and categorize offensive language in annotated English and Danish datasets drawn from Facebook and Reddit. Our team SSN_NLP_MLRG participated in OffensEval@SemEval2020 Task 12 for three languages: English, Greek, and Danish. The task covers multilingual offensive language detection, categorization of offensive language, and target identification.
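The results in this work are reported as F1 scores. As a minimal sketch of how such a score could be computed for a binary offensive/not-offensive task, the example below calculates a macro-averaged F1 in plain Python. The label names `OFF` and `NOT`, the toy data, the helper names, and the choice of macro averaging are assumptions made for illustration; the text above does not state which averaging was used.

```python
def f1_for_label(gold, pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Hypothetical gold labels and system predictions (not from the paper).
gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]

print(macro_f1(gold, pred))  # 0.625
```

Macro averaging weights each class equally, so a system cannot score well simply by predicting the majority class on an imbalanced dataset.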