Abstract

This study addresses the urgent issue of toxic language on social media, focusing on the detection of toxic comments on popular Italian Facebook pages. We build on the LiLaH project, which provides a standardized framework for analyzing hateful content in multiple languages, including Dutch, English, French, Slovene, and Croatian. We first examine the linguistic features of Italian toxic language on social media. Our analysis reveals that toxic comments in Italian tend to be longer and to contain fewer unique emojis than non-toxic comments, while both exhibit similar lexical diversity. To evaluate the impact of these linguistic features on state-of-the-art models’ performance, we fine-tune three pre-trained language models (PoliBERT, UmBERTo, and bert-base-italian-xxl-uncased). Despite their significant correlation with comment toxicity, adding the linguistic features worsens the best model’s performance.
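To make the feature analysis concrete, the sketch below shows one plausible way to compute the per-comment linguistic features named in the abstract (length, unique-emoji count, and lexical diversity as a type-token ratio). It is an illustrative assumption, not the paper's actual code: the function and field names (`extract_features`, `CommentFeatures`) and the emoji regex are ours, and a real pipeline might use a dedicated emoji library and a proper tokenizer instead.

```python
# Minimal sketch (assumed implementation, not the authors' code) of the
# per-comment linguistic features discussed in the abstract.
import re
from dataclasses import dataclass

# Broad emoji ranges; a production system would likely use the `emoji`
# package, but this keeps the sketch dependency-free.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]",
    flags=re.UNICODE,
)

@dataclass
class CommentFeatures:
    n_tokens: int            # comment length in whitespace tokens
    n_unique_emojis: int     # number of distinct emojis in the comment
    type_token_ratio: float  # lexical diversity: unique tokens / total tokens

def extract_features(text: str) -> CommentFeatures:
    tokens = text.split()
    unique_emojis = set(EMOJI_PATTERN.findall(text))
    ttr = len({t.lower() for t in tokens}) / len(tokens) if tokens else 0.0
    return CommentFeatures(
        n_tokens=len(tokens),
        n_unique_emojis=len(unique_emojis),
        type_token_ratio=ttr,
    )

if __name__ == "__main__":
    # Example Italian comment with repeated emojis.
    print(extract_features("Che vergogna!!! 😡😡🤬 andate via"))
```

Feature vectors like these could then be concatenated to the transformer's pooled representation before the classification head, which is the kind of feature-injection setup the abstract reports as hurting, rather than helping, the best model.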
