Abstract
Machine translation (MT) is widely used to translate content on social media platforms with the aim of improving accessibility. Much of the content circulated on social media is user-generated and often contains non-standard spelling, hashtags, and emojis that pose challenges to MT systems. This leads to many mistranslated instances being presented to users of these platforms, hindering their understanding of content written in other languages. In this paper, we investigate the impact of MT on offensive language identification. We posit that MT and potential mistranslations have an important and largely under-explored impact on social media tasks such as sentiment analysis and offensive language identification. We create MT-Offense, a novel dataset containing English originals and their translations into Arabic, Hindi, Marathi, Sinhala, and Spanish produced by multiple open-access Neural Machine Translation systems. We evaluate the performance of various offensive language identification models on both original and MT content across different training and test set combinations, reporting F1 scores. Our results show that (1) offensive language identification models perform better on original data than on MT data, and (2) using MT data in training helps models identify offensive language in MT content better than models trained exclusively on original data.