Abstract
Large-scale automated content moderation on major social media platforms continues to be highly controversial. Moderation and curation are central to the value propositions that platforms provide, but companies have struggled to convincingly demonstrate that their automated systems are fair and effective. For a long time, the limitations of automated content classifiers in dealing with borderline cases have seemed intractable. With the recent expansion in the capabilities and availability of large language models, however, there is reason to suspect that more nuanced automated assessment of content in context may now be possible. In this paper, we set out to understand how the emergence of generative AI tools might transform industrial content moderation practices. We investigate whether the current generation of pre-trained foundation models may expand the established boundaries of the types of tasks that are considered amenable to automation in content moderation. This paper presents the results of a pilot study into the potential use of GPT-4 for content moderation. We use the hate speech decisions of Meta’s Oversight Board as examples of covert hate speech and counterspeech that have proven difficult for existing automated tools. Our preliminary results suggest that, given a generic prompt and Meta’s hate speech policies, GPT-4 can approximate the decisions and accompanying explanations of the Oversight Board in almost all current cases. We interrogate several clear challenges and limitations, including, in particular, sensitivity to variations in prompting, options for validating answers, and generalisability to examples with unseen content.
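To make the setup described above concrete, the sketch below illustrates one plausible way of prompting GPT-4 with a hate speech policy and a piece of content to obtain a moderation decision and explanation. It is a minimal sketch, not the study's actual pipeline: it assumes the OpenAI chat completions API, and the prompt wording, policy text, and variable names (`meta_hate_speech_policy`, `post_text`) are illustrative placeholders.

```python
# Minimal sketch of policy-grounded prompting for a moderation decision.
# Assumes the OpenAI Python SDK (>=1.0); inputs are placeholders, not the
# prompts or policy text used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

meta_hate_speech_policy = "..."  # placeholder: relevant Community Standards text
post_text = "..."                # placeholder: the content under review

prompt = (
    "You are a content moderator. Apply the following hate speech policy "
    "to the post below. State whether the post violates the policy and "
    "explain your reasoning.\n\n"
    f"POLICY:\n{meta_hate_speech_policy}\n\n"
    f"POST:\n{post_text}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # reduce run-to-run variation in the decision
)

print(response.choices[0].message.content)
```

Comparing such outputs against the Oversight Board's published decisions and reasoning is the kind of evaluation the pilot study describes; the sensitivity of results to prompt variations noted above applies directly to the wording chosen here.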