Abstract

Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, GPT-2 and GPT-3. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many natural language processing tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerated and biased behaviour. While this is well established, we show here that recent LMs also contain human-like biases of what is right and wrong to do, reflecting existing ethical and moral norms of society. We show that these norms can be captured geometrically by a ‘moral direction’ which can be computed, for example, by a PCA, in the embedding space. The computed ‘moral direction’ can rate the normativity (or non-normativity) of arbitrary phrases without explicitly training the LM for this task, reflecting social norms well. We demonstrate that computing the ’moral direction’ can provide a path for attenuating or even preventing toxic degeneration in LMs, showcasing this capability on the RealToxicityPrompts testbed. Large language models identify patterns in the relations between words and capture their relations in an embedding space. Schramowski and colleagues show that a direction in this space can be identified that separates ‘right’ and ‘wrong’ actions as judged by human survey participants.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.