Abstract

Word embedding models have been shown to be effective in performing a wide variety of Natural Language Processing (NLP) tasks such as identifying audiences for web advertisements, parsing resumes to select promising job candidates, and translating documents from one language to another. However, it has been demonstrated that NLP systems learn gender bias from the corpora of documents on which they are trained. It is increasingly common for pre-trained models to be used as a starting point for building applications in a wide range of areas, including critical decision-making applications. It is also very easy to use a pre-trained model as the basis for a new application without careful consideration of the original nature of the training set. In this paper, we quantify the degree to which gender bias differs with the corpora used for training. We look especially at the impact of starting with a pre-trained model and fine-tuning with additional data. Specifically, we calculate a measure of direct gender bias on several pre-trained models including BERT’s Wikipedia and Book corpus models as well as on several fine-tuned General Language Understanding Evaluation (GLUE) benchmarks. In addition, we evaluate the bias from several more extreme corpora, including the Jigsaw identity toxic dataset, which includes toxic speech biased against race, gender, religion, and disability, and the RtGender dataset, which includes speech specifically labelled by gender. Our results reveal that the direct gender bias of the Jigsaw identity toxic dataset is surprisingly close to that of the base pre-trained Google model, but the RtGender dataset has significantly higher direct gender bias than the base model. Because the bias learned by an NLP system can vary significantly with the corpora used for training, it is important to consider and report these details, especially for use in critical decision-making applications.
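
The abstract does not spell out how the direct gender bias measure is computed. As a rough illustration only, the sketch below assumes the commonly used definition from Bolukbasi et al. (2016): the mean of |cos(w, g)|^c over a set of gender-neutral word vectors w, where g is a gender direction and c controls strictness. The toy two-dimensional vectors, the single he/she pair used to form g (the original formulation derives g from several definitional pairs via PCA), and the function names are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def direct_bias(neutral_vectors, gender_direction, c=1.0):
    """Mean of |cos(w, g)|**c over the gender-neutral word vectors.

    Assumed definition (Bolukbasi et al., 2016); values near 0 mean the
    neutral words are close to orthogonal to the gender direction.
    """
    total = sum(abs(cosine(w, gender_direction)) ** c for w in neutral_vectors)
    return total / len(neutral_vectors)


# Toy, hand-made embeddings (hypothetical values, for illustration only).
he = [1.0, 0.0]
she = [-1.0, 0.0]

# Simplification: a single difference vector stands in for the gender direction.
g = [a - b for a, b in zip(he, she)]

neutral = {
    "word_a": [0.0, 1.0],  # orthogonal to g, contributes 0
    "word_b": [1.0, 1.0],  # 45 degrees from g, contributes about 0.707
}

score = direct_bias(list(neutral.values()), g)
print(round(score, 4))
```

In practice the vectors would come from the embedding layer of each pre-trained or fine-tuned model, so that the same word list can be scored against models trained on different corpora.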
