Abstract
Large text corpora used for creating word embeddings (vectors that represent word meanings) often contain stereotypical gender biases. As a result, such unwanted biases will typically also be present in word embeddings derived from these corpora, and in downstream applications in the field of natural language processing (NLP). To minimize the effect of gender bias in these settings, more insight is needed into where and how biases manifest themselves in the text corpora employed. This paper contributes by showing how gender bias in word embeddings from Wikipedia has developed over time. Quantifying the gender bias over time shows that art-related words have become more female biased. Family and science words carry stereotypical biases toward female and male words, respectively. These biases seem to have decreased since 2006, but the changes are not more extreme than those seen in random sets of words. Career-related words are more strongly associated with male than with female words; this difference has only become smaller in recently written articles. These developments provide additional understanding of what can be done to make Wikipedia more gender neutral, and of how important the time of writing can be when considering biases in word embeddings trained on Wikipedia or other text corpora.
Highlights
Word embeddings are vectors that represent the meaning of words and the relations between them
The box plots indicate the distribution of Word Embedding Association Test (WEAT) scores for random words, which changes little over time and whose mean seems close to zero, indicating that random words are almost unbiased on average
Arts and Family seem to have strong biases since they fall outside the box plots, while biases in Science seem milder, as its WEAT score is comparable to those of random sets of words
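The WEAT scores mentioned above compare how strongly two sets of target words (e.g. career vs. family terms) associate with two sets of attribute words (e.g. male vs. female terms) in embedding space. A minimal sketch of the standard WEAT effect size is shown below; the toy 2-d vectors are illustrative assumptions, not data from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    # s(w, A, B): mean similarity of word vector w to attribute set A
    # minus its mean similarity to attribute set B
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    # Normalized difference of mean associations between target sets X and Y
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Toy vectors (hypothetical): male/female attribute sets, career/family targets
male   = [np.array([1.0, 0.1]), np.array([0.9, 0.0])]
female = [np.array([0.1, 1.0]), np.array([0.0, 0.9])]
career = [np.array([0.8, 0.2]), np.array([0.95, 0.1])]
family = [np.array([0.2, 0.8]), np.array([0.1, 0.95])]

d = weat_effect_size(career, family, male, female)
```

With these toy vectors, career words lie closer to the male attribute vectors, so the effect size `d` comes out positive; a score near zero, as for the random word sets in the box plots, would indicate no measurable bias.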
Summary
Word embeddings are vectors that represent the meaning of words and the relations between them. Word embeddings can be used to search in documents, to analyze sentiment, and to classify documents [Mikolov et al., 2013a; Nalisnick et al., 2016; Parikh et al., 2018; Jang et al., 2019]. These embeddings are typically created using unsupervised learning from a large corpus of text [Krishna and Sharada, 2019]. Changes in word embeddings can be useful for detecting minor changes in the meaning of words at small time scales [Kutuzov et al., 2018].
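The idea that embeddings capture relations between words can be seen by comparing cosine similarities: semantically related words get more similar vectors than unrelated ones. A minimal sketch with hypothetical 3-d vectors (real models use hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings, chosen so that related words point in
# similar directions; these are assumptions for illustration only
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.1, 0.9]),
}

sim_related = cosine(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine(embeddings["king"], embeddings["apple"])
```

Here `sim_related` exceeds `sim_unrelated`, which is the property that both downstream applications and bias measures such as WEAT exploit.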