Abstract

Publicly available, off-the-shelf word embeddings that are often used in production applications for natural language processing have been shown to be biased. We have previously shown that this bias can take different forms, depending on the language and the cultural context. In this work, we extend our previous work and further investigate how bias varies across languages. We examine Italian and Swedish word embeddings for gender and origin bias, and demonstrate that an origin bias concerning local migration groups in Switzerland is present in German word embeddings. We propose BiasWords, a method to automatically detect new forms of bias. Finally, we discuss how cultural and language aspects affect the impact of bias on applications and the choice of potential mitigation measures.

Highlights

  • Algorithms and data-based applications are highly sensitive to bias in the underlying training data and can pose risks of discrimination for different groups in society

  • We confirmed bias in the WEAT5 experiment for both Italian and Swedish word embeddings

  • We investigated bias detection for regional languages, in particular Romansh

Introduction

Algorithms and data-based applications are highly sensitive to bias in the underlying training data and can pose risks of discrimination for different groups in society. Learning from existing real-world data and making decisions based on such data amplifies cultural stereotypes (Barocas and Selbst, 2016) due to historical bias encoded in the data. It is therefore relevant to provide metrics to measure and mitigate bias in training data (Sun et al., 2019).
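
To make the idea of such a metric concrete, below is a minimal sketch of a WEAT-style effect size over word embeddings, the kind of association test referred to as WEAT5 in the highlights (Caliskan et al., 2017). The embedding lookup `emb` and the example word sets are illustrative placeholders, not the actual experimental setup or the BiasWords method proposed in this work; any dict-like mapping from word to vector (e.g. loaded from fastText or word2vec) would do.

```python
# Sketch of the WEAT effect size: how much more strongly target set X is
# associated with attribute set A than target set Y is, relative to B.
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    # s(w, A, B): mean cosine similarity of w to A minus its mean similarity to B
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # Cohen's-d-style effect size over the two target word sets
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Hypothetical usage: gender-bias test with placeholder word sets
# d = weat_effect_size(["career", "salary"], ["home", "family"],
#                      ["he", "man"], ["she", "woman"], emb)
```

A large positive effect size indicates that the first target set is more closely associated with the first attribute set in the embedding space, which is the signal such bias tests look for.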
