Abstract

Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gender in MT and investigate how bias amplification might affect language in a broader sense. We hypothesize that ‘algorithmic bias’, i.e. an exacerbation of frequently observed patterns in combination with a loss of less frequent ones, not only exacerbates societal biases present in current datasets but could also lead to an artificially impoverished language: ‘machine translationese’. We assess the linguistic richness (on a lexical and morphological level) of translations created by different data-driven MT paradigms: phrase-based statistical MT (PB-SMT) and neural MT (NMT). Our experiments show that there is a loss of lexical and morphological richness in the translations produced by all investigated MT paradigms for two language pairs (EN-FR and EN-ES).

Highlights

  • The idea of translation entailing a transformation is widely recognised in the field of Translation Studies (TS) (Ippolito, 2014)

  • We use the following metrics: an adapted version of the Lexical Frequency Profile (LFP); three standard metrics commonly used to assess diversity, namely type/token ratio (TTR), Yule’s I and the measure of textual lexical diversity (MTLD); and three new metrics based on synonym frequency in translations

  • We explore the effects of machine translation (MT) algorithms on the richness and complexity of language
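
To make two of the standard diversity metrics named above concrete, here is a minimal Python sketch of type/token ratio and Yule's I. It is an illustration under stated assumptions, not the authors' implementation: whitespace tokenization, the function names, and the two toy sentences are all hypothetical, and Yule's I is computed in one common formulation, N^2 / (M2 - N), where N is the token count and M2 is the sum of squared type frequencies. The source does not state which variant it uses. MTLD is omitted for brevity.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word forms (types) divided by total tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def yules_i(tokens):
    """Yule's I in one common formulation: N^2 / (M2 - N).

    N is the number of tokens and M2 is the sum of squared type
    frequencies. Higher values indicate a more diverse vocabulary.
    This variant is an assumption; other formulations exist.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    m2 = sum(freq * freq for freq in Counter(tokens).values())
    if m2 == n:
        # Every token occurs exactly once, so the denominator is zero.
        return float("inf")
    return n * n / (m2 - n)


# Hypothetical toy inputs of equal length (10 tokens each).
rich = "the cat sat on a mat while the dog slept".split()
poor = "the cat sat on the mat and the cat sat".split()

for name, tokens in (("rich", rich), ("poor", poor)):
    print(name, type_token_ratio(tokens), yules_i(tokens))
```

On these toy inputs the more repetitive sentence scores lower on both metrics, which is the direction of change the highlights associate with lexical impoverishment. TTR is sensitive to text length, so comparisons are only meaningful between texts of similar size.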

Summary

Introduction

The idea of translation entailing a transformation is widely recognised in the field of Translation Studies (TS) (Ippolito, 2014). Some of the features that characterize translated texts are defined as simplification, explicitation, normalization and leveling out (Baker, 1999). Empirical evidence of the existence of translationese can be found in studies showing that machine learning techniques can be employed to automatically distinguish between human-translated and original text by looking at lexical and grammatical information (Baroni and Bernardini, 2006; Koppel and Ordan, 2011). Translationese differs from original texts due to a combination of factors, including intentional ones (e.g. explicitation and normalization) and unintentional ones (e.g. unconscious effects of the source language input on the target language produced). Unlike other work on (human) translationese (or even related work on ‘Post-editese’), we delve into the effects of machine translation (MT) algorithms on language, i.e. ‘machine translationese’.
