Abstract

State-of-the-art (SOTA) neural machine translation (NMT) systems translate texts at sentence level, ignoring context: intra-textual information, like the previous sentence, and extra-textual information, like the gender of the speaker. As a result, some sentences are translated incorrectly. Personalised NMT (PersNMT) and document-level NMT (DocNMT) incorporate this information into the translation process. Both fields are relatively new and previous work within them is limited. Moreover, there are no readily available robust evaluation metrics for them, which makes it difficult to develop better systems, as well as track global progress and compare different methods. This thesis proposal focuses on PersNMT and DocNMT for the domain of dialogue extracted from TV subtitles in five languages: English, Brazilian Portuguese, German, French and Polish. Three main challenges are addressed: (1) incorporating extra-textual information directly into NMT systems; (2) improving the machine translation of cohesion devices; (3) reliable evaluation for PersNMT and DocNMT.

Highlights

  • Neural machine translation (NMT) achieves state-of-the-art (SOTA) results in many domains (Sutskever et al, 2014; Vaswani et al, 2017; Lample et al, 2020), with some authors claiming human parity (Hassan et al, 2018)

  • We present research on Personalised NMT (PersNMT)

  • Many machine translation evaluation (MTE) metrics have been proposed over the years, largely owing to the yearly WMT Metrics task (Mathur et al, 2020)


Summary

Introduction

Neural machine translation (NMT) achieves state-of-the-art (SOTA) results in many domains (Sutskever et al, 2014; Vaswani et al, 2017; Lample et al, 2020), with some authors claiming human parity (Hassan et al, 2018). Traditional methods process texts in short units such as the utterance or sentence, isolating them from the rest of the dialogue or document and ignoring extra-textual information (e.g. who is speaking, who they are talking to). As a result, the meaning or function of a translation hypothesis can differ significantly from the reference, or the translated text can become incohesive or illogical. For example, when translating “I didn’t go.” into Polish, the machine translation (MT) model must guess the gender of the speaker (“I”): Polish past-tense verbs are marked for gender (“Nie poszedłem.” for a male speaker, “Nie poszłam.” for a female speaker), but the English sentence does not express this information. Previous research on cohesion within DocNMT has revealed that verb phrase ellipsis, coreference and reiteration (a type of lexical cohesion) may be translated incorrectly by MT systems (e.g. Tiedemann and Scherrer, 2017; Bawden et al, 2018; Voita et al, 2020).
