Translation of Multi-Word Units from Portuguese into French by MT@EC and eTranslation
This paper aims to determine the extent to which the shift from statistical machine translation (SMT) to neural machine translation (NMT) improved the performance of European Union machine translation systems between 2015 and 2021 in terms of multi-word unit translation and domain coverage. To do so, we tested these systems on the machine translation into French of multi-word units expressing quantitative and qualitative progression in European Portuguese. These units consist of the 2-gram ‘cada vez’ and a comparative adjective or adverb (cada vez COMP), and their word-for-word translation into French is not idiomatic (*chaque fois COMP). The most frequent translation into French is ‘de COMP en COMP’. This implies that these multi-word units must be translated ‘en bloc’, but their identification is not straightforward. On the one hand, COMP is not fixed and may include one word (mais / plus, menos / moins, maior / plus grand, menor / plus petit, melhor / meilleur – mieux, pior / pire – plus mal) or several words (mais or menos N, ADJ, ADV). On the other hand, the 2-gram ‘cada vez’ can be part of other multi-word units expressing iteration (de cada vez (que) / (à) chaque fois (que)), or ‘dropper’ ([a certain quantity] de cada vez / à la fois). This raises the challenge of ambiguity, well known to biotranslators and still often problematic for NMT. Moreover, units expressing quantitative or qualitative progression may raise other translation challenges when they are coordinated (with or without repetition of the 2-gram ‘cada vez’), when they are split (cada vez (…) COMP), or when they combine with verbs or nouns to form extended translation units whose translation into French can result in a more concise solution that we refer to as ‘lexicalisation’. We established a biotranslation model based on a manually aligned French-Portuguese parallel literary corpus and online searchable French-Portuguese aligned corpora (translation memories).
We selected a sample of occurrences of these multi-word units that includes several translation challenges. These occurrences were selected from a Portuguese journalistic corpus. They therefore belong to general language, whereas the EU's translation memories cover the domains dealt with by its institutions. This mismatch represents an additional challenge, given the critical importance of domain coverage in the data to NMT performance quality. The selected occurrences were translated into French by the EU SMT system in 2015 (MT@EC) and 2019 (eTranslation Legacy) and by eTranslation (the EU NMT system) in 2019 and 2021. First, MT output was analysed according to two general criteria: ‘non-literality’, that is, translation into French without ‘chaque’, and acceptability from a semantic point of view, that is, MT output without any false meaning, opposite meaning or nonsense. Then we looked at specific challenges, some of which could lead to original solutions worthy of a professional human translator, such as lexicalisation, change of grammatical category or ‘recategorisation’, and ‘naturalisation’, that is, phraseological or syntactic rearrangement that makes the target text more idiomatic. The results show that MT is improving, especially according to the criterion of non-literality. Original solutions are still rare, but they are diversifying in NMT output. Nevertheless, NMT remains imperfect, not least because of the inherent ambiguity of natural languages and the inevitable gaps in the data on which these systems are based. The results also demonstrate the importance of human intervention in the maintenance of systems that learn automatically, since the quality of the SMT system’s output decreased between 2015 and 2019, when all efforts were focused on improving the EU NMT system. Finally, the results reveal the dangers of using English as a pivot language when translating from one Romance language into another, and the need to train future translators in NMT and post-editing.