Abstract

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Beyond this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, the internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as between semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, the F1 reaches 98.9%.

Highlights

  • End-to-end neural machine translation (NMT) systems emerged in 2013 [1] as a promising alternative to statistical and rule-based systems

  • In this article we provide evidence of the interlingual nature of the context vectors generated by a multilingual neural machine translation system and study their power in the assessment of mono- and cross-language similarity

  • Comparisons with word vectors show that context vectors capture semantics better in both settings

Summary

Introduction

End-to-end neural machine translation (NMT) systems emerged in 2013 [1] as a promising alternative to statistical and rule-based systems. Among the research questions the paper poses is RQ4: how do representations evolve throughout training? These questions are addressed by means of statistics on cosine similarities between pairs of sentences, both in a monolingual and a cross-language setting. The second part of the paper is devoted to an application of the findings gathered in the first part: we explore the use of the “interlingua” representations to extract parallel sentences from comparable corpora. In this context, comparable corpora are text data on the same topic that are not direct translations of each other but may contain fragments that are translation equivalents; e.g., Wikipedia or news articles on the same subject in different languages.
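
Both the similarity analysis and the mining application reduce to comparing fixed-size sentence representations with cosine similarity. The Python sketch below illustrates the idea under simplifying assumptions of ours: context_vector mean-pools the per-token encoder outputs into one vector (the paper works with the encoder's context vectors; the pooling choice here is an assumption), and mine_parallel with its greedy pairing and 0.8 threshold is an illustrative stand-in, not the decision rule evaluated in the paper.

    import numpy as np

    def context_vector(encoder_states):
        # encoder_states: (T, d) array of per-token encoder outputs for one
        # sentence. Mean pooling to a single d-dimensional vector is our
        # simplifying assumption for this sketch.
        return encoder_states.mean(axis=0)

    def cosine(u, v):
        # Cosine similarity, the measure used throughout the paper's analysis.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def mine_parallel(src_vecs, tgt_vecs, threshold=0.8):
        # Greedy mining: pair each source sentence with its most similar
        # target sentence and keep pairs scoring above the threshold. The
        # greedy scheme and the 0.8 value are illustrative, not the paper's.
        pairs = []
        for i, u in enumerate(src_vecs):
            sims = np.array([cosine(u, v) for v in tgt_vecs])
            j = int(sims.argmax())
            if sims[j] >= threshold:
                pairs.append((i, j, float(sims[j])))
        return pairs

    # Toy usage with random arrays standing in for real NMT encoder states.
    rng = np.random.default_rng(0)
    src = [context_vector(rng.normal(size=(10, 512))) for _ in range(5)]
    tgt = [context_vector(rng.normal(size=(12, 512))) for _ in range(5)]
    print(mine_parallel(src, tgt, threshold=0.0))

In the paper, context vectors alone reach F1 = 98.2% on the shared-task data, and combining them with further similarity measures raises this to 98.9%; the exact combination is described in the Use Case section and is not reproduced in this sketch.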

Background
Related Work
NMT Systems Description
Context Vectors in Multilingual NMT Systems
Graphical Analysis
Source vs Source–Target Semantic Representations
Representations throughout Training
Similarity Assessments
Use Case
Findings
Conclusions

