Abstract

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Beyond this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, the internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as between semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, the F1 reaches 98.9%.

Highlights

  • End-to-end neural machine translation (NMT) systems emerged in 2013 [1] as a promising alternative to statistical and rule-based systems

  • In this article we provide evidence of the interlingual nature of the context vectors generated by a multilingual neural machine translation system and study their power in the assessment of mono- and cross-language similarity

  • Comparisons with word vectors show that context vectors capture semantics better in both settings

Summary

Introduction

End-to-end neural machine translation (NMT) systems emerged in 2013 [1] as a promising alternative to statistical and rule-based systems. Among the research questions the paper poses is RQ4: how do representations evolve throughout training? These questions are addressed by means of statistics on cosine similarities between pairs of sentences, both in a monolingual and a cross-language setting. The second part of the paper is devoted to an application of the findings gathered in the first part: we explore the use of the “interlingua” representations to extract parallel sentences from comparable corpora. In this context, comparable corpora are text data on the same topic that are not direct translations of each other but may contain fragments that are translation equivalents; e.g., Wikipedia or news articles on the same subject in different languages.
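
Both the similarity analysis and the mining application reduce to comparing fixed-size sentence representations with cosine similarity. The Python sketch below illustrates the idea under simplifying assumptions of ours: context_vector mean-pools the per-token encoder outputs into one vector (the paper works with the encoder's context vectors; the pooling choice here is an assumption), and mine_parallel with its greedy pairing and 0.8 threshold is an illustrative stand-in, not the decision rule evaluated in the paper.

    import numpy as np

    def context_vector(encoder_states):
        # encoder_states: (T, d) array of per-token encoder outputs for one
        # sentence. Mean pooling to a single d-dimensional vector is our
        # simplifying assumption for this sketch.
        return encoder_states.mean(axis=0)

    def cosine(u, v):
        # Cosine similarity, the measure used throughout the paper's analysis.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def mine_parallel(src_vecs, tgt_vecs, threshold=0.8):
        # Greedy mining: pair each source sentence with its most similar
        # target sentence and keep pairs scoring above the threshold. The
        # greedy scheme and the 0.8 value are illustrative, not the paper's.
        pairs = []
        for i, u in enumerate(src_vecs):
            sims = np.array([cosine(u, v) for v in tgt_vecs])
            j = int(sims.argmax())
            if sims[j] >= threshold:
                pairs.append((i, j, float(sims[j])))
        return pairs

    # Toy usage with random arrays standing in for real NMT encoder states.
    rng = np.random.default_rng(0)
    src = [context_vector(rng.normal(size=(10, 512))) for _ in range(5)]
    tgt = [context_vector(rng.normal(size=(12, 512))) for _ in range(5)]
    print(mine_parallel(src, tgt, threshold=0.0))

In the paper, context vectors alone reach F1 = 98.2% on the shared-task data, and combining them with further similarity measures raises this to 98.9%; the exact combination is described in the Use Case section and is not reproduced in this sketch.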

Background
Related Work
NMT Systems Description
Context Vectors in Multilingual NMT Systems
Graphical Analysis
Source vs Source–Target Semantic Representations
Representations throughout Training
Similarity Assessments
Use Case
Findings
Conclusions

