Abstract

Most neural machine translation models are implemented within a conditional language-model framework composed of encoder and decoder models. This framework learns complex and long-distance dependencies, but its deep structure makes training inefficient. Matching the vector representations of source and target sentences alleviates this inefficiency by shortening the path from the parameters to the cost, and it generalizes NMT from a perspective different from the cross-entropy loss. In this paper, we propose matching methods that derive the cost from constant word-embedding vectors of the source and target sentences. To find the best method, we analyze the impact of the methods with varying structures, distance metrics, and model capacity on a French-to-English translation task. The optimally configured method is then applied to translation tasks between English and French, Spanish, and German. In these tasks, the method improved performance by up to 3.23 BLEU, with an average improvement of 0.71. We also evaluated the robustness of the method to various embedding distributions and to different models, such as conventional gated structures and transformer networks, and the empirical results show that it is likely to improve performance in those models as well.
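
As a rough, non-authoritative illustration of the idea described above (not the paper's exact formulation), the following PyTorch sketch mean-pools constant (frozen) word-embedding vectors into source and target sentence representations and derives an auxiliary matching cost from their distance. The function names, the mean-pooling choice, and the cosine/squared-L2 metrics are assumptions made for illustration; the paper itself compares several structures and distance metrics.

import torch
import torch.nn.functional as F

def sentence_representation(embeddings, mask):
    # Mean-pool constant word embeddings into one sentence vector.
    # embeddings: (batch, seq_len, dim) frozen word-embedding vectors
    # mask:       (batch, seq_len) with 1 for real tokens and 0 for padding
    mask = mask.unsqueeze(-1).float()
    return (embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def matching_cost(src_repr, tgt_repr, metric="cosine"):
    # Distance between source and target sentence representations.
    # Cosine and squared L2 are shown as two plausible metric choices.
    if metric == "cosine":
        return (1.0 - F.cosine_similarity(src_repr, tgt_repr, dim=-1)).mean()
    return F.mse_loss(src_repr, tgt_repr)

# The matching cost would then be combined with the usual cross-entropy loss,
# e.g. loss = cross_entropy + lambda_match * matching_cost(src_repr, tgt_repr),
# where lambda_match is a hypothetical weighting hyperparameter.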

Highlights

  • Most decoders of neural machine translation (NMT) are conditional language models, which sequentially generate target words conditioned on a given source sentence

  • To implement the two ideas as a single neural network added to existing NMTs, we introduce sentence representation matching, where the sentence representation is a concept capturing the semantics of the source or target sentence

  • We raised the issue of inefficiency in training the encoder of NMTs implemented as a conditional language model

Summary

Introduction

Most decoders of neural machine translation (NMT) are conditional language models, which sequentially generate target words conditioned on a given source sentence. Notable research that improved the performance of NMT includes the bidirectional LSTM using both forward and backward sequences [6,7,8], attention models that learn explicit alignments [9,10], rare-word modeling that estimates unknown words with an explicit model and alignment model [11], and augmentation methods that overcome a lack of data [12,13]. These works have been made more rigorous by adopting advanced methods such as batch normalization [14], ensembling, beam search, input feature specialization, and input feeding. The very deep transformer demonstrated higher performance than the vanilla transformer [17], and pre-trained models have demonstrated remarkable performance [18,19,20].
