Abstract

Powerful deep learning approaches free us from feature engineering in many artificial-intelligence tasks. These approaches can extract efficient representations from the input data, provided the data are large enough. Unfortunately, it is not always possible to collect large, high-quality datasets. For tasks in low-resource contexts, such as Russian ⟶ Vietnamese machine translation, insights into the data can compensate for their modest size. In this study of modelling Russian ⟶ Vietnamese translation, we leverage the input Russian words by decomposing them not only into features but also into subfeatures. First, we break down a Russian word into a set of linguistic features: part-of-speech, morphology, dependency labels, and lemma. Second, the lemma feature is further divided into subfeatures labelled with tags corresponding to their positions in the lemma. To stay consistent with the source side, Vietnamese target sentences are represented as sequences of subtokens. Sublemma-based neural machine translation proves itself in our experiments on Russian-Vietnamese bilingual data collected from TED talks. The results reveal that the proposed model outperforms the best available Russian ⟶ Vietnamese model by 0.97 BLEU. In addition, the automatic machine judgment of the results is verified by human judgment. The proposed sublemma-based model provides an alternative to existing models when building translation systems from an inflectionally rich language, such as Russian, Czech, or Bulgarian, in low-resource contexts.
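As a rough illustration of the decomposition described above, the Python sketch below splits an inflected Russian word into linguistic features plus position-tagged sublemma pieces. The fixed piece length, the B/M/E/S tag names, and the example feature values are assumptions made for illustration, not the paper's exact scheme; a real pipeline would obtain the features from a morphological analyser and dependency parser.

```python
# Hypothetical sketch of source-word decomposition into features and sublemmas.
# Piece length and B/M/E/S positional tags are illustrative assumptions.

def decompose(word, pos, morph, dep, lemma, piece_len=3):
    """Represent a Russian word as linguistic features plus sublemma pieces
    tagged with their position in the lemma (B=begin, M=middle, E=end, S=single)."""
    pieces = [lemma[i:i + piece_len] for i in range(0, len(lemma), piece_len)]
    tagged = []
    for i, piece in enumerate(pieces):
        if len(pieces) == 1:
            tag = "S"
        elif i == 0:
            tag = "B"
        elif i == len(pieces) - 1:
            tag = "E"
        else:
            tag = "M"
        tagged.append(f"{piece}/{tag}")
    return {"pos": pos, "morph": morph, "dep": dep, "sublemmas": tagged}

# Example: "книгами" (instrumental plural of the lemma "книга", "book").
print(decompose("книгами", pos="NOUN", morph="Case=Ins|Number=Plur",
                dep="obl", lemma="книга"))
# -> {'pos': 'NOUN', 'morph': 'Case=Ins|Number=Plur', 'dep': 'obl',
#     'sublemmas': ['кни/B', 'га/E']}
```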

Highlights

  • Many neural models have been introduced for machine translation [1,2,3,4,5]

  • In practice, there are many cases of scarce data, such as Russian ⟶ Vietnamese translation tasks. The language pair is low-resource

  • In addition to machine judgment with automatic BLEU scores, we semantically studied a limited number of translation results produced by the two best models: the model with source-word decomposition and the proposed sublemma-based model


Summary

Introduction

Many neural models have been introduced for machine translation [1,2,3,4,5]. Although they have different architectures, they all follow the sequence-to-sequence pattern. The source sequences are processed by the neural models, and the models generate corresponding sequences of target units. The most intuitive representation of source/target units is words. If the bilingual datasets used to train neural machine translation (NMT) models are large enough, the models will be able to learn reliable statistics of source/target words. However, a word can have different forms according to its grammatical role in sentences. This property leads to a high chance that we will meet word forms which do not occur frequently enough in humble-size training datasets.
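To make the sparsity argument concrete, the toy Python snippet below (with made-up data) shows how inflected surface forms of a single lemma are each rare on their own, while the shared lemma is comparatively frequent; this is the gap that lemma- and sublemma-based representations aim to exploit.

```python
from collections import Counter

# Toy, hypothetical corpus sample: (surface form, lemma) pairs for "книга" ("book").
tokens = [
    ("книга", "книга"), ("книги", "книга"), ("книгу", "книга"),
    ("книгами", "книга"), ("книге", "книга"),
]

form_counts = Counter(form for form, _ in tokens)
lemma_counts = Counter(lemma for _, lemma in tokens)

print(form_counts)   # each inflected form occurs only once -> unreliable statistics
print(lemma_counts)  # the shared lemma occurs five times -> much denser statistics
```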

