Abstract
In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University to the WMT 2019 news translation shared task for the English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using recurrent neural network (RNN) based neural machine translation (NMT) systems with an attention mechanism. Because the two languages belong to different language families and little parallel data is available for this language pair, building a high-quality NMT system is a difficult task. We produced synthetic data through back-translation of the available monolingual data. We report the translation quality of our English–Gujarati and Gujarati–English NMT systems trained at the word, byte-pair and character encoding levels, where the word-level RNN is considered the baseline and used for comparison. Our English–Gujarati system ranked second in the shared task.
Highlights
Synthetic data was produced through back-translation to increase the size of the parallel training dataset
We describe the joint participation of Jadavpur University and Saarland University in the WMT 2019 news translation task for English–Gujarati and Gujarati–English
The released training data set differs completely in domain from the development set, and its size is far below the amount of training data typically required for neural machine translation (NMT) systems to succeed
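The back-translation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a reverse-direction translation function is already available; `reverse_translate`, `back_translate` and `build_training_set` are hypothetical names used only for illustration, and the source does not say how the synthetic pairs were filtered or mixed with the released parallel data.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Create synthetic (source, target) pairs from target-side monolingual text.

    `reverse_translate` is a hypothetical stand-in for a target-to-source
    model (for example Gujarati-to-English when training English-to-Gujarati).
    """
    pairs: List[SentencePair] = []
    for tgt in target_monolingual:
        tgt = tgt.strip()
        if not tgt:
            continue  # skip empty lines
        synthetic_src = reverse_translate(tgt).strip()
        if synthetic_src:
            # The human-written sentence stays on the target side;
            # only the source side is machine-generated.
            pairs.append((synthetic_src, tgt))
    return pairs


def build_training_set(
    genuine: Iterable[SentencePair],
    synthetic: Iterable[SentencePair],
) -> List[SentencePair]:
    """Concatenate genuine and synthetic pairs (an assumed, simplest mixing policy)."""
    return list(genuine) + list(synthetic)
```

In practice `reverse_translate` would wrap a trained model in the opposite direction; any callable that maps a string to a string works with this sketch.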
Summary
Dungarwal et al. (2014) developed a statistical approach to machine translation that used a phrase-based method for the Hindi-English SMT system and a factored method for the English-Hindi SMT system. They showed improvements over existing SMT systems by using pre-processing and post-processing components that generated morphological inflections correctly. Ramesh and Sankaranarayanan (2018) demonstrated how an existing model such as a bidirectional recurrent neural network can be used to generate parallel sentences for low-resource language pairs such as English-Tamil and English-Hindi, in order to improve SMT and NMT systems. Choudhary et al. (2018) showed how to build an NMT system for a low-resource language pair such as English-Tamil, using techniques such as word embeddings and Byte-Pair Encoding (Sennrich et al., 2016b) to handle out-of-vocabulary words.
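To make the byte-pair encoding idea concrete, the following is a simplified sketch of the merge-learning procedure in the style of Sennrich et al. (2016b). It is not the implementation used in any of the cited systems; the function names and the toy word frequencies are assumptions chosen for illustration.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word-frequency dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = tuple(merge_symbols(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying merges in learned order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols
```

With the toy frequencies `{"low": 5, "lower": 2, "newest": 6, "widest": 3}` and three merges, the unseen word `lowest` is segmented into known pieces rather than mapped to an unknown token, which is how subword units reduce the out-of-vocabulary problem.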