Abstract

This paper presents two contributions: an efficient "pseudo" transfer-learning experiment that exploits a large monolingual (mono-script) dataset in a sequence-to-sequence LSTM model, and the application of the proposed methodology to improve Roman Urdu transliteration. The approach "echoes" the monolingual dataset: in the pre-training phase, the input and output sequences are identical, so the model learns the target language. This pre-training gives the LSTM model target-language-based initial weights before the network is trained on the original parallel data. The method reduces the need for additional training data or computational resources, which are often unavailable to many research groups. The experiment is performed for character-based transliteration from Romanized Urdu script to standard (i.e., modified Perso-Arabic) Urdu script. First, a sequence-to-sequence encoder-decoder model is trained (echoed) for 100 epochs on 306.9K distinct words in standard Urdu script. The trained model, with its weights tuned by echoing, is then used to learn the transliteration; at this stage, the parallel corpus comprises 127K pairs of Roman Urdu and standard Urdu tokens. The proposed methodology achieves a BLEU score of 80.1% after 100 epochs of training on the parallel data (preceded by 100 epochs of echoing the mono-script data), whereas a baseline model trained solely on the parallel corpus yields ≈76% BLEU after 200 epochs.
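The sketch below illustrates the "echoing" idea in a character-level encoder-decoder LSTM: the same model is first trained as an autoencoder on monolingual Urdu-script words (input and target identical), and the resulting weights are then fine-tuned on Roman Urdu to Urdu parallel pairs. This is not the authors' code; the vocabulary size, embedding and hidden dimensions, epoch counts, and the placeholder arrays are illustrative assumptions only.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB = 64          # assumed shared character vocabulary size
EMB, HID = 64, 256  # assumed embedding / LSTM hidden sizes
MAXLEN = 20         # assumed maximum word length in characters

# Encoder: reads a character sequence and returns its final LSTM states.
enc_in = layers.Input(shape=(MAXLEN,), name="encoder_chars")
enc_emb = layers.Embedding(VOCAB, EMB, mask_zero=True)(enc_in)
_, h, c = layers.LSTM(HID, return_state=True)(enc_emb)

# Decoder: generates the output characters from the encoder states.
dec_in = layers.Input(shape=(MAXLEN,), name="decoder_chars")
dec_emb = layers.Embedding(VOCAB, EMB, mask_zero=True)(dec_in)
dec_seq, _, _ = layers.LSTM(HID, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=[h, c])
dec_out = layers.Dense(VOCAB, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], dec_out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

def shift_right(seq, start_id=1):
    """Teacher-forcing input: prepend a start token, drop the last character."""
    out = np.full_like(seq, start_id)
    out[:, 1:] = seq[:, :-1]
    return out

# Phase 1 ("echoing"): input and target are the same Urdu-script word, so the
# weights are initialised to reconstruct the target script.
# (Paper: ~100 epochs on 306.9K distinct Urdu words; placeholder data here.)
urdu_words = np.random.randint(1, VOCAB, size=(1000, MAXLEN))
model.fit([urdu_words, shift_right(urdu_words)], urdu_words[..., None],
          epochs=1, batch_size=64)

# Phase 2: fine-tune the same pre-initialised model on parallel pairs,
# Roman Urdu characters in, standard Urdu characters out.
# (Paper: ~100 epochs on 127K token pairs; placeholder data here.)
roman_words = np.random.randint(1, VOCAB, size=(1000, MAXLEN))
urdu_targets = np.random.randint(1, VOCAB, size=(1000, MAXLEN))
model.fit([roman_words, shift_right(urdu_targets)], urdu_targets[..., None],
          epochs=1, batch_size=64)
```

Because both phases reuse the same weights, the echo phase acts purely as a target-script-aware initialisation; the baseline described in the abstract simply skips Phase 1.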
